Ensemble Learning Methods
Jay Hyer
3 sides to every story - yours, mine, and the truth.
math - arbitrarily many sides to the story, and then there's the truth
D - all possible observations - "the story," the whole world
D' - what we observed - the observed data
f - the target, or ground-truth, function
H - the hypothesis space; narrow this down to fit the phenomenon we want to predict - hopefully the hypothesis space overlaps the observed data
h_svm - a single hypothesis, e.g. a statistical model
trying to narrow down to a single function
build several candidate models, then pick the one we feel fits the training data best
Boosting = weighted average, learning the weights
Bagging = majority vote
Stacking = meta-learning
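The three combination rules above can be sketched in a few lines. The base-model outputs and weights below are invented for illustration; they stand in for whatever the trained models would produce.

```python
from collections import Counter

# Majority vote (bagging-style): each model casts one vote.
def majority_vote(labels):
    return Counter(labels).most_common(1)[0][0]

# Weighted average (boosting-style): combine scores with learned weights.
def weighted_average(scores, weights):
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

# Stacking: a meta-learner maps base predictions to a final output;
# here the "meta-learner" is just a hypothetical learned linear rule.
def stacked(preds, meta_coefs, meta_bias=0.0):
    return sum(p * c for p, c in zip(preds, meta_coefs)) + meta_bias

print(majority_vote(["cat", "dog", "cat"]))          # cat
print(weighted_average([0.2, 0.8, 0.6], [1, 2, 1]))  # 0.6
```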
how does our model fit data that we find in the future?
ensemble learning takes many different guesses - produce many models - increase coverage over the hypothesis set
reduce variance by averaging
can't make chicken soup out of chicken litter
perhaps your model is biased in a certain way
overkill analytics - combining models without understanding their inherent biases
the secret to ensemble learning is diversity:
- augment the input data (boosting)
- resampling (bagging)
- different input variables (random forests)
training error - "training error is for training wheels"
learning algorithm produces a learner.
"Weak learner" - slightly better than random, simple to produce; "strong learner" - closely predicts the target function, can be expensive in terms of CPU and feature engineering
overfitting - like adding more polynomial terms
take many weak learners and combine them to create a strong learner
BOOSTING. Ex: the AdaBoost algorithm (assume a classification problem):
1. a weight value for every observation; initial values are equal
2. exaggerate the outliers - bias the input toward the points the learner got wrong
originally developed for binary classification
decision tree - yes/no - e.g. Kaggle's Titanic dataset
- the outliers take a lot of attention - boosting is sensitive to outliers and to class imbalance
- overlearning - you teach to the test, not to the subject - the more complicated your model gets, the better it fits the training data, but the test error curve turns parabolic
- a weak learner can't overfit a dataset; it overfits a data point by exaggerating the outliers
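The two boosting steps above (equal initial weights, then up-weighting the misclassified points) can be sketched as a minimal AdaBoost round. The 1-D toy data and threshold "stumps" below are assumptions for illustration, not from the notes.

```python
import math

# Toy 1-D binary data with labels in {+1, -1} (invented for the sketch).
X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1, 1, 1, -1, -1, -1]

def stump(threshold, sign):
    # Weak learner: predicts `sign` if x < threshold, else -sign.
    return lambda x: sign if x < threshold else -sign

def best_stump(w):
    # Pick the threshold/sign pair with the lowest weighted error.
    best, best_err = None, float("inf")
    for t in X:
        for s in (1, -1):
            h = stump(t, s)
            err = sum(wi for xi, yi, wi in zip(X, y, w) if h(xi) != yi)
            if err < best_err:
                best, best_err = h, err
    return best, best_err

w = [1.0 / len(X)] * len(X)  # step 1: equal initial weights
learners = []
for _ in range(3):
    h, err = best_stump(w)
    alpha = 0.5 * math.log((1 - err) / max(err, 1e-10))  # learner weight
    learners.append((alpha, h))
    # step 2: exaggerate the points this stump got wrong
    w = [wi * math.exp(-alpha * yi * h(xi)) for xi, yi, wi in zip(X, y, w)]
    z = sum(w)
    w = [wi / z for wi in w]

def predict(x):
    # Final model: weighted vote of the weak learners.
    return 1 if sum(a * h(x) for a, h in learners) >= 0 else -1

print([predict(x) for x in X])
```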
BAGGING: bootstrapping and aggregating
bootstrapping - take a sample of the observed data with replacement - certain observations are repeated, certain observations are left out - draw arbitrarily many bootstrap samples, then aggregate them
iterative: (train the algorithm on a bootstrap sample) m times - then vote
lots of overlap between the datasets; unlike AdaBoost, one outcome doesn't affect the next
out-of-bag error - evaluate on what didn't get picked, like wndchrm -n100
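The bagging loop and the out-of-bag idea can be sketched together. The toy data and the stand-in 1-nearest-neighbor "learner" below are assumptions, just to keep the resampling loop visible.

```python
import random
from collections import Counter

random.seed(0)

# Toy labeled data (invented): x > 5 means class 1.
data = [(x, 1 if x > 5 else 0) for x in range(11)]

def bootstrap(d):
    # Sample with replacement: some rows repeat, others are left out.
    return [random.choice(d) for _ in range(len(d))]

def train_1nn(sample):
    # Stand-in learner: predict the label of the nearest sampled point.
    def h(x):
        return min(sample, key=lambda row: abs(row[0] - x))[1]
    return h

models, samples = [], []
for _ in range(25):  # m = 25 bootstrap rounds
    s = bootstrap(data)
    samples.append(s)
    models.append(train_1nn(s))

def vote(x):
    # Aggregate: majority vote across all bagged models.
    return Counter(m(x) for m in models).most_common(1)[0][0]

# Out-of-bag error: score each row only with the models whose
# bootstrap sample did NOT include it.
errors = 0
for row in data:
    oob = [m for m, s in zip(models, samples) if row not in s]
    if oob:
        pred = Counter(m(row[0]) for m in oob).most_common(1)[0][0]
        errors += pred != row[1]
print(errors / len(data))
```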
random forests - good for binary, multiclass, or even regression; an observation runs through the tree and ends at a leaf
STACKING - beyond averaging
how do you combine the work of different teams working independently? how do you combine different learning algorithms?
1. a set of base learners - don't change the input data by resampling
2. see Hastie et al. 2009, chapter 9; Zhou 2012, chapter 2
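The stacking idea above can be sketched as base-learner predictions feeding a meta-learner. The held-out predictions and the toy averaging "meta-learner" below are invented stand-ins; a real meta-learner (e.g. logistic regression) would learn its weights from these predictions.

```python
# Hypothetical held-out predictions from two different base learners,
# plus the true labels for the same rows (all values invented).
base1 = [0.9, 0.8, 0.2, 0.1]
base2 = [0.7, 0.9, 0.4, 0.3]
truth = [1, 1, 0, 0]

# Toy meta-learner: average the base outputs and threshold at 0.5.
def meta(p1, p2):
    return 1 if (p1 + p2) / 2 >= 0.5 else 0

preds = [meta(a, b) for a, b in zip(base1, base2)]
print(preds)  # [1, 1, 0, 0]
```

The key design point is that the meta-learner must be fit on held-out predictions, not on the base learners' training rows, or it just learns to trust whichever base learner overfit the most.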
OUTLIERS
how often did a pair of observations end up at the same leaf in the tree - a measure of proximity
what's an outlier within my class? - class imbalance
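The proximity measure above can be sketched directly: count how often each pair of observations lands in the same leaf across the trees. The leaf assignments below are made up; a real forest would produce them.

```python
# leaf_of[t][i] = leaf index of observation i in tree t (invented values).
leaf_of = [
    [0, 0, 1, 1],  # tree 1
    [0, 0, 1, 0],  # tree 2
    [2, 2, 2, 1],  # tree 3
]

def proximity(i, j):
    # Fraction of trees in which observations i and j share a leaf.
    same = sum(tree[i] == tree[j] for tree in leaf_of)
    return same / len(leaf_of)

# Low average proximity to the rest of its class flags an outlier.
print(proximity(0, 1))  # same leaf in all 3 trees -> 1.0
print(proximity(2, 3))  # same leaf in 1 of 3 trees
```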
most datasets are imbalanced - a very small population for a specific class, ~10%; the cost of a false negative can be very high; my hypothesis set didn't overlap
an ensemble of models created from many downsampled sets.
1. downsample - from 100,000 observations, throw away most of the data to get down to 200: 100 w/, 100 w/o
2. draw different negative observations and pair them with the positives; use each sample to train a model - train models on balanced sets and combine them.
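The downsampling loop in the two steps above can be sketched as follows. The dataset sizes are scaled-down stand-ins (100 positives, 1,000 negatives instead of 100,000 rows), and the actual model training is omitted so the resampling structure stays visible.

```python
import random
from collections import Counter

random.seed(2)

# Invented imbalanced data: a rare positive class and a common negative class.
positives = [(x, 1) for x in range(100)]
negatives = [(x, 0) for x in range(100, 1100)]

# Keep ALL positives; draw a fresh random sample of negatives per model,
# so each model trains on a balanced set (100 w/, 100 w/o).
balanced_sets = []
for _ in range(5):
    neg_sample = random.sample(negatives, len(positives))
    balanced_sets.append(positives + neg_sample)

# Each set is now 50/50, unlike the ~9% positive original;
# a model trained on each would then be combined by vote.
for s in balanced_sets:
    counts = Counter(label for _, label in s)
    print(counts[1], counts[0])  # 100 100
```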
extreme imbalance -