# The Elements of Statistical Learning

## Vocabulary

• Expected value - the probability-weighted average of a random variable; close to what we informally think of as "the mean"
• regularized
• non-singular problem
• singular
• endogenous, exogenous
• generalizability - prediction capability on independent test data
• bagging - bootstrap aggregation; reduces variance of an estimated prediction function.

## Chapter 7: Model Assessment and Selection

• Generalization performance - a model's prediction capability on independent test data; assessing it guides the choice of learning method/model and gives us a measure of the quality of the chosen model
• A model with zero training error is overfit to the training data and will typically generalize poorly

### 7.2: Bias, variance, and model complexity

• parameters
• Y = target variable
• X = vector of inputs
• ${\displaystyle {\hat {f}}(X)}$ prediction model/regression fit, or ${\displaystyle {\hat {f}}_{\alpha }(X)}$, the model as a function of specific tuning parameters...
• Different from ${\displaystyle f(X)}$ without the hat in that ${\displaystyle Y=f(X)+\epsilon }$
• ${\displaystyle \alpha }$ model tuning parameters
• ${\displaystyle \mathrm {T} }$ = training set
• Optimization problem seeks to minimize a loss function ${\displaystyle L(Y,{\hat {f}}(X))}$:
• ${\displaystyle (Y-{\hat {f}}(X))^{2}}$ - squared error
• ${\displaystyle |Y-{\hat {f}}(X)|}$ - absolute error
• Test error a.k.a. generalization error, prediction error over an independent sample
• ${\displaystyle {\textrm {Err}}_{\mathrm {T} }=E[L(Y,{\hat {f}}(X))|\mathrm {T} ]}$, where both X and Y are drawn randomly from their joint distribution (population)
• model complexity
• "As the model becomes more complex, it uses the training data more and is more able to adapt to more complicated underlying structures
• Bias-variance tradeoff is a function of model complexity
• As the model becomes more complex and learns the training set more closely, bias is said to decrease while variance is said to increase
• Model selection = estimating the performance of different models in order to choose the best one
• model assessment = having chosen the final model, estimating prediction error (generalization error) on new data
• Typical split: 50% training, 25% validation, 25% test (see the split sketch below)
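
A minimal sketch of the 50/25/25 split with scikit-learn (the toy data, seeds, and variable names are my own illustration, not from the book): fit on the training set, select among models on the validation set, and hold out the test set for the final assessment.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                       # toy inputs
y = X @ rng.normal(size=5) + rng.normal(size=1000)   # toy response

# First carve off 50% for training.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.5, random_state=0)

# Split the remaining 50% evenly into validation (25%) and test (25%).
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, train_size=0.5, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 500 250 250
```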

### 7.3: Bias-Variance Decomposition

• ${\displaystyle Y=f(X)+\epsilon }$ where ${\displaystyle E(\epsilon )=0}$ and ${\displaystyle {\textrm {Var}}(\epsilon )=\sigma _{\epsilon }^{2}}$
• Derive an expression for the expected prediction error of a regression fit ${\displaystyle {\hat {f}}(X)}$ at an input point ${\displaystyle X=x_{0}}$ using squared-error loss:
• ${\displaystyle {\textrm {Err}}(x_{0})=E[(Y-{\hat {f}}(x_{0}))^{2}|X=x_{0}]}$
• ${\displaystyle {\textrm {Err}}(x_{0})=\sigma _{\epsilon }^{2}+[E{\hat {f}}(x_{0})-f(x_{0})]^{2}+E[{\hat {f}}(x_{0})-E{\hat {f}}(x_{0})]^{2}}$
• ${\displaystyle {\textrm {Err}}(x_{0})=\sigma _{\epsilon }^{2}+{\textrm {Bias}}^{2}({\hat {f}}(x_{0}))+{\textrm {Var}}({\hat {f}}(x_{0}))}$
• where:
• 1st term - Irreducible error; the variance of the target around its true mean; cannot be avoided no matter how well we estimate
• 2nd term - The square of the bias; the amount by which the average of our estimate differs from the true mean
• 3rd term - The variance; the expected squared deviation of ${\displaystyle {\hat {f}}(x_{0})}$ around its mean
• For example, in a k-nearest neighbors fit, the number of neighbors k is inversely related to model complexity: small k gives a more flexible, higher-variance, lower-bias fit (see the simulation sketch after this list)
• model bias = error between best fitting function and true function
• estimation bias = error between the average estimate and the best-fitting approximation
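
A minimal Monte Carlo sketch of the decomposition at a single point x0, using a k-nearest-neighbors fit; the true f, the noise level, k, and the sample sizes are illustrative assumptions, not from the book. It repeatedly draws training sets, records the prediction at x0, and checks that the average squared prediction error matches sigma^2 + Bias^2 + Var.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)   # assumed "true" f(X) for this toy example
sigma = 0.3                           # assumed noise standard deviation
x0, k, n, reps = 0.25, 7, 100, 500

preds, errs = [], []
for _ in range(reps):
    X = rng.uniform(0, 1, size=(n, 1))            # fresh training set T
    y = f(X.ravel()) + rng.normal(0, sigma, n)
    fit = KNeighborsRegressor(n_neighbors=k).fit(X, y)
    pred = fit.predict([[x0]])[0]                 # f_hat(x0) for this T
    preds.append(pred)
    y0 = f(x0) + rng.normal(0, sigma)             # new test response at x0
    errs.append((y0 - pred) ** 2)

preds = np.array(preds)
bias2 = (preds.mean() - f(x0)) ** 2
var = preds.var()
print("Err(x0) (simulated)      ", np.mean(errs))
print("sigma^2 + bias^2 + var   ", sigma**2 + bias2 + var)
```

The two printed quantities should agree up to Monte Carlo error; decreasing k lowers the bias term and raises the variance term.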

### 7.4: Optimism of the Training Error Rate

• ${\displaystyle {\bar {err}}={\frac {1}{N}}\sum _{i=1}^{N}L(y_{i},{\hat {f}}(x_{i}))}$, the definition of the training error
• Will typically be less than the test error, since the same data are used both to fit the model and to assess its error.
• The amount by which ${\displaystyle {\bar {err}}}$ underestimates the true error depends on how strongly ${\displaystyle y_{i}}$ affects its own prediction.
• For fits that are linear in their parameters, the optimism can be estimated analytically, which leads to Cp, AIC, and BIC
• For all estimators in general, use cross-validation and the bootstrap for direct estimates of extra-sample error (see the sketch below)
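
A minimal sketch of that optimism, assuming a flexible model (an unpruned decision tree) on toy data of my own: the training error is nearly zero, while 10-fold cross-validation gives a much larger, more honest estimate of prediction error.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 1))
y = np.sin(3 * X.ravel()) + rng.normal(0, 0.5, 200)

model = DecisionTreeRegressor(random_state=0)   # flexible, low-bias fit
model.fit(X, y)
train_err = mean_squared_error(y, model.predict(X))

# cross_val_score returns negated MSE for this scoring string, so flip the sign.
cv_err = -cross_val_score(DecisionTreeRegressor(random_state=0), X, y,
                          cv=10, scoring="neg_mean_squared_error").mean()

print(f"training error: {train_err:.3f}")   # near zero (optimistic)
print(f"10-fold CV error: {cv_err:.3f}")    # much larger estimate of test error
```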

### 7.10: Cross-validation

• "In general, with a multistep modeling procedure, cross-validation must be applied to the entire sequence of modeling steps.
• Exception: initial unsupervised screening steps can be done before samples are left out.
• Example: Select the predictors with the highest variance across all samples before starting x-val. The filtering does not involve the class labels so it doesn't provide an unfair advantage.
• The model is completely retrained within each fold of the cross-validation (see the pipeline sketch below)
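
A minimal sketch of doing this correctly with a scikit-learn Pipeline: the label-based screening step (SelectKBest with an F-test score) sits inside the pipeline, so it is refit on every training fold. The contrast case, screening once on all the data before cross-validation, leaks the labels and looks optimistically accurate even on pure noise. The data sizes and the choice of screener/classifier are my own illustrative assumptions.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5000))       # pure-noise predictors
y = rng.integers(0, 2, size=50)       # labels independent of X

# Correct: label-based screening happens inside each training fold.
pipe = Pipeline([
    ("screen", SelectKBest(f_classif, k=20)),   # uses y, so it must live inside CV
    ("clf", LogisticRegression(max_iter=1000)),
])
print("honest CV accuracy:", cross_val_score(pipe, X, y, cv=5).mean())

# Wrong: screening on all the data first leaks the labels into every fold
# and yields an optimistic, apparently "predictive" accuracy.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
clf = LogisticRegression(max_iter=1000)
print("leaky CV accuracy: ", cross_val_score(clf, X_leaky, y, cv=5).mean())
```

The honest estimate should sit near chance (0.5), since the predictors carry no information about the labels.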

## Chapter 14: Unsupervised Learning

### 14.1: Introduction

• Supervised learning: input: predictor variables, output: response variables
• characterized by the loss function ${\displaystyle L(y,{\hat {y}})=(y-{\hat {y}})^{2}}$
• density estimation problem: determining properties of conditional density Pr(Y|X)
• want parameters that minimize expected error
• Unsupervised learning goal is to estimate joint density Pr(X)
• For low dimensional problems, can directly estimate density Pr(X) at all X values
• Identify low-dimensional manifolds within the X-space that represent regions of high data density
• Provide information about the association among variables and whether or not they can be considered functions of a smaller set of latent variables.
• Cluster analysis attempts to find multiple convex regions of the X-space that contain modes of Pr(X)
• This can tell us whether or not Pr(X) can be represented by a mixture of simpler densities, each representing a distinct type or class of observation (see the mixture-model sketch below)
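
A minimal sketch of estimating Pr(X) as a mixture of simpler densities, using a Gaussian mixture fit to unlabeled 2-D data; the two latent "types" of observations and all parameter choices are my own illustration, not from the book.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Unlabeled data drawn from two latent "types" of observations.
X = np.vstack([rng.normal(loc=[-2, 0], scale=0.5, size=(200, 2)),
               rng.normal(loc=[ 2, 1], scale=0.7, size=(200, 2))])

# Fit Pr(X) as a two-component Gaussian mixture (no labels used).
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print("mixing weights:", gmm.weights_)
print("component means:\n", gmm.means_)
print("log Pr(X) at a new point:", gmm.score_samples([[0.0, 0.5]]))
```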

### 14.3 Cluster Analysis

• potential goals: Estimating K (how many groups?), determining hierarchy
• "We don't use labels in the clustering but will examine posthoc which labels fall into which clusters
• Estimating the within-cluster dissimilarity ${\displaystyle W_{K}}$ as a function of the number of clusters K
• Cross-validation doesn't work here because ${\displaystyle W_{K}}$ generally decreases with increasing K
• If K is less than K*, then clusters returned by algorithm will each contain a subset of the true underlying groups
• Once individuals are partitioned into more clusters than actually exist, further increases in K tend to give only a small decrease in the criterion.
• Splitting a natural group reduces the criterion less than partitioning the union of two well-separated groups.
• Gap statistic: formalizes the search for a "kink" in the plot of ${\displaystyle W_{K}}$ versus K by comparing ${\displaystyle \log W_{K}}$ with its expectation under a uniform reference distribution (see the sketch below)
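
A minimal, simplified sketch of the idea (not the book's exact procedure): use the k-means within-cluster sum of squares as the ${\displaystyle W_{K}}$ criterion, compare ${\displaystyle \log W_{K}}$ on the data with the same quantity on uniform reference draws over the data's bounding box, and look for the K where the gap is largest. The toy data, the number of reference draws, and the largest-gap rule (rather than the book's one-standard-error rule) are illustrative choices of mine.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy data with three well-separated groups.
X = np.vstack([rng.normal(c, 0.3, size=(100, 2)) for c in ([0, 0], [3, 0], [0, 3])])

def log_wk(data, K):
    """log of the within-cluster sum of squares for a K-means fit."""
    return np.log(KMeans(n_clusters=K, n_init=10, random_state=0).fit(data).inertia_)

Ks = range(1, 8)
mins, maxs = X.min(axis=0), X.max(axis=0)
gaps = []
for K in Ks:
    # Reference: uniform data over the bounding box of X (B = 10 draws).
    ref = [log_wk(rng.uniform(mins, maxs, size=X.shape), K) for _ in range(10)]
    gaps.append(np.mean(ref) - log_wk(X, K))

for K, g in zip(Ks, gaps):
    print(K, round(g, 3))   # the gap should peak near the true K = 3 for this toy data
```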

## Chapter 15: Random Forests

• Bootstrap aggregation (bagging) works well for high-variance, low-bias procedures
• regression: average the predictions
• classification: the predictions form a committee and vote
• Random forests build a large collection of de-correlated trees and average them
• Average many noisy but approximately unbiased models (see the bagging sketch below)
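
A minimal sketch comparing a single deep tree with a random forest (an average of many de-correlated trees) on a synthetic classification problem; the data generator and hyperparameters are illustrative assumptions, not from the book.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)

# A single deep tree: low bias, high variance.
tree = DecisionTreeClassifier(random_state=0)
# A random forest: many noisy, approximately unbiased trees grown on bootstrap
# samples with random feature subsets, then combined by a committee vote.
forest = RandomForestClassifier(n_estimators=200, random_state=0)

print("single tree CV accuracy:  ", cross_val_score(tree, X, y, cv=5).mean())
print("random forest CV accuracy:", cross_val_score(forest, X, y, cv=5).mean())
```

The individual trees remain noisy; averaging their votes reduces variance, which typically shows up as higher cross-validated accuracy than the single tree.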

## Chapter 18: High Dimensional Problems

• ${\displaystyle p\gg N}$: far more predictors than observations
• High variance and overfitting are the main dangers
• "Less fitting is better": simple, highly regularized approaches tend to outperform more complex ones here (see the sketch below)
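
A minimal sketch of "less fitting is better" when p >> N, assuming simulated data with only a few truly relevant predictors: unregularized least squares essentially interpolates the training data, while heavily shrunk ridge regression gives a much lower cross-validated error. The dimensions, signal, and penalty value are my own illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
N, p = 50, 1000                         # p >> N
X = rng.normal(size=(N, p))
beta = np.zeros(p)
beta[:5] = 2.0                          # only a few truly relevant predictors
y = X @ beta + rng.normal(0, 1.0, N)

scoring = "neg_mean_squared_error"
ols = LinearRegression()                # essentially interpolates when p >> N
ridge = Ridge(alpha=100.0)              # heavy shrinkage: "less fitting"

print("OLS   CV MSE:", -cross_val_score(ols, X, y, cv=5, scoring=scoring).mean())
print("Ridge CV MSE:", -cross_val_score(ridge, X, y, cv=5, scoring=scoring).mean())
```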