The Elements of Statistical Learning

Vocabulary

  • Expected value - the probability-weighted average of a random variable; a concept close to what we think of as "the mean"
  • regularized - fit subject to an added penalty that constrains model complexity
  • non-singular problem - a problem with a unique, well-determined solution
  • singular - a problem whose solution is not uniquely determined (e.g., a rank-deficient design matrix); regularization can make a singular problem non-singular
  • endogenous, exogenous - variables determined inside vs. outside the model
  • generalizability - prediction capability on independent test data
  • bagging - bootstrap aggregation; reduces variance of an estimated prediction function.

Chapter 7: Model Assessment and Selection

  • Generalization - prediction performance on independent test data; assessing it guides the choice of learning method/model and gives us a measure of the quality of the ultimately chosen model
    • A model with zero training error is overfit to the training data and will typically generalize poorly

7.2: Bias, variance, and model complexity

  • parameters
    • $Y$ = target variable
    • $X$ = vector of inputs
    • $\hat{f}(X)$ = prediction model/regression fit, or $\hat{f}_\alpha(x)$, the model as a function of specific tuning parameters $\alpha$
      • Different from $f(X)$ without the hat in that $\hat{f}(X)$ is an estimate obtained from a training set, while $f(X)$ is the true underlying function
    • $\alpha$ = model tuning parameters
    • $\mathcal{T}$ = training set
  • Optimization problem seeks to minimize a loss function $L(Y, \hat{f}(X))$:
    • $L(Y, \hat{f}(X)) = (Y - \hat{f}(X))^2$ - squared error
    • $L(Y, \hat{f}(X)) = |Y - \hat{f}(X)|$ - absolute error
  • Test error a.k.a. generalization error, prediction error over an independent sample
    • $\mathrm{Err}_{\mathcal{T}} = \mathrm{E}[L(Y, \hat{f}(X)) \mid \mathcal{T}]$, where both X and Y are drawn randomly from their joint distribution (population)
  • model complexity
    • "As the model becomes more complex, it uses the training data more and is more able to adapt to more complicated underlying structures
    • Bias-variance tradeoff is a function of model complexity
    • As model becomes more complex as learns the training set more, bias is said to decrease while variance is said to increase
  • Model selection = estimating the performance of different models in order to choose the best one
  • model assessment = having chosen the final model, estimating prediction error (generalization error) on new data
  • A typical split: 50% training, 25% validation, 25% test (a minimal split-and-loss sketch follows this list)
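
A minimal Python sketch of the two loss functions and the 50-25-25 split. This is my own illustration, not from the book; the synthetic data, variable names, and baseline predictor are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200
X = rng.uniform(-1, 1, size=(N, 1))
y = np.sin(3 * X[:, 0]) + rng.normal(scale=0.3, size=N)

# 50% train, 25% validation (model selection), 25% test (model assessment)
idx = rng.permutation(N)
train, val, test = idx[:N // 2], idx[N // 2:3 * N // 4], idx[3 * N // 4:]

def squared_error(y_true, y_hat):
    return np.mean((y_true - y_hat) ** 2)   # L(Y, f̂(X)) = (Y - f̂(X))²

def absolute_error(y_true, y_hat):
    return np.mean(np.abs(y_true - y_hat))  # L(Y, f̂(X)) = |Y - f̂(X)|

y_hat = np.full(val.size, y[train].mean())  # trivial baseline predictor
print(squared_error(y[val], y_hat), absolute_error(y[val], y_hat))
```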

7.3: Bias-Variance Decomposition

  • $Y = f(X) + \varepsilon$, where $\mathrm{E}(\varepsilon) = 0$ and $\mathrm{Var}(\varepsilon) = \sigma_\varepsilon^2$
  • Derive an expression for the expected prediction error of a regression fit $\hat{f}(X)$ at an input point $X = x_0$, using squared-error loss:
    • $\mathrm{Err}(x_0) = \mathrm{E}[(Y - \hat{f}(x_0))^2 \mid X = x_0]$
    • $= \sigma_\varepsilon^2 + [\mathrm{E}\hat{f}(x_0) - f(x_0)]^2 + \mathrm{E}[\hat{f}(x_0) - \mathrm{E}\hat{f}(x_0)]^2$
    • $= \sigma_\varepsilon^2 + \mathrm{Bias}^2(\hat{f}(x_0)) + \mathrm{Var}(\hat{f}(x_0))$
  • where:
    • 1st term - Irreducible error; the variance of the target around its true mean $f(x_0)$; cannot be avoided no matter how well we estimate $f(x_0)$
    • 2nd term - The square of the bias; the amount by which the average of our estimate differs from the true mean
    • 3rd term - The variance; the expected squared deviation of $\hat{f}(x_0)$ around its mean
  • For example, in k-nearest neighbors, the number of neighbors k is inversely related to model complexity: $\mathrm{Err}(x_0) = \sigma_\varepsilon^2 + \big[f(x_0) - \frac{1}{k}\sum_{\ell=1}^{k} f(x_{(\ell)})\big]^2 + \sigma_\varepsilon^2 / k$, so the variance term shrinks as k grows while the bias term typically grows (a Monte Carlo check of the decomposition follows this list)
  • model bias = error between best fitting function and true function
  • estimation bias = error between the average estimate and the best-fitting approximation
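
A Monte Carlo sketch of the decomposition for k-NN regression at a single point. The true function, noise level, k, and x0 are all illustrative assumptions, not from the text.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(1)
f = lambda x: np.sin(3 * x)          # assumed "true" function
sigma, k, x0, N = 0.3, 5, 0.5, 100

preds = []
for _ in range(2000):                # draw many training sets T
    X = rng.uniform(-1, 1, size=(N, 1))
    y = f(X[:, 0]) + rng.normal(scale=sigma, size=N)
    fit = KNeighborsRegressor(n_neighbors=k).fit(X, y)
    preds.append(fit.predict([[x0]])[0])

preds = np.array(preds)
bias2 = (preds.mean() - f(x0)) ** 2   # [E f̂(x0) - f(x0)]²
variance = preds.var()                # E[f̂(x0) - E f̂(x0)]²
# their sum plus σ² approximates Err(x0)
print(bias2, variance, sigma ** 2)
```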

7.4: Optimism of the Training Error Rate

  • $\overline{\mathrm{err}} = \frac{1}{N}\sum_{i=1}^{N} L(y_i, \hat{f}(x_i))$, the definition of the training error
    • Will typically be less than the test error, since the same data are used both to fit the model and to assess its error (an illustration follows this list)
  • The amount by which $\overline{\mathrm{err}}$ underestimates the true error depends on how strongly $y_i$ affects its own prediction
  • For linear estimators, you can use $C_p$, AIC, and BIC
  • For all estimators in general, use cross-validation and bootstrap methods for direct estimates of extra-sample error.
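
A small sketch of the optimism of the training error. The synthetic data and the choice of a deep decision tree are assumptions made for illustration; a large independent sample stands in for the true error.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
def sample(n):
    X = rng.uniform(-1, 1, size=(n, 1))
    return X, np.sin(3 * X[:, 0]) + rng.normal(scale=0.3, size=n)

X_tr, y_tr = sample(100)
X_te, y_te = sample(10_000)          # large independent sample ≈ true error

model = DecisionTreeRegressor(max_depth=10).fit(X_tr, y_tr)
err_bar = np.mean((y_tr - model.predict(X_tr)) ** 2)  # training error
err = np.mean((y_te - model.predict(X_te)) ** 2)      # generalization error
print(err_bar, err)   # err_bar is typically much smaller than err
```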

7.10: Cross-validation

  • "In general, with a multistep modeling procedure, cross-validation must be applied to the entire sequence of modeling steps.
  • Exception: initial unsupervised screening steps can be done before samples are left out.
    • Example: Select the predictors with the highest variance across all samples before starting x-val. The filtering does not involve the class labels so it doesn't provide an unfair advantage.
  • The model is completely retrained at each fold in the process
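
A sketch of the rule above using scikit-learn's Pipeline. SelectKBest is a stand-in *supervised* screening step (it uses the labels), so it must sit inside the cross-validated pipeline and be refit at each fold; the pure-noise data are an assumption chosen so that any accuracy well above 50% would signal leakage.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 500))      # pure noise: true error rate is 50%
y = rng.integers(0, 2, size=100)

pipe = Pipeline([
    ("screen", SelectKBest(f_classif, k=20)),   # refit inside every fold
    ("clf", LogisticRegression(max_iter=1000)),
])
print(cross_val_score(pipe, X, y, cv=5).mean())  # ≈ 0.5, as it should be
# Selecting the 20 "best" features on ALL the data first, then
# cross-validating only the classifier, would report a badly
# optimistic accuracy on this same noise.
```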

Chapter 14: Unsupervised Learning

14.1: Introduction

  • Supervised learning: input: predictor variables, output: response variables
    • characterized by a loss function $L(y, \hat{y}) = (y - \hat{y})^2$
    • density estimation problem: determining properties of conditional density Pr(Y|X)
    • want parameters that minimize expected error
  • Unsupervised learning goal is to estimate joint density Pr(X)
  • For low dimensional problems, can directly estimate density Pr(X) at all X values
  • Identify a low-dimensional manifold within the X-space that represents regions of high data density
    • Provide information about the association among variables and whether or not they can be considered functions of a smaller set of latent variables.
  • Cluster analysis attempts to find multiple convex regions of the X-space that contain modes of Pr(X)
    • This can tell whether or not Pr(X) can be represented by a mixture of simpler densities representing distinct types of classes or observations
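
A minimal sketch of estimating Pr(X) as a mixture of simpler densities, with no labels involved. The two-component Gaussian data are an assumption for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, size=(200, 2)),
               rng.normal(4, 1, size=(200, 2))])  # two latent "types"

gm = GaussianMixture(n_components=2).fit(X)       # estimate Pr(X), no labels
print(gm.means_)                  # modes of the estimated density
print(gm.score_samples(X[:3]))    # log Pr(X) at the first few points
```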

14.3 Cluster Analysis

  • potential goals: Estimating K (how many groups?), determining hierarchy
  • "We don't use labels in the clustering but will examine posthoc which labels fall into which clusters
  • Estimating within-cluster dissimilarity $W_K$ as a function of the number of clusters K
    • cross-validation doesn't work here because $W_K$ generally decreases with increasing K
  • If K is less than the true number of groups K*, then the clusters returned by the algorithm will each contain a subset of the true underlying groups
  • Once individuals are partitioned into more clusters than actually exist (K > K*), further increases in K tend to produce much smaller decreases in the criterion.
    • Splitting a natural group reduces the criterion less than separating the union of two well-separated groups.
  • Gap statistic: formalizes finding the kink in the plot of $W_K$ as a function of K by comparing $\log W_K$ to its expectation under a uniform reference distribution (a sketch of the $W_K$ curve follows)
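
A sketch of the $W_K$-versus-K curve using K-means inertia, the squared-distance version of within-cluster dissimilarity. The three-cluster data are an illustrative assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(c, 0.5, size=(100, 2)) for c in (0, 4, 8)])  # K* = 3

for K in range(1, 8):
    W = KMeans(n_clusters=K, n_init=10).fit(X).inertia_
    print(K, W)   # large drops up to K = 3, much smaller decreases after
```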

Chapter 15: Random Forests

  • Bootstrap aggregation works well for high variance low bias procedures
    • regression: average the prediction
    • classification: the predictions form a committee and vote
  • Random forests build a large collection of de-correlated trees and average them
  • The idea: average many noisy but approximately unbiased models to reduce variance (see the sketch below)
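
A sketch contrasting a single low-bias, high-variance tree with a forest whose averaged trees cut the variance. make_friedman1 is a stand-in dataset, not the book's example.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_friedman1(n_samples=300, noise=1.0, random_state=0)
X_te, y_te = make_friedman1(n_samples=5000, noise=1.0, random_state=1)

tree = DecisionTreeRegressor(random_state=0).fit(X, y)      # high variance
forest = RandomForestRegressor(n_estimators=500,
                               random_state=0).fit(X, y)    # averaged trees
for m in (tree, forest):
    print(type(m).__name__, np.mean((y_te - m.predict(X_te)) ** 2))
# the forest's averaged prediction has markedly lower test error
```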

Chapter 18: High Dimensional Problems

  • $p \gg N$: the number of features p is much larger than the number of observations N
  • High variance and overfitting are a major concern in this setting
  • "Less fitting is better": highly regularized approaches are often the methods of choice (a ridge sketch follows)
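
A sketch of the $p \gg N$ setting on sparse synthetic data (sizes, coefficients, and the ridge penalty are my assumptions): ordinary least squares interpolates the training data and generalizes poorly, while a heavily regularized fit does better.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(7)
N, p = 50, 1000                       # far more features than observations
beta = np.zeros(p); beta[:5] = 2.0    # only a few features truly matter
X = rng.normal(size=(N, p))
y = X @ beta + rng.normal(size=N)
X_te = rng.normal(size=(5000, p))
y_te = X_te @ beta + rng.normal(size=5000)

for model in (LinearRegression(), Ridge(alpha=100.0)):
    model.fit(X, y)
    print(type(model).__name__,
          np.mean((y_te - model.predict(X_te)) ** 2))  # ridge typically wins
```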