# The Elements of Statistical Learning

## Vocabulary

- Expected value - a concept close to what we think of as "the mean"
- regularized
- non-singular problem
- singular
- endogenous, exogenous
- generalizability - prediction capability on independent test data
- bagging - bootstrap aggregation; reduces variance of an estimated prediction function.

## Chapter 7: Model Assessment and Selection

- Generalization - the criterion by which we assess a model: performance on independent test data guides the choice of learning method or model and gives a measure of the quality of the chosen model
- A model with zero training error is overfit to the training data and will typically generalize poorly

### 7.2: Bias, variance, and model complexity

- parameters
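- $Y$ = target variable
- $X$ = vector of inputs
- $\hat{f}(X)$ = prediction model/regression fit, or $\hat{f}_\alpha(X)$, the model as a function of specific tuning parameters
- Differs from $f$ without the hat in that $\hat{f}$ is estimated from the training set, whereas $f$ is the true underlying function

- $\alpha$ = model tuning parameters
- $\mathcal{T}$ = training set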

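- Optimization problem seeks to minimize a loss function $L(Y, \hat{f}(X))$:
- $L(Y, \hat{f}(X)) = (Y - \hat{f}(X))^2$ - squared error
- $L(Y, \hat{f}(X)) = |Y - \hat{f}(X)|$ - absolute error

- Test error, a.k.a. generalization error: the prediction error over an independent test sample, given the training set $\mathcal{T}$
- $\mathrm{Err}_{\mathcal{T}} = E[L(Y, \hat{f}(X)) \mid \mathcal{T}]$, where both $X$ and $Y$ are drawn randomly from their joint distribution (population)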

- Model complexity
- As the model becomes more complex, it uses the training data more and is able to adapt to more complicated underlying structures
- The bias-variance tradeoff is a function of model complexity
- As the model becomes more complex and fits the training set more closely, bias decreases while variance increases

- Model selection = estimating the performance of different models in order to choose the best one
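- Model assessment = having chosen the final model, estimating its prediction error (generalization error) on new data
- In a data-rich situation, a typical split is 50% training, 25% validation, 25% test: fit on the training set, select the model on the validation set, and assess the final model on the test set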
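The three-way split above can be sketched in a few lines of NumPy. This is only an illustration: the helper name `train_val_test_split`, the sample size of 200, and the fixed seed are assumptions made for the example, not part of the notes.

```python
import numpy as np

def train_val_test_split(n_samples, fractions=(0.5, 0.25, 0.25), seed=0):
    """Return three disjoint index arrays: (train, validation, test)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)          # shuffle once, then cut
    n_train = int(fractions[0] * n_samples)
    n_val = int(fractions[1] * n_samples)
    train = perm[:n_train]
    val = perm[n_train:n_train + n_val]
    test = perm[n_train + n_val:]              # the remainder goes to the test set
    return train, val, test

# Hypothetical data set of 200 observations
train_idx, val_idx, test_idx = train_val_test_split(200)
```

The test indices should be touched only once, after model selection on the validation set is finished; otherwise the test error underestimates the true generalization error.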

### 7.3: Bias-Variance Decomposition

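- Assume $Y = f(X) + \varepsilon$, where $E(\varepsilon) = 0$ and $\mathrm{Var}(\varepsilon) = \sigma_\varepsilon^2$
- Derive an expression for the expected prediction error of a regression fit $\hat{f}(X)$ at an input point $X = x_0$ using squared-error loss:
- $\mathrm{Err}(x_0) = E[(Y - \hat{f}(x_0))^2 \mid X = x_0] = \sigma_\varepsilon^2 + [E\hat{f}(x_0) - f(x_0)]^2 + E[\hat{f}(x_0) - E\hat{f}(x_0)]^2$, where:
- 1st term - Irreducible error; the variance of the target around its true mean $f(x_0)$; cannot be avoided no matter how well we estimate $f(x_0)$
- 2nd term - The square of the bias; the amount by which the average of our estimate differs from the true mean
- 3rd term - The variance; the expected squared deviation of $\hat{f}(x_0)$ around its mean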

- For example, in a k-nearest neighbors problem, the number of neighbors k is inversely related to the model complexity: a larger k gives a smoother, less complex fit
- model bias = error between best fitting function and true function
- estimation bias = error between the average estimate and the best-fitting approximation
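A small Monte Carlo sketch can make the k-nearest-neighbors tradeoff concrete. With fixed training inputs, the variance of the k-NN fit at a point is $\sigma_\varepsilon^2 / k$, while the bias grows as more distant neighbors are averaged in. Everything specific below is an assumption chosen for illustration: the true function $\sin(2\pi x)$, the evaluation point `x0 = 0.25`, the noise level, and the helper name `knn_bias_variance`.

```python
import numpy as np

def knn_bias_variance(k, x0=0.25, n_train=50, sigma=0.5, n_reps=2000, seed=0):
    """Estimate squared bias and variance of a k-NN regression fit at x0."""
    rng = np.random.default_rng(seed)
    f = lambda x: np.sin(2 * np.pi * x)        # hypothetical true function
    x = np.linspace(0.0, 1.0, n_train)         # fixed training inputs
    nearest = np.argsort(np.abs(x - x0))[:k]   # indices of the k nearest inputs
    preds = np.empty(n_reps)
    for r in range(n_reps):
        y = f(x) + sigma * rng.normal(size=n_train)   # fresh noisy training responses
        preds[r] = y[nearest].mean()                  # k-NN prediction at x0
    bias_sq = (preds.mean() - f(x0)) ** 2
    variance = preds.var()
    return bias_sq, variance

bias_sq_1, var_1 = knn_bias_variance(k=1)      # most complex fit
bias_sq_25, var_25 = knn_bias_variance(k=25)   # much smoother fit
```

Under these assumptions, `k=1` should show the larger variance and `k=25` the larger squared bias, which is the tradeoff the decomposition describes.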

### 7.4: Optimism of the Training Error Rate

- $\overline{\mathrm{err}} = \frac{1}{N} \sum_{i=1}^{N} L(y_i, \hat{f}(x_i))$, the definition of the training error
- Will typically be less than the test error, since the same data is being used to fit the model and assess the error.

- The amount by which $\overline{\mathrm{err}}$ underestimates the true error depends on how strongly $y_i$ affects its own prediction.
- For linear estimators, you can use $C_p$, AIC and BIC
- For all estimators in general, use cross-validation and bootstrap methods for direct estimates of extra-sample error.
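The optimism can be checked numerically. The sketch below assumes an ordinary least-squares fit with `d` inputs and additive noise of variance `sigma**2`; for that case the expected gap between the in-sample error (new responses at the same inputs) and the training error is $2 d \sigma^2 / N$. The dimensions, seed, and coefficient vector are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma, n_reps = 30, 10, 1.0, 500
X = rng.normal(size=(n, d))                    # fixed training inputs
beta = rng.normal(size=d)                      # hypothetical true coefficients

train_errs, insample_errs = [], []
for _ in range(n_reps):
    y = X @ beta + sigma * rng.normal(size=n)          # training responses
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares fit
    fitted = X @ beta_hat
    y_new = X @ beta + sigma * rng.normal(size=n)      # new responses, same inputs
    train_errs.append(np.mean((y - fitted) ** 2))
    insample_errs.append(np.mean((y_new - fitted) ** 2))

avg_optimism = np.mean(insample_errs) - np.mean(train_errs)
expected_optimism = 2 * d / n * sigma ** 2     # 2 * d * sigma^2 / N
```

The simulated `avg_optimism` should land close to `expected_optimism`, and it grows with the number of inputs `d` relative to the sample size `n`.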

### 7.10: Cross-validation

- "In general, with a multistep modeling procedure, cross-validation must be applied to the entire sequence of modeling steps."
- Exception: initial unsupervised screening steps can be done before samples are left out.
- Example: Select the predictors with the highest variance across all samples before starting cross-validation. Because the filtering does not involve the class labels, it does not give the predictors an unfair advantage.

- The model is completely retrained at each fold in the process
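The following sketch contrasts supervised screening done once on all the data with screening redone inside every fold. It assumes pure-noise predictors and labels that are independent of them, so the true error rate of any classifier is 50%. The nearest-centroid classifier, the helper names, and the sizes (50 samples, 1000 predictors, 20 kept) are choices made for this example only.

```python
import numpy as np

def screen(X, y, n_keep):
    """Supervised screening: indices of the n_keep features most correlated with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)
    return np.argsort(corr)[-n_keep:]

def nearest_centroid_predict(X_train, y_train, X_test):
    """Assign each test point to the class with the closer training centroid."""
    c0 = X_train[y_train == 0].mean(axis=0)
    c1 = X_train[y_train == 1].mean(axis=0)
    d0 = np.sum((X_test - c0) ** 2, axis=1)
    d1 = np.sum((X_test - c1) ** 2, axis=1)
    return (d1 < d0).astype(int)

def cv_error(X, y, n_folds=5, n_keep=20, screen_inside=True, seed=0):
    """K-fold misclassification rate, screening inside or outside the folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    # Screening on all samples uses the held-out labels: this is the leak
    keep_all = None if screen_inside else screen(X, y, n_keep)
    n_wrong = 0
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        if screen_inside:
            keep = screen(X[train_idx], y[train_idx], n_keep)   # refit per fold
        else:
            keep = keep_all
        pred = nearest_centroid_predict(
            X[train_idx][:, keep], y[train_idx], X[test_idx][:, keep]
        )
        n_wrong += np.sum(pred != y[test_idx])
    return n_wrong / len(y)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 1000))                # pure-noise predictors
y = rng.permutation(np.repeat([0, 1], 25))     # labels independent of X
err_wrong = cv_error(X, y, screen_inside=False)
err_right = cv_error(X, y, screen_inside=True)
```

With screening outside the folds, the estimated error should come out far below 50% even though the predictors carry no signal; with screening inside the folds, the estimate should sit near the true 50%.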

## Chapter 14: Unsupervised Learning

### 14.1: Introduction

- Supervised learning: input: predictor variables, output: response variables
- characterized by a loss function, e.g. $L(y, \hat{y}) = (y - \hat{y})^2$
- density estimation problem: determining properties of conditional density Pr(Y|X)
- want parameters that minimize expected error

- Unsupervised learning: the goal is to infer properties of the joint density Pr(X) without the help of a supervisor
- For low-dimensional problems, we can directly estimate the density Pr(X) at all X values
- Identify a low-dimensional manifold within the X-space that represents high-density data
- Provide information about the association among variables and whether or not they can be considered functions of a smaller set of latent variables.

- Cluster analysis attempts to find multiple convex regions of the X-space that contain modes of Pr(X)
- This can tell whether or not Pr(X) can be represented by a mixture of simpler densities representing distinct types of classes or observations

### 14.3 Cluster Analysis

- Potential goals: estimating K (how many groups?), determining hierarchy
- We don't use labels in the clustering but will examine post hoc which labels fall into which clusters
- Estimating within-cluster dissimilarity $W_K$ as a function of the number of clusters $K$
- Cross-validation doesn't work here because $W_K$ generally decreases with increasing $K$, even when it is evaluated on an independent test set

- Let $K^*$ be the true number of groups. If $K$ is less than $K^*$, then the clusters returned by the algorithm will each contain a subset of the true underlying groups
- Once $K$ exceeds $K^*$, at least one natural group must be split, so further increases in $K$ tend to give a smaller decrease in the criterion.
- Splitting a natural group reduces the criterion less than partitioning the union of two well-separated groups.

- Gap statistic: formalizes the search for a kink in the plot of $W_K$ against $K$ by comparing the curve $\log W_K$ with the curve obtained from data uniformly distributed over a rectangle containing the data
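The behavior of the criterion as K grows can be explored with a bare-bones K-means. This sketch uses the within-cluster sum of squared distances to the centroid as the dissimilarity criterion, which is an assumption, as are the three synthetic Gaussian blobs, the ten random restarts, and the helper name `kmeans_within_ss`. It does not implement the gap statistic itself.

```python
import numpy as np

def kmeans_within_ss(X, k, n_iter=50, seed=0):
    """Run Lloyd's algorithm and return the within-cluster sum of squares."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # random data points
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)          # assign each point to nearest center
        for j in range(k):
            if np.any(labels == j):            # keep the old center if a cluster empties
                centers[j] = X[labels == j].mean(axis=0)
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dists.min(axis=1).sum()

# Hypothetical data: three well-separated groups, so K* = 3
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(size=(50, 2)) for c in blob_centers])

# Best of 10 random restarts for each K = 1, ..., 6 (W[0] corresponds to K = 1)
W = [min(kmeans_within_ss(X, k, seed=s) for s in range(10)) for k in range(1, 7)]
```

Plotting `W` against K for data like this should show steep drops up to the true number of groups and much smaller drops afterwards, which is the kink the heuristic looks for. Random restarts are used because Lloyd's algorithm can stop at a local minimum.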

## Chapter 15: Random Forests

- Bootstrap aggregation (bagging) works well for high-variance, low-bias procedures such as trees
- regression: average the prediction
- classification: the predictions form a committee and vote

- Random forests build a large collection of de-correlated trees and average them
- Average many noisy but approximately unbiased models.
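Why de-correlation matters can be seen from the variance of an average of $B$ identically distributed predictions with variance $\sigma^2$ and pairwise correlation $\rho$, which is $\rho \sigma^2 + \frac{1 - \rho}{B} \sigma^2$. Averaging shrinks only the second term, so lowering $\rho$ is what lets more trees keep helping. The simulation below is a sketch that checks this formula with Gaussian stand-ins for tree predictions; no actual trees are fit, and the function names and parameter values are assumptions.

```python
import numpy as np

def variance_of_average(n_trees, rho, sigma=1.0, n_draws=100_000, seed=0):
    """Empirical variance of the mean of n_trees equally correlated predictions."""
    rng = np.random.default_rng(seed)
    shared = rng.normal(size=(n_draws, 1))         # component common to all "trees"
    own = rng.normal(size=(n_draws, n_trees))      # component unique to each "tree"
    preds = sigma * (np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * own)
    return preds.mean(axis=1).var()

def theoretical_variance(n_trees, rho, sigma=1.0):
    """rho * sigma^2 + (1 - rho) / B * sigma^2."""
    return rho * sigma ** 2 + (1.0 - rho) / n_trees * sigma ** 2

empirical = variance_of_average(n_trees=10, rho=0.5)
theory = theoretical_variance(n_trees=10, rho=0.5)
```

Calling `theoretical_variance` with a smaller `rho` and the same `n_trees` gives a lower floor, which is the motivation for random feature selection at each split.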

## Chapter 18: High Dimensional Problems

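- High variance and overfitting are the major concerns when the number of predictors is much larger than the number of samples
- Less fitting is better: simple, highly regularized approaches are often the methods of choice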