The Elements of Statistical Learning

From Colettapedia


  • Expected value - a concept close to what we think of as "the mean"
  • regularized
  • non-singular problem
  • singular
  • endogenous, exogenous
  • generalizability - prediction capability on independent test data
  • bagging - bootstrap aggregation; reduces variance of an estimated prediction function.

Chapter 7: Model Assessment and Selection

  • Generalization - the criterion by which we assess a model: it guides the choice of learning method/model and gives us a measure of the quality of the chosen model
    • A model with zero training error is overfit to the training data and will typically generalize poorly

7.2: Bias, variance, and model complexity

  • parameters
    • Y = target variable
    • X = vector of inputs
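    • $\hat{f}(X)$ = prediction model/regression fit, or $\hat{f}_\alpha(X)$, the model as a function of specific tuning parameters...
      • Different from $f(X)$ without the hat in that $f$ is the true underlying function, while $\hat{f}$ is its estimate fit from the training set
    • $\alpha$ = model tuning parameters
    • $\mathcal{T}$ = training set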
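  • Optimization problem seeks to minimize a loss function $L(Y, \hat{f}(X))$:
    • $L(Y, \hat{f}(X)) = (Y - \hat{f}(X))^2$ - squared error
    • $L(Y, \hat{f}(X)) = |Y - \hat{f}(X)|$ - absolute error
  • Test error a.k.a. generalization error, prediction error over an independent sample
    • $\mathrm{Err}_{\mathcal{T}} = E[L(Y, \hat{f}(X)) \mid \mathcal{T}]$ for training set $\mathcal{T}$, where both X and Y are drawn randomly from their joint distribution (population)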
  • model complexity
    • "As the model becomes more complex, it uses the training data more and is more able to adapt to more complicated underlying structures."
    • Bias-variance tradeoff is a function of model complexity
    • As the model becomes more complex and learns the training set more closely, bias decreases while variance increases
  • Model selection = estimating the performance of different models in order to choose the best one
  • model assessment = having chosen the final model, estimating prediction error (generalization error) on new data
  • Typical split when data are plentiful: 50% training, 25% validation (for model selection), 25% test (for model assessment)
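
The 50/25/25 split can be sketched in a few lines of NumPy. This is a minimal illustration, not a prescribed procedure: the function name `split_50_25_25` and the seed are assumptions made for the example.

```python
import numpy as np

def split_50_25_25(n_samples, seed=0):
    """Return index arrays (train, validation, test) for a 50/25/25 split."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)      # shuffle so the split is random
    n_train = n_samples // 2
    n_val = n_samples // 4
    train = perm[:n_train]
    val = perm[n_train:n_train + n_val]
    test = perm[n_train + n_val:]          # the remainder goes to the test set
    return train, val, test

train_idx, val_idx, test_idx = split_50_25_25(200)
print(len(train_idx), len(val_idx), len(test_idx))
```

The validation indices would be used to compare candidate models, and the test indices would be touched only once, to assess the model finally chosen.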

7.3: Bias-Variance Decomposition

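  • $Y = f(X) + \varepsilon$ where $E(\varepsilon) = 0$ and $\mathrm{Var}(\varepsilon) = \sigma_\varepsilon^2$
  • Derive an expression for the expected prediction error of a regression fit $\hat{f}(X)$ at an input point $X = x_0$ using squared-error loss:
    • $\mathrm{Err}(x_0) = E[(Y - \hat{f}(x_0))^2 \mid X = x_0] = \sigma_\varepsilon^2 + [E\hat{f}(x_0) - f(x_0)]^2 + E[\hat{f}(x_0) - E\hat{f}(x_0)]^2$
  • where: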
    • 1st term - Irreducible error; the variance of the target around its true mean $f(x_0)$; cannot be avoided no matter how well we estimate $f(x_0)$
    • 2nd term - The square of the bias; the amount by which the average of our estimate differs from the true mean
    • 3rd term - The variance; the expected squared deviation of $\hat{f}(x_0)$ around its mean
  • For example, in a k-nearest neighbors problem, the number of neighbors k is inversely related to the model complexity
  • model bias = error between best fitting function and true function
  • estimation bias = error between the average estimate and the best-fitting approximation
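
The k-nearest-neighbors example can be checked numerically. The sketch below is a Monte Carlo estimate of the squared bias and the variance of a k-NN regression fit at a single point, under assumptions that are not in the notes: a sine curve as the true function, a fixed grid of training inputs, Gaussian noise, and the hypothetical helper names `knn_predict` and `bias_variance_at`.

```python
import numpy as np

def knn_predict(x_train, y_train, x0, k):
    """Average the targets of the k training points nearest to x0."""
    nearest = np.argsort(np.abs(x_train - x0))[:k]
    return y_train[nearest].mean()

def bias_variance_at(x0, k, n_train=50, n_reps=2000, sigma=0.3, seed=0):
    """Monte Carlo estimate of (squared bias, variance) of k-NN at x0."""
    rng = np.random.default_rng(seed)
    f = lambda x: np.sin(2 * np.pi * x)        # assumed true function
    x_train = np.linspace(0.0, 1.0, n_train)   # fixed training inputs
    preds = np.empty(n_reps)
    for r in range(n_reps):
        # Draw a fresh training set: same inputs, new noise.
        y_train = f(x_train) + rng.normal(0.0, sigma, n_train)
        preds[r] = knn_predict(x_train, y_train, x0, k)
    bias_sq = (preds.mean() - f(x0)) ** 2      # (average estimate - truth)^2
    variance = preds.var()                     # spread of the estimate
    return bias_sq, variance

for k in (1, 5, 25):
    b2, var = bias_variance_at(x0=0.25, k=k)
    print(f"k={k:2d}  bias^2={b2:.4f}  variance={var:.4f}")
```

In this setup a small k (high complexity) should show low bias and high variance, and a large k the reverse, which is the tradeoff described above.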

7.4: Optimism of the Training Error Rate

  • $\overline{\mathrm{err}} = \frac{1}{N}\sum_{i=1}^{N} L(y_i, \hat{f}(x_i))$, the definition of the training error
    • Will typically be less than the test error since the same data is being used to fit the model and assess the error.
  • The amount by which $\overline{\mathrm{err}}$ underestimates the true error depends on how strongly $y_i$ affects its own prediction.
  • For linear estimators, you can use $C_p$, AIC and BIC
  • For all estimators in general, use cross-validation and bootstrap methods for direct estimates of extra-sample error.
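
The dependence on how strongly $y_i$ affects its own prediction can be written out. As a sketch in the book's notation, with symbols these notes do not define (the in-sample error $\mathrm{Err}_{\mathrm{in}}$, the average optimism $\omega$, and the number of inputs $d$ of a linear fit), and assuming the additive-error model with noise variance $\sigma_\varepsilon^2$:

```latex
\begin{align*}
\mathrm{op} &\equiv \mathrm{Err}_{\mathrm{in}} - \overline{\mathrm{err}}, \\
\omega &\equiv E_{\mathbf{y}}(\mathrm{op})
        = \frac{2}{N} \sum_{i=1}^{N} \mathrm{Cov}(\hat{y}_i, y_i), \\
E_{\mathbf{y}}(\mathrm{Err}_{\mathrm{in}})
       &= E_{\mathbf{y}}(\overline{\mathrm{err}}) + 2 \cdot \frac{d}{N}\, \sigma_\varepsilon^2
       \quad \text{(linear fit with $d$ inputs)}.
\end{align*}
```

The harder the data are fit, the larger the covariance between $\hat{y}_i$ and $y_i$, and the more optimistic the training error becomes.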

7.10: Cross-validation

  • "In general, with a multistep modeling procedure, cross-validation must be applied to the entire sequence of modeling steps."
  • Exception: initial unsupervised screening steps can be done before samples are left out.
    • Example: Select the predictors with the highest variance across all samples before starting cross-validation. The filtering does not involve the class labels, so it does not give the predictors an unfair advantage.
  • The model is completely retrained at each fold in the process
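
Why label-based screening has to sit inside the cross-validation loop can be seen on pure-noise data, where the true error rate of any classifier is 50%. The sketch below is an illustration under assumptions, not the book's exact experiment: it uses a simple nearest-centroid classifier, and the helper names (`top_correlated`, `nearest_centroid_predict`, `cv_error`) and all sizes are hypothetical.

```python
import numpy as np

def top_correlated(X, y, n_keep):
    """Indices of the n_keep columns of X with largest |correlation| with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.argsort(-np.abs(corr))[:n_keep]

def nearest_centroid_predict(X_train, y_train, X_test):
    """Assign each test row to the class (0 or 1) with the closer centroid."""
    c0 = X_train[y_train == 0].mean(axis=0)
    c1 = X_train[y_train == 1].mean(axis=0)
    d0 = np.linalg.norm(X_test - c0, axis=1)
    d1 = np.linalg.norm(X_test - c1, axis=1)
    return (d1 < d0).astype(int)

def cv_error(X, y, n_keep=20, n_folds=5, screen_inside=True, seed=0):
    """K-fold CV error, with supervised screening inside or outside the folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    if not screen_inside:
        # Wrong way: the screening has already seen the held-out labels.
        keep_all = top_correlated(X, y, n_keep)
    n_errors = 0
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        if screen_inside:
            # Right way: screen using the training part of this fold only.
            keep = top_correlated(X[train_idx], y[train_idx], n_keep)
        else:
            keep = keep_all
        pred = nearest_centroid_predict(
            X[train_idx][:, keep], y[train_idx], X[test_idx][:, keep]
        )
        n_errors += int(np.sum(pred != y[test_idx]))
    return n_errors / len(y)

# Pure noise: the predictors carry no information about the labels.
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(50, 2000))
y = np.repeat([0, 1], 25)
wrong_err = cv_error(X, y, screen_inside=False)
right_err = cv_error(X, y, screen_inside=True)
print(f"screen outside CV: {wrong_err:.2f}   screen inside CV: {right_err:.2f}")
```

Screening outside the loop should report an error far below the screening-inside estimate, even though no classifier can do better than chance on this data.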

Chapter 14: Unsupervised Learning

14.1: Introduction

  • Supervised learning: input: predictor variables, output: response variables
    • characterized by a loss function, e.g. $L(y, \hat{y}) = (y - \hat{y})^2$
    • density estimation problem: determining properties of conditional density Pr(Y|X)
    • want parameters that minimize expected error
  • The goal of unsupervised learning is to estimate the joint density Pr(X)
  • For low dimensional problems, can directly estimate density Pr(X) at all X values
  • Identify a low dimensional manifold within the X-space that represents high density data
    • Provide information about the association among variables and whether or not they can be considered functions of a smaller set of latent variables.
  • Cluster analysis attempts to find multiple convex regions of the X-space that contain modes of Pr(X)
    • This can tell whether or not Pr(X) can be represented by a mixture of simpler densities representing distinct types of classes or observations

14.3 Cluster Analysis

  • potential goals: Estimating K (how many groups?), determining hierarchy
  • "We don't use labels in the clustering but will examine post hoc which labels fall into which clusters."
  • Estimating within-cluster dissimilarity $W_K$ as a function of the number of clusters K
    • cross-validation doesn't work here because $W_K$ generally decreases with increasing K, even when evaluated on an independent test set
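  • If K is less than the true number of groups K*, then the clusters returned by the algorithm will each contain a subset of the true underlying groups
  • Once K exceeds K*, at least one natural group must be split across clusters, so each further increase in K tends to give a smaller decrease in the criterion.
    • Splitting a natural group reduces the criterion less than partitioning the union of two well-separated groups.
  • Gap statistic: automates the search for a kink in the plot of $W_K$ versus K by comparing $\log W_K$ with the curve obtained from data uniformly distributed over a rectangle containing the data; the estimated K* is where the gap between the two curves is largest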
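
The kink in the criterion can be seen on synthetic data. The sketch below is not the gap statistic itself: it runs a basic k-means (Lloyd's algorithm with a farthest-first initialization, chosen here for determinism) on three well-separated Gaussian blobs and uses the within-cluster sum of squares as a stand-in for $W_K$. The function name `kmeans_wk` and the data are assumptions made for the example.

```python
import numpy as np

def kmeans_wk(X, k, n_iter=50):
    """Run k-means and return the within-cluster sum of squared distances."""
    # Farthest-first initialization: start from the point farthest from the
    # overall mean, then repeatedly add the point farthest from all centers.
    centers = [X[np.argmax(np.linalg.norm(X - X.mean(axis=0), axis=1))]]
    while len(centers) < k:
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its points.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return float(np.sum(dists.min(axis=1) ** 2))

# Three well-separated groups, so the true number of groups is K* = 3.
rng = np.random.default_rng(0)
blob_means = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 10.0]])
X = np.vstack([m + rng.normal(size=(50, 2)) for m in blob_means])

W = {k: kmeans_wk(X, k) for k in range(1, 6)}
for k in range(1, 6):
    print(f"K={k}  W_K={W[k]:.1f}")
```

The decreases in the criterion should be large up to K = 3 and much smaller afterwards, which is the kink the gap statistic is designed to locate automatically.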

Chapter 15: Random Forests

  • Bootstrap aggregation (bagging) works especially well for high-variance, low-bias procedures such as trees
    • regression: average the prediction
    • classification: the predictions form a committee and vote
  • Random forests build a large collection of de-correlated trees and average them
  • Average many noisy but approximately unbiased models.
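
Why de-correlating the trees matters can be seen from the variance of an average of B identically distributed variables, each with variance $\sigma^2$ and pairwise correlation $\rho$, which is $\rho\sigma^2 + \frac{1-\rho}{B}\sigma^2$. The simulation below is a sketch that checks this formula with Gaussian variables standing in for tree predictions; the function names and parameter values are assumptions made for the example.

```python
import numpy as np

def simulated_variance_of_average(B, rho, sigma=1.0, n_reps=20000, seed=0):
    """Simulate the variance of the mean of B equally correlated variables."""
    rng = np.random.default_rng(seed)
    # One shared component plus B independent components gives each variable
    # variance sigma^2 and every pair correlation rho.
    shared = rng.normal(size=(n_reps, 1))
    indep = rng.normal(size=(n_reps, B))
    Z = sigma * (np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * indep)
    return Z.mean(axis=1).var()

def theoretical_variance_of_average(B, rho, sigma=1.0):
    """rho * sigma^2 + (1 - rho) / B * sigma^2."""
    return rho * sigma**2 + (1.0 - rho) / B * sigma**2

for rho in (0.0, 0.3, 0.9):
    sim = simulated_variance_of_average(B=50, rho=rho)
    theo = theoretical_variance_of_average(B=50, rho=rho)
    print(f"rho={rho:.1f}  simulated={sim:.3f}  theoretical={theo:.3f}")
```

As B grows, the second term vanishes but the first does not, so the benefit of averaging is limited by the correlation between the trees. Reducing that correlation is what the random selection of inputs in random forests aims for.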

Chapter 18: High Dimensional Problems

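  • When the number of features p is much larger than the number of observations N, high variance and overfitting are the major concerns
  • Simple, highly regularized approaches often work best: "less fitting is better"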