From Colettapedia


  • Test MathJax on this page: y=\sum_{i=1}^n x \delta \bar{Xfc}
  • Exploratory Data Analysis - branch of statistics emphasizing visuals, developed by John Tukey
  • Given eqn y = mx + b, dependent variable is y, and independent variable is x.
  • Statistical unit = one member of entities being studied. One person in population study, one image in classification problem.
  • Conditional probability - the probability of some event A, given the occurrence of some other event B. Conditional probability is written P(A|B), and is read "the (conditional) probability of A, given B"
  • Joint probability is the probability of both events happening together. The joint probability of A and B is written P(A \cap B), P(AB) or P(A, B)
  • Marginal probability is the unconditional probability of an event, obtained by summing the joint probability over every outcome of the other variable. For example, if X has two possible outcomes with corresponding events B and B', then P(A) = P(A \cap B) + P(A \cap B').
  • column rank and row rank
  • degrees of freedom = the number of values in the final calculation that are free to vary.
  • residuals = for each observation, the residual is the difference between that observation and its estimated value. In the simplest case the estimate is the average of all the observations.
    • the sum of the residuals about the sample mean is necessarily 0.
  • probability mass function = pmf is for DISCRETE random variables
  • principle of indifference, which assigns equal probabilities to all possibilities.

Error bars

null hypothesis

The null hypothesis typically corresponds to a general or default position. The practice of science involves formulating and testing hypotheses, assertions that are capable of being proven false using a test of observed data. The null hypothesis can never be proven. A set of data can only reject a null hypothesis or fail to reject it.

binomial probability

binomial distribution

  • p = probability that event will occur
  • q = probability that event won't occur
  • p and q are complementary: p + q = 1
  • n = number of trials
  • k = number of successes
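
The symbols above combine into the binomial probability mass function, P(k) = C(n, k) p^k q^(n-k). A minimal sketch, assuming independent trials; the function name `binomial_pmf` is made up for illustration:

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    q = 1.0 - p  # complementary probability, so p + q = 1
    return comb(n, k) * p**k * q**(n - k)

# Probability of exactly 3 heads in 5 fair coin flips
print(binomial_pmf(3, 5, 0.5))  # 0.3125
```

Summing `binomial_pmf(k, n, p)` over k = 0..n should give 1, which is a quick sanity check.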

Binomial approximation

  • standard score = how many standard deviations an observation is above or below the mean.

Tests for Categorical Data

  • Goodness of fit for single categorical variable
    • Compare the observed counts to the expected counts; each category adds a "contribution term" (observed - expected)^2 / expected
    • The contribution terms measure the relative distance of the observed counts from the expected counts
    • If the null hypothesis is true, observed is close to expected
    • The test statistic is the sum of the contribution terms and has a chi-squared distribution
    • Get p-values from the chi-squared distribution with k-1 degrees of freedom, where k = number of categories (i.e., classes)
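 /* Goodness of fit for one-way tables */
 proc freq data=<whatevs>;
   tables var1 / chisq;                 /* null: equal proportions */
   tables var2 / chisq testp=(values);  /* null: the proportions listed in testp= */
 run;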
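
The goodness-of-fit statistic can also be computed by hand. A minimal sketch; the function name and the example counts are made up for illustration, and the p-value lookup in the chi-squared distribution is left out:

```python
def chi_square_gof(observed, expected_props):
    """Return (statistic, degrees of freedom, per-category contributions)."""
    n = sum(observed)
    expected = [n * p for p in expected_props]
    # contribution term for each category: (observed - expected)^2 / expected
    contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
    return sum(contributions), len(observed) - 1, contributions

# Hypothetical counts for 4 categories, null hypothesis of equal proportions
stat, df, contrib = chi_square_gof([18, 22, 20, 40], [0.25, 0.25, 0.25, 0.25])
print(stat, df)
```

A large contribution from one category (the last one here) shows where the observed counts depart from the null hypothesis.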

Tests for two-way variables

  • test for homogeneity - the distribution of proportions is the same across the populations
  • test of independence - two categorical variables measured on the same sample are not associated
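 proc freq data=<whatevs>;
   tables var1*var2 / chisq exact;     /* exact (or fisher) requests Fisher's exact test */
   tables var1*var2 / chisq cellchi2;  /* cellchi2 prints each cell's chi-square contribution */
 run;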
  • Use Fisher's exact test if the sample size is small.
    • R: fisher.test(table)
  • cellchi2 is the cell contribution - how far the observed count is from the expected count on a per-cell basis
  • The weight statement names the variable that holds the cell counts when the data are already tabulated
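
For a two-way table the expected count in each cell is (row total x column total) / grand total, and the cell contributions are computed as in the one-way case. A minimal sketch with a made-up 2x2 table; the function name is hypothetical:

```python
def chi_square_independence(table):
    """Return (statistic, degrees of freedom, per-cell contributions)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # per-cell contribution: (observed - expected)^2 / expected,
    # where expected = row total * column total / n
    contributions = [
        [(obs - r * c / n) ** 2 / (r * c / n) for obs, c in zip(row, col_totals)]
        for row, r in zip(table, row_totals)
    ]
    stat = sum(sum(row) for row in contributions)
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df, contributions

stat, df, contrib = chi_square_independence([[10, 20], [30, 40]])
print(stat, df)
```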


  • "Student's t-distribution"
  • When data are normally distributed
  • Can test hypotheses about the mean/center of the distribution

One-sample t-test

  • Tests whether the mean is greater than/less than/equal to some value
  • SAS proc means
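
The one-sample t statistic is the distance of the sample mean from the hypothesized value, in units of the standard error. A minimal sketch; the function name is made up, and the p-value lookup in the t-distribution is left out:

```python
from math import sqrt
from statistics import mean, stdev

def one_sample_t(sample, mu0):
    """Return (t statistic, degrees of freedom) for H0: mean == mu0."""
    n = len(sample)
    standard_error = stdev(sample) / sqrt(n)  # stdev uses the n-1 denominator
    return (mean(sample) - mu0) / standard_error, n - 1

t, df = one_sample_t([1, 2, 3, 4, 5], 2)
print(t, df)
```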

Two-Sample t-test

  • Test whether two population means are equal.
  • Unpaired or independent samples t-test: Are the variances the same?
    • If no, it's called "two-Sample t-test" or "unequal variances t-test" or "a Welch's t-test"
    • If yes it's called a "pooled t-test" or "Student's t-test"
    • F-statistic tests whether the variances are equal
  • Paired or repeated measurements t-test - the before and after observations are subtracted; is the mean difference different from zero?
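
The two unpaired variants differ only in how the standard error and the degrees of freedom are computed. A minimal sketch of both; the function names are made up, and the Welch degrees of freedom use the Welch-Satterthwaite approximation:

```python
from math import sqrt
from statistics import mean, variance

def welch_t(x, y):
    """Unequal-variances t-test: return (t, approximate degrees of freedom)."""
    vx, vy = variance(x) / len(x), variance(y) / len(y)
    t = (mean(x) - mean(y)) / sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx**2 / (len(x) - 1) + vy**2 / (len(y) - 1))
    return t, df

def pooled_t(x, y):
    """Equal-variances (Student's) t-test: return (t, degrees of freedom)."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    t = (mean(x) - mean(y)) / sqrt(pooled_var * (1 / nx + 1 / ny))
    return t, nx + ny - 2

x, y = [1, 2, 3, 4, 5], [2, 4, 6, 8, 10]
print(welch_t(x, y), pooled_t(x, y))
```

With equal sample sizes the two t statistics coincide and only the degrees of freedom differ. A paired test reduces to a one-sample test on the differences.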

Nonparametric tests

  • Hypothesis testing when you can't assume data comes from normal distribution
  • a lot of non-parametric approaches are based on ranks
  • do not depend on normality
  • Whereas the other tests are really tests for means, nonparametric tests are typically tests for medians

One-sample tests

  • SAS proc univariate for these
  • Sign test
    • The sign test is necessarily a one-sample test, so if you give the function call two samples, it will assume they are a paired dataset
    • PAIRED observations with test x > y, x = y, or x < y.
    • Test for consistent differences between pairs, such as the weight of subjects pre and post treatment.
    • Does one member of the pair tend to be greater than the other?
    • Does NOT assume symmetric distribution of the differences around the median
    • Does NOT use the magnitude of the difference
  • Wilcoxon Signed Ranks Test
    • A quantitative Sign Test
    • DOES use magnitude of difference of paired observations
  • Confidence interval based on signed rank test
    • the set of values for which you wouldn't have rejected the null hypothesis
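
The sign test only counts how many paired differences are positive, then refers that count to a Binomial(n, 0.5) distribution. A minimal sketch of a two-sided exact version; the function name and the weights are made up, and ties are dropped (one common convention):

```python
from math import comb

def sign_test(x, y):
    """Return (number of positive differences, number of non-tied pairs, p-value)."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # tied pairs are dropped
    n = len(diffs)
    n_pos = sum(d > 0 for d in diffs)
    # two-sided exact p-value: double the smaller tail of Binomial(n, 0.5)
    k = min(n_pos, n - n_pos)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return n_pos, n, min(1.0, 2 * tail)

pre = [80, 75, 90, 85, 70, 95]   # hypothetical weights before treatment
post = [78, 74, 88, 85, 69, 90]  # hypothetical weights after treatment
print(sign_test(pre, post))  # (5, 5, 0.0625)
```

The magnitudes of the differences never enter the calculation, which is the contrast with the Wilcoxon signed ranks test.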

Two or more sample nonparametric tests

  • Compare the centers of two or more distributions that are continuous, but not normal/gaussian.
  • use deviations from the median and use the signed ranks
  • SAS: proc npar1way wilcoxon
    • Class variable is used for the two or more groups
    • Otherwise use proc npar1way anova
  • Wilcoxon Rank Sum Test/Mann-Whitney U statistic
    • Null hypothesis: it is equally likely that a randomly selected value from one sample will be less than or greater than a randomly selected value from a second sample.
    • Nonparametric counterpart of the two-sample t-test
    • R: wilcox.test
    • Intermix the observations, sort, and rank all observations. Then compare the mean rank of the two samples.
    • Can also do confidence interval
  • Kruskal-Wallis
    • Non-parametric method for testing whether samples originate from the same distribution
    • nonparametric counterpart of one-way ANOVA
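
The ranking step of the Wilcoxon rank sum test can be sketched as follows. The helper names are made up, tied values get the average of their ranks, and the p-value calculation is left out:

```python
def average_ranks(values):
    """1-based ranks of values, with ties sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def rank_sum(x, y):
    """Return (rank sum W of x, Mann-Whitney U of x) after pooling x and y."""
    ranks = average_ranks(list(x) + list(y))
    w = sum(ranks[: len(x)])
    u = w - len(x) * (len(x) + 1) / 2
    return w, u

print(rank_sum([1, 3, 5], [2, 4, 6]))  # (9.0, 3.0)
```

U counts the (x, y) pairs in which the x value is larger, with ties counted as one half, so it ranges from 0 to len(x) * len(y).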

Goodness of fit for continuous distributions

one sample

  • empirical cumulative distribution function, compare to theoretical
    • R: ecdf(data)
  • Kolmogorov-Smirnov
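    • Not quite as good, because the statistic uses only the maximum difference between the empirical and theoretical distribution functions
  • Do not estimate the parameters of the theoretical distribution from the same data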
  • R: ks.test(x, y="name")
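
The one-sample Kolmogorov-Smirnov statistic is the largest vertical gap between the empirical CDF and the theoretical CDF. A minimal sketch; the function name is made up, the theoretical CDF is passed in fully specified, and the p-value is left out:

```python
from statistics import NormalDist

def ks_statistic(sample, cdf):
    """Largest gap between the empirical CDF of sample and the given cdf."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # the empirical CDF jumps from i/n to (i+1)/n at x; check both sides
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

# Compare a small made-up sample against a standard normal distribution
sample = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.7]
print(ks_statistic(sample, NormalDist(mu=0.0, sigma=1.0).cdf))
```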


two sample

  • Could have two distributions with the same mean but different shapes.
  • R: ks.test(X, Y)

Estimating Parameter Values

  • R: MASS package, fitdistr(data, densfun="exponential")
    • obtain maximum likelihood estimate
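
For the exponential distribution the maximum likelihood estimate has a closed form: the rate is the reciprocal of the sample mean. A minimal sketch with made-up data; the function names are hypothetical and this does not reproduce fitdistr's standard errors:

```python
from math import log
from statistics import mean

def exponential_mle_rate(data):
    """Maximum likelihood estimate of the exponential rate parameter."""
    return 1.0 / mean(data)

def exponential_log_likelihood(rate, data):
    """Log-likelihood of the data under an exponential(rate) model."""
    return len(data) * log(rate) - rate * sum(data)

data = [1.0, 2.0, 3.0]
rate = exponential_mle_rate(data)
print(rate, exponential_log_likelihood(rate, data))
```

Evaluating the log-likelihood at rates slightly above and below the estimate should give smaller values, which confirms the estimate is a maximum.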

Kernel Smoothing Density Function

  • Matlab function
  • [f,xi,u] = ksdensity(x)
  • Computes a probability density estimate of the sample in the vector x
  • f is the vector of density values evaluated at the points xi.
  • u is the width of the kernel-smoothing window, which is calculated.
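
The idea behind a kernel density estimate can be sketched in a few lines: place a Gaussian bump of a given width on every observation and average the bumps. This is not Matlab's implementation; the function name is made up, and the default bandwidth below is a normal-reference rule of thumb chosen as an assumption:

```python
from math import exp, pi, sqrt
from statistics import stdev

def gaussian_kde(x, points, bandwidth=None):
    """Return (density values at points, bandwidth used)."""
    n = len(x)
    if bandwidth is None:
        # assumption: rule-of-thumb bandwidth 1.06 * s * n^(-1/5)
        bandwidth = 1.06 * stdev(x) * n ** (-1 / 5)
    norm = 1.0 / (n * bandwidth * sqrt(2 * pi))
    f = [
        norm * sum(exp(-0.5 * ((p - xi) / bandwidth) ** 2) for xi in x)
        for p in points
    ]
    return f, bandwidth

xi = [0.5 * i for i in range(-10, 31)]  # evaluation grid from -5 to 15
f, u = gaussian_kde([1.0, 2.0, 4.0, 7.0], xi)
print(u, max(f))
```

As with ksdensity, a wider window gives a smoother but flatter estimate.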

Linear Discriminant Analysis

Receiver operating characteristic curve

Linear Regression

  • Linear Regression - Wikipedia article
  • y = X \beta + \varepsilon
  • y = the regressand, dependent variable.
  • X = the design matrix. x_i are the regressors
  • each x_i has a corresponding \beta_i called its regression coefficient
  • \beta = a p-dimensional parameter vector of regression coefficients. In the case of a line, \beta_1 is the slope and \beta_0 is the y-intercept
  • DISTURBANCE TERM - \varepsilon - an unobserved random variable that adds noise to the linear relationship between the dependent variable and regressors.
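
For a single regressor the least-squares estimates have a closed form. A minimal sketch with made-up data; the function name is hypothetical, and the residuals returned are the estimates of the disturbance term:

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for y = beta0 + beta1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta1 = sxy / sxx                # slope
    beta0 = y_bar - beta1 * x_bar    # y-intercept
    residuals = [yi - (beta0 + beta1 * xi) for xi, yi in zip(x, y)]
    return beta0, beta1, residuals

beta0, beta1, residuals = fit_line([1, 2, 3, 4], [2, 4, 5, 8])
print(beta0, beta1, sum(residuals))
```

Because the model includes an intercept, the residuals sum to zero up to floating-point rounding.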

Regression Diagnostics

  • multicollinearity -> VIF
  • heteroscedasticity -> Scale-Location or Residual vs fitted
  • Outliers -> Residuals vs Leverage or Leverage vs Cook's D
  • Non-linearity -> Residual vs fitted
  • Residual distribution -> Q-Q Plot
  • Understanding Regression Diagnostic Plots
  • R: use ggfortify::autoplot

Eigen vector & Eigen Value

Eigen values and eigenvectors

Maximum Likelihood Estimate

Mixed and Multilevel Models

Set theory symbols

  • Set theory symbols
  • \varnothing: \varnothing, empty set
  • \mid: \mid, satisfies the condition
  • \cup: \cup, union
  • \cap: \cap, intersection
  • \setminus: \setminus, set difference
  • \triangle: \triangle, symmetric difference
  • \in: \in - left side element is in right side set
  • \cdot: \cdot, dot product, vector and matrix multiplication, scalar result
  • \times: \times, cross product of vectors
  • \otimes: \otimes, kronecker (outer) product of tensor (matrix)


  • p(C_k \mid \mathbf{x}) = \frac{p(C_k) \ p(\mathbf{x} \mid C_k)}{p(\mathbf{x})}, i.e. \text{posterior} = \frac{\text{prior} \times \text{likelihood}}{\text{evidence}}
  • In practice, only the numerator of that fraction is of interest, because the denominator does not depend on C and the feature values x_i are given, so the denominator is effectively constant.
  • The numerator is equivalent to the joint probability model
  • If we assume each feature is conditionally independent of every other, then the joint model can be expressed as

\begin{align}
p(C_k \mid x_1, \dots, x_n) & \varpropto p(C_k, x_1, \dots, x_n) \\
                            & = p(C_k) \ p(x_1 \mid C_k) \ p(x_2\mid C_k) \ p(x_3\mid C_k) \ \cdots \\
                            & = p(C_k) \prod_{i=1}^n p(x_i \mid C_k)\,.
\end{align}

  • A classifier combines the probability model with a decision rule, e.g. maximum a posteriori (pick the class with the highest posterior probability)
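
The product form and the maximum a posteriori rule can be sketched for categorical features. The function names and the toy weather data are made up; the sketch assumes add-one (Laplace) smoothing and sums log-probabilities instead of multiplying probabilities:

```python
from collections import Counter, defaultdict
from math import log

def train(rows, labels, alpha=1.0):
    """Count classes and per-class feature values; alpha is the smoothing count."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (class, feature index) -> value counts
    values = defaultdict(set)              # feature index -> values seen in training
    for row, c in zip(rows, labels):
        for i, v in enumerate(row):
            feature_counts[(c, i)][v] += 1
            values[i].add(v)
    return class_counts, feature_counts, values, alpha

def predict(model, row):
    """Return (MAP class, log of prior * likelihood for every class)."""
    class_counts, feature_counts, values, alpha = model
    n = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = log(n_c / n)  # log prior p(C_k)
        for i, v in enumerate(row):
            # smoothed estimate of p(x_i | C_k)
            score += log((feature_counts[(c, i)][v] + alpha)
                         / (n_c + alpha * len(values[i])))
        scores[c] = score
    return max(scores, key=scores.get), scores

rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
labels = ["no", "no", "yes", "yes"]
model = train(rows, labels)
print(predict(model, ("rainy", "cool"))[0])  # yes
```

The scores are unnormalized log posteriors: the evidence term is the same for every class, so it is never computed.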

Conditional probability

  • What is the probability that a given observation D belongs to a given class C, p(C \mid D)
  • "The probability of A under the condition B" p(A \mid B)
  • There need not be a causal relationship
  • Compare with UNconditional probability p(A)
  • If p(A \mid B) = p(A), then the events are independent: knowledge about either event gives no information on the other. Equivalently, P(A \cap B) = P(A) P(B).
  • Don't falsely equate p(A \mid B) and p(B \mid A)
  • Defined as the quotient of the joint probability of events A and B and the probability of B: P(A \mid B) = \frac{P(A \cap B)}{P(B)}, where the numerator is the probability that both events A and B occur and P(B) > 0.
  • Joint probability P(A \cap B) = P(A \mid B)P(B)
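
These definitions can be checked numerically on a small joint distribution. A minimal sketch; the probabilities and the helper name `marginal` are made up for illustration:

```python
# Hypothetical joint distribution over two binary events
joint = {
    ("A", "B"): 0.10,
    ("A", "not B"): 0.20,
    ("not A", "B"): 0.30,
    ("not A", "not B"): 0.40,
}

def marginal(joint, index, event):
    """Sum the joint probability over every outcome of the other variable."""
    return sum(p for key, p in joint.items() if key[index] == event)

p_a = marginal(joint, 0, "A")             # P(A)
p_b = marginal(joint, 1, "B")             # P(B)
p_a_given_b = joint[("A", "B")] / p_b     # P(A | B) = P(A and B) / P(B)
p_b_given_a = joint[("A", "B")] / p_a     # P(B | A) = P(A and B) / P(A)
independent = abs(joint[("A", "B")] - p_a * p_b) < 1e-12

print(p_a_given_b, p_b_given_a, independent)
```

Here P(A | B) and P(B | A) differ, and P(A | B) differs from P(A), so the two events are not independent.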


  • Compare vs. Frequentist
  • Naive Bayes youtube vid
  • Pros:
    • Easy and fast to predict a class of test dataset
    • When the independence assumption holds, a naive Bayes classifier performs well compared to other models
    • Performs well in the case of categorical input variables compared to numerical variables
  • Cons
    • zero frequency: a category never seen with a class in training gets probability zero, which wipes out the whole product (solved by smoothing techniques like Laplace estimation, i.e. adding 1 to every count)
    • Bad estimator - the predicted probabilities should not be taken too seriously
    • Assumption of independent predictors, which is almost never the case.
  • Applications
    • Credit scoring
    • Medical
    • Real time prediction
    • Multi-class predictions
    • Text classification, spam filtering, sentiment analysis
    • recommendation filtering
  • Gaussian naive bayes: assume continuous data has Gaussian distribution
  • The multinomial naive Bayes classifier becomes a linear classifier when expressed in log-space

Factor Analysis

  • number of variables too large
  • deviations or variation that is of most interest
  • reduce number of variables
  • consider linear combinations of the variables
  • keep the combos with large variance
  • discard the ones with small variance
  • latent variables explain the correlation between outcome variables
  • interpretability of factors is sometimes suspect
  • Used for exploratory data analysis
  • Rule of thumb: more than 10 observations per variable
  • Group variables into factors such that the variables within a factor are highly correlated
  • Use PCA to examine latent common factors (1st method)

Principal Component Analysis

  • Replace the original observed random variables with uncorrelated linear combinations that result in minimum loss of information.
  • factor loadings which represent