Statistics


General

  • Exploratory Data Analysis - branch of statistics emphasizing visuals, developed by John Tukey
  • Given eqn y = mx + b, the dependent variable is y, and the independent variable is x.
  • Statistical unit = one member of the set of entities being studied, e.g., one person in a population study, one image in a classification problem.
  • Conditional probability - the probability of some event A, given the occurrence of some other event B. Conditional probability is written P(A|B), and is read "the (conditional) probability of A, given B"
  • Joint probability is the probability of both events happening together. The joint probability of A and B is written \scriptstyle P(A \cap B), P(AB) or P(A, B)
  • Marginal probability is the unconditional probability of one event, obtained by summing the joint probability over all outcomes of the other event. For example, if there are two possible outcomes for X with corresponding events B and B', this means that \scriptstyle P(A) = P(A \cap B) + P(A \cap B').
  • column rank and row rank
  • degrees of freedom = the number of values in the final calculation that are free to vary.
  • residuals = for each observation, the residual is the difference between that observation and the average of all the observations.
    • the sum of the residuals is necessarily 0.
  • probability mass function (pmf) is for DISCRETE random variables
  • principle of indifference - assigns equal probabilities to all possibilities.
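
A small worked example of the joint, conditional, and marginal definitions above, in R (all numbers are hypothetical):

 # Hypothetical joint distribution of two events A and B as a 2x2 table
 joint <- matrix(c(0.10, 0.20,   # P(A and B),     P(A and not-B)
                   0.30, 0.40),  # P(not-A and B), P(not-A and not-B)
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("A", "notA"), c("B", "notB")))
 p_B <- sum(joint[, "B"])              # marginal: P(B) = P(A∩B) + P(A'∩B)
 p_A_given_B <- joint["A", "B"] / p_B  # conditional: P(A|B) = P(A∩B)/P(B)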

Error bars

null hypothesis

The null hypothesis typically corresponds to a general or default position. The practice of science involves formulating and testing hypotheses, assertions that are capable of being proven false using a test of observed data. The null hypothesis can never be proven. A set of data can only reject a null hypothesis or fail to reject it.

binomial probability

binomial distribution

  • p = probability that event will occur
  • q = probability that event won't occur
  • p and q are complementary: p + q = 1
  • n = number of trials
  • k = number of successes
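
A sketch of the binomial pmf in these symbols, using hypothetical numbers (n = 10 trials, p = 0.3, k = 4 successes):

 # P(K = k) = choose(n, k) * p^k * q^(n - k), with q = 1 - p
 n <- 10; p <- 0.3; k <- 4
 choose(n, k) * p^k * (1 - p)^(n - k)  # from the formula
 dbinom(k, size = n, prob = p)         # same value from the built-in pmf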

Binomial approximation

  • standard score = how many standard deviations an observation is above or below the mean.
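
A sketch of the normal approximation via the standard score (hypothetical numbers; the binomial has mean np and standard deviation sqrt(npq)):

 # z = (observation - mean) / sd, i.e. the observation in standard-deviation units
 n <- 100; p <- 0.5; k <- 60
 z <- (k - n * p) / sqrt(n * p * (1 - p))
 pnorm(z, lower.tail = FALSE)  # approximate P(K >= 60), ignoring continuity correction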

Tests for Categorical Data

  • Goodness of fit for a single categorical variable
    • Compare observed counts to expected counts; the per-category "contribution terms" are (observed - expected)^2 / expected
    • The test statistic sums these contributions, i.e. the relative distance of the observed from the expected, and has a chi-squared distribution
    • Get p-values from the chi-squared distribution with k - 1 deg freedom, where k = number of categories (i.e., classes)
    • If the null hypothesis is true, observed is close to expected
 proc freq data=<whatevs>;
   tables var1 / chisq;                 /* H0: equal proportions */
   tables var2 / chisq testp=(values);  /* H0: proportions given in testp= */
   tables var3 / chisq testf=(values);  /* H0: frequencies given in testf= */
 run;
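
A rough R equivalent of the SAS goodness-of-fit calls above (observed counts are hypothetical):

 obs <- c(30, 50, 20)                      # observed counts over k = 3 categories
 chisq.test(obs)                           # H0: equal proportions
 chisq.test(obs, p = c(0.25, 0.50, 0.25))  # H0: specified proportions (like testp=)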

Tests for two-way variables

  • test for homogeneity - distribution of proportions are the same across the populations
  • test of independence - are the two categorical variables independent of each other?
 proc freq data=<whatevs>;
   tables var1*var2 / chisq cellchi2;  /* two-way table with cell contributions */
   exact fisher;                       /* Fisher's exact test */
 run;
  • Use Fisher's exact test if the sample size is small (see the R sketch below)
    • R: fisher.test(table)
  • cellchi2 prints the cell contribution - how far the observed count is from the expected count on a per-cell basis
  • the weight statement names the variable holding the counts when the table is entered as pre-tabulated data
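
A rough R equivalent of the two-way tests above (counts are hypothetical):

 tab <- matrix(c(12,  5,
                  8, 15), nrow = 2, byrow = TRUE)  # 2x2 contingency table
 chisq.test(tab)   # test of independence/homogeneity; $expected gives per-cell expecteds
 fisher.test(tab)  # exact test, preferred when expected cell counts are small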

T-Test

  • "Student's t-distribution"
  • When data are normally distributed
  • Can test hypotheses about the mean/center of the distribution

One-sample t-test

  • Test whether the mean is greater than/less than/equal to some value
  • SAS: proc means
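
A minimal R sketch of the same test (data are hypothetical):

 x <- c(4.8, 5.3, 5.1, 4.6, 5.4, 5.0)        # hypothetical measurements
 t.test(x, mu = 5)                           # H0: mean equals 5 (two-sided)
 t.test(x, mu = 5, alternative = "greater")  # is the mean greater than 5?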

Two-Sample t-test

  • Test whether two population means are equal.
  • Unpaired or independent samples t-test: are the variances the same?
    • If no, it's called an "unequal variances t-test" or "Welch's t-test"
    • If yes, it's called a "pooled t-test" or "Student's t-test"
    • The F-statistic tests whether the variances are equal
  • Paired or repeated measurements t-test - the before and after observations are subtracted; is the mean difference different from zero?
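
The variants above, sketched in R on simulated data:

 set.seed(1)
 x <- rnorm(20, mean = 10, sd = 2)  # hypothetical group 1 / "before"
 y <- rnorm(20, mean = 11, sd = 3)  # hypothetical group 2 / "after"
 var.test(x, y)                     # F test: are the variances equal?
 t.test(x, y)                       # Welch's / unequal variances (R's default)
 t.test(x, y, var.equal = TRUE)     # pooled / Student's t-test
 t.test(x, y, paired = TRUE)        # paired: is the mean difference zero?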

Nonparametric tests

  • Hypothesis testing when you can't assume data comes from normal distribution
  • a lot of non-parametric approaches are based on ranks
  • do not depend on normality
  • Whereas the other tests are really tests for means, nonparametric tests are actually tests for medians

One-sample tests

  • SAS proc univariate for these
  • Sign test
    • Sign test is necessarily one-sample, so if you give the function call two samples, it will assume it's a paired dataset
    • PAIRED observations with test x > y, x = y, or x < y.
    • Test for consistent differences between pairs, such as the weight of subjects pre and post treatment.
    • Does one member of the pair tend to be greater than the other?
    • Does NOT assume symmetric distribution of the differences around the median
    • Does NOT use the magnitude of the difference
  • Wilcoxon Signed Ranks Test
    • A quantitative Sign Test
    • DOES use magnitude of difference of paired observations
  • Confidence interval based on signed rank test
    • what are the set of values for which you wouldn't have rejected the null hypothesis
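
The sign test and signed rank test above, sketched in R on hypothetical paired data:

 before <- c(72, 80, 65, 90, 77, 84)  # hypothetical pre-treatment weights
 after  <- c(70, 76, 66, 85, 71, 80)  # hypothetical post-treatment weights
 d <- after - before
 binom.test(sum(d > 0), sum(d != 0))  # sign test: uses only the signs of the differences
 wilcox.test(after, before, paired = TRUE, conf.int = TRUE)  # signed rank test, with CI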

Two or more sample nonparametric tests

  • Compare the centers of two or more distributions that are continuous, but not normal/gaussian.
  • use deviations from the median and use the signed ranks
  • SAS: proc npar1way wilcoxon
    • Class variable is used for the two or more groups
    • Otherwise use proc npar1way anova
  • Wilcoxon Rank Sum Test/Mann-Whitney U statistic
    • Null hypothesis: it is equally likely that a randomly selected value from one sample will be less than or greater than a randomly selected value from a second sample.
    • Nonparametric equivalent of the two-sample (unequal variances) t-test; see the R sketch after this list
    • R: wilcox.test
    • Intermix the observations, sort, and rank all observations. Then take mean rank for both populations.
    • Can also do confidence interval
  • Kruskal-Wallis
    • Non-parametric method for testing whether samples originate from the same distribution
    • equivalent to One-way ANOVA
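
Both tests sketched in R on hypothetical groups:

 g1 <- c(3.1, 2.8, 4.0, 3.6)
 g2 <- c(4.5, 4.1, 5.2, 3.9)
 g3 <- c(5.8, 5.1, 6.0, 4.9)
 wilcox.test(g1, g2, conf.int = TRUE)  # Wilcoxon rank sum / Mann-Whitney U, with CI
 kruskal.test(list(g1, g2, g3))        # Kruskal-Wallis across three groups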

Goodness of fit for continuous distributions

one sample

  • empirical cumulative distribution function, compare to theoretical
    • R: ecdf(data)
  • Kolmogorov-Smirnov
    • Not quite as sensitive, because the test statistic is just the maximum distance between the empirical and theoretical CDFs
  • Do not estimate the parameters of the theoretical distribution from the data being tested
  • R: ks.test(x, "pnorm", mean, sd) - pass the name of the theoretical CDF plus its parameters
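
A sketch of the one-sample comparison in R (simulated data; the theoretical parameters are fixed in advance, not estimated from x):

 x <- rnorm(50, mean = 2, sd = 1)       # simulated sample
 plot(ecdf(x))                          # empirical CDF
 curve(pnorm(x, 2, 1), add = TRUE)      # theoretical CDF overlaid
 ks.test(x, "pnorm", mean = 2, sd = 1)  # KS test against N(2, 1)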

Two-Sample

  • Could have two distributions with the same mean but different shapes.
  • R: ks.test(X, Y)
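
For example (simulated): two samples with the same center but different shapes, which a mean-based test could miss:

 x <- rnorm(100)       # simulated: normal
 y <- rt(100, df = 2)  # simulated: same center, much heavier tails
 ks.test(x, y)         # two-sample KS compares the whole distributions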

Estimating Parameter Values

  • R: MASS package, fitdistr(data, densfun="exponential")
    • obtains the maximum likelihood estimate of the distribution's parameters
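
Usage sketch (simulated waiting times):

 library(MASS)
 waits <- rexp(200, rate = 0.5)            # simulated data with true rate 0.5
 fitdistr(waits, densfun = "exponential")  # MLE of the rate, with standard error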

Kernel Smoothing Density Function

  • Matlab function
  • [f,xi,u] = ksdensity(x)
  • Computes a probability density estimate of the sample in the vector x
  • f is the vector of density values evaluated at the points xi.
  • u is the width of the kernel-smoothing window, which is chosen automatically.
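
A rough R equivalent of ksdensity, for comparison (stats::density; x is simulated):

 x <- rnorm(100)  # simulated sample
 d <- density(x)  # d$x = evaluation points, d$y = density estimates
 d$bw             # the kernel-smoothing bandwidth, chosen automatically
 plot(d)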

Linear Discriminant Analysis

Receiver operating characteristic curve

Linear Regression

  • Linear Regression - Wikipedia article
  • y = X beta + epsilon
  • y = the regressand, or dependent variable.
  • X = the design matrix; the columns x_i are the regressors
  • each regressor x_i has a corresponding coefficient beta_i; the constant term beta_0 is called the intercept
  • beta = a p-dimensional parameter vector of regression coefficients. In the case of a line, beta_1 is the slope and beta_0 is the y-intercept
  • DISTURBANCE TERM - epsilon - an unobserved random variable that adds noise to the linear relationship between the dependent variable and regressors.
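
A sketch of fitting y = X beta + epsilon by least squares in R (simulated data with known beta_0 = 2 and beta_1 = 3):

 set.seed(1)
 x <- runif(50)
 y <- 2 + 3 * x + rnorm(50, sd = 0.5)  # rnorm term plays the role of epsilon
 fit <- lm(y ~ x)
 coef(fit)     # estimates of the intercept beta_0 and slope beta_1
 summary(fit)  # standard errors, t-tests, R-squared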

Regression Diagnostics

  • multicollinearity -> VIF
  • heteroscedasticity -> Scale-Location or Residual vs fitted
  • Outliers -> Residuals vs Leverage or Leverage vs Cook's D
  • Non-linearity -> Residual vs fitted
  • Residual distribution -> Q-Q Plot
  • Understanding Regression Diagnostic Plots
  • R: use ggfortify::autoplot (see sketch below)
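
Sketch, assuming `fit` is an lm object (e.g., from the regression sketch above):

 library(ggfortify)
 autoplot(fit, which = 1:6)  # residuals vs fitted, Q-Q, scale-location,
                             # Cook's D, residuals vs leverage, Cook's D vs leverage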

Eigenvectors & Eigenvalues

Eigen values and eigenvectors


Maximum Likelihood Estimate

Mixed and Multilevel Models

Set theory symbols

  • Set theory symbols
  • ∅ : \varnothing, empty set
  • ∣ : \mid, satisfies the condition
  • ∪ : \cup, union
  • ∩ : \cap, intersection
  • ∖ : \setminus, set difference
  • △ : \triangle, symmetric difference
  • ∈ : \in, left side element is in right side set
  • ⋅ : \cdot, dot product, vector and matrix multiplication, scalar result
  • × : \times, cross product of vectors
  • ⊗ : \otimes, Kronecker (outer) product of tensor (matrix)

Bayes

  • In practice there is only interest in the numerator of the fraction \scriptstyle p(C \mid x_1, \dots, x_n) = p(C)\, p(x_1, \dots, x_n \mid C) / p(x_1, \dots, x_n), because the denominator does not depend on C, and the values of the features are given, so the denominator is effectively constant.
  • The numerator is equivalent to the joint probability model \scriptstyle p(C, x_1, \dots, x_n)
  • If we assume each feature is conditionally independent of every other feature given the class, then the joint model can be expressed as \scriptstyle p(C) \prod_i p(x_i \mid C)

  • Classifier combines the probability model with a decision rule, e.g. maximum a posteriori (MAP): choose the class with the highest posterior probability
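
A minimal Gaussian naive Bayes sketch in R illustrating the model plus the MAP decision rule (the function names and the `class` column are hypothetical, not from any library):

 # Train: per-class priors plus per-feature Gaussian parameters
 nb_train <- function(train, class_col = "class") {
   feats <- setdiff(names(train), class_col)
   by_class <- split(train[feats], train[[class_col]])
   list(prior = table(train[[class_col]]) / nrow(train),
        mu    = lapply(by_class, function(d) sapply(d, mean)),
        sd    = lapply(by_class, function(d) sapply(d, sd)),
        feats = feats)
 }
 # Predict: log-space avoids underflow; the argmax is the MAP decision rule
 nb_predict <- function(model, x) {
   scores <- sapply(names(model$prior), function(cl)
     log(model$prior[[cl]]) +
       sum(dnorm(as.numeric(x[model$feats]),
                 model$mu[[cl]], model$sd[[cl]], log = TRUE)))
   names(which.max(scores))
 }

Usage: call nb_train on a data frame whose numeric columns are features, then nb_predict(model, df[i, ]) for one row.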

Conditional probability

  • What is the probability that a given observation D belongs to a given class C: P(C | D)
  • "The probability of A under the condition B"
  • There need not be a causal relationship
  • Compare with UNconditional probability P(A)
  • If P(A | B) = P(A), then the events are independent: knowledge about either event does not give information on the other. Otherwise P(A | B) ≠ P(A).
  • Don't falsely equate P(A | B) and P(B | A)
  • Defined as the quotient of the joint probability of events A and B and the probability of B: \scriptstyle P(A \mid B) = P(A \cap B) / P(B), where the numerator is the probability that both events A and B occur.
  • Joint probability

General

  • Compare Bayesian vs. frequentist approaches
  • Naive Bayes youtube vid
  • Pros:
    • Easy and fast to predict the class of a test dataset
    • Naive Bayes performs better than other models when the independence assumption actually holds
    • Performs well in the case of categorical input variables compared to numerical variables
  • Cons
    • zero frequency problem (solved by smoothing techniques like Laplace estimation: add 1 to every count so no probability is ever zero)
    • Bad estimator - the predicted probabilities themselves should not be taken too seriously
    • Assumption of independent predictors, which is almost never the case.
  • Applications
    • Credit scoring
    • Medical
    • Real time prediction
    • Multi-class predictions
    • Text classification, spam filtering, sentiment analysis
    • recommendation filtering
  • Gaussian naive Bayes: assumes continuous data has a Gaussian distribution
  • The multinomial naive Bayes classifier becomes a linear classifier when expressed in log-space
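
In symbols (standard multinomial result; the bias/weight names b_k and w_k are notation introduced here for illustration):

 \scriptstyle \log p(C_k \mid x_1, \dots, x_n) \propto \log p(C_k) + \sum_i x_i \log p_{ki} = b_k + w_k^\top x

where p_{ki} is the probability of feature i under class k, so each class gets a linear score in x.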

Factor Analysis

  • number of variables too large
  • deviations or variation that is of most interest
  • reduce number of variables
  • consider linear combinations of the variables
  • keep the combos with large variance
  • discard the ones with small variance
  • latent variables explain the correlation between outcome variables
  • interpretability of factors is sometimes suspect
  • Used for exploratory data analysis
  • >10 obs per variable
  • Group variables into factors such that the variables within each factor are highly correlated
  • Use PCA to examine latent common factors (1st method)
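
A sketch with base R's factor analysis tools (dat is a hypothetical all-numeric data frame):

 fa <- factanal(dat, factors = 2, rotation = "varimax")  # 2 latent factors, illustrative
 print(fa$loadings, cutoff = 0.3)  # which observed variables group under which factor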

Principal Component Analysis

  • Replace the original observed random variables with uncorrelated linear combinations, resulting in minimum loss of information.
  • factor loadings, which represent the contribution of each original variable to a principal component
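
A PCA sketch in R (dat is a hypothetical numeric data frame; scaling first is usual when variables have different units):

 pca <- prcomp(dat, center = TRUE, scale. = TRUE)
 summary(pca)         # proportion of variance kept by each component
 pca$rotation[, 1:2]  # loadings of the first two components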