Statistics

==Bayes==
* <math>p(C_k \mid \mathbf{x}) = \frac{p(C_k) \ p(\mathbf{x} \mid C_k)}{p(\mathbf{x})}</math>, i.e. <math>\text{posterior} = \frac{\text{prior} \times \text{likelihood}}{\text{evidence}}</math>
* [[Bayesian Data Analysis]]
* In practice, only the numerator of that fraction is of interest: the denominator does not depend on <math>C</math>, and the feature values <math>x_i</math> are given, so the denominator is effectively constant.
 
* The numerator is equivalent to the joint probability model <math>p(C_k, x_1, \dots, x_n)</math>
 
* If we assume each feature is conditionally independent of every other feature given the class, then the joint model can be expressed as
 
 
 
<math>
\begin{align}
p(C_k \mid x_1, \dots, x_n) & \varpropto p(C_k, x_1, \dots, x_n) \\
                            & = p(C_k) \ p(x_1 \mid C_k) \ p(x_2 \mid C_k) \ p(x_3 \mid C_k) \cdots \\
                            & = p(C_k) \prod_{i=1}^n p(x_i \mid C_k)
\end{align}
</math>
 
* Classifier combines the probability model with a decision rule, e.g. pick the most probable class (maximum a posteriori, MAP)
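
For concreteness, a minimal Gaussian naive Bayes sketch in R (the iris data set bundled with R is used purely for illustration, and predict_nb is an ad-hoc helper name):

 ## Gaussian naive Bayes by hand on the built-in iris data:
 ## estimate per-class priors and per-feature normal densities, then apply the MAP rule.
 train  <- iris[, 1:4]
 labels <- iris$Species
 means  <- aggregate(train, by = list(class = labels), FUN = mean)
 sds    <- aggregate(train, by = list(class = labels), FUN = sd)
 
 predict_nb <- function(x) {
   scores <- sapply(levels(labels), function(k) {
     mu    <- unlist(means[means$class == k, -1])   # per-feature means for class k
     sigma <- unlist(sds[sds$class == k, -1])       # per-feature standard deviations for class k
     prior <- mean(labels == k)                     # p(C_k)
     # log p(C_k) + sum_i log p(x_i | C_k): the log of the numerator above
     log(prior) + sum(dnorm(unlist(x), mu, sigma, log = TRUE))
   })
   names(which.max(scores))                         # MAP rule: return the most probable class
 }
 
 predict_nb(iris[1, 1:4])   # expected: "setosa"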
 
===Conditional probability===
 
* What is the probability that a given observation D belongs to a given class C, <math>p(C \mid D)</math>
 
* "The probability of A under the condition B" <math>p(A \mid B)</math>
 
* There need not be a causal relationship
 
* Compare with UNconditional probability <math>p(A)</math>
 
* If <math>p(A \mid B) = p( A )</math>, then the events are independent: knowledge about either event gives no information about the other. Equivalently, <math>P(A \cap B) = P(A)\,P(B).</math>
 
* Don't falsely equate <math>p(A \mid B)</math> and <math>p(B \mid A)</math>
 
* Defined as the quotient of the joint probability of events A and B and the probability of B: <math>P(A \mid B) = \frac{P(A \cap B)}{P(B)},</math> where the numerator is the probability that both events A and B occur.
 
* Joint probability <math>P(A \cap B) = P(A \mid B)P(B)</math>
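
A worked example: roll a fair six-sided die and let A = "the roll is even" and B = "the roll is at most 3". Then

<math>
P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{P(\{2\})}{P(\{1,2,3\})} = \frac{1/6}{1/2} = \frac{1}{3},
</math>

which differs from <math>P(A) = 1/2</math>, so A and B are not independent.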
 
===General===
 
* Compare vs. Frequentist
 
* [https://www.youtube.com/watch?v=CPqOCI0ahss Naive Bayes youtube vid]
 
* Pros:

** Easy and fast to predict the class of a test data set

** Performs well relative to more complex models when the conditional-independence assumption approximately holds

** Performs well with categorical input variables compared to numerical variables
 
* Cons:

** Zero frequency: a feature value never observed with a class in training gets an estimated probability of zero, which wipes out the whole product (solved by smoothing techniques such as Laplace / add-one smoothing)

** Bad estimator: the predicted class probabilities should not be taken too seriously

** Assumes independent predictors, which is almost never the case in practice
 
* Applications
 
** Credit scoring
 
** Medical
 
** Real time prediction
 
** Multi-class predictions
 
** Text classification, spam filtering, sentiment analysis
 
** Recommendation systems
 
* Gaussian naive Bayes: assumes each continuous feature follows a Gaussian distribution within each class
 
* The multinomial naive Bayes classifier becomes a linear classifier when expressed in log space (see the derivation below)
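
Sketch of the derivation: for multinomial count features <math>x_i</math>, <math>p(\mathbf{x} \mid C_k) \propto \prod_i p_{ki}^{x_i}</math> with <math>p_{ki} = p(\text{feature } i \mid C_k)</math>, up to a factor that does not depend on the class, so the log of the posterior numerator is

<math>
\log p(C_k) + \sum_{i=1}^n x_i \log p_{ki} = b_k + \mathbf{w}_k^\top \mathbf{x},
</math>

which is linear in <math>\mathbf{x}</math>, with intercept <math>b_k = \log p(C_k)</math> and weights <math>w_{ki} = \log p_{ki}</math>.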
 
  
 


General

  • Test MathJax on this page:
  • Exploratory Data Analysis - branch of statistics emphasizing visuals, developed by John Tukey
  • Given eqn y = mx + b, the dependent variable is y, and the independent variable is x.
  • Statistical unit = one member of entities being studied. One person in population study, one image in classification problem.
  • Conditional probability - the probability of some event A, given the occurrence of some other event B. Conditional probability is written P(A|B), and is read "the (conditional) probability of A, given B"
  • Joint probability is the probability of both events happening together. The joint probability of A and B is written P(A ∩ B), P(AB) or P(A, B)
  • Marginal probability is the unconditional probability of an event, obtained by summing the joint probability over the outcomes of the other variable. For example, if there are two possible outcomes for X with corresponding events B and B', then P(A) = P(A ∩ B) + P(A ∩ B') (see the short R example after this list).
  • column rank and row rank
  • degrees of freedom = the number of values in the final calculation that are free to vary.
  • residuals = for each observation, the difference between the observed value and the value the model predicts (in the simplest case, the average of all the observations).
    • the sum of the residuals is necessarily 0 when the model includes a mean/intercept term.
  • probability mass function = pmf, used for DISCRETE random variables
  • principle of indifference, which assigns equal probabilities to all possibilities.
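
A quick R illustration of the joint / marginal / conditional definitions above (the table values are made up for the example):

 ## Joint distribution of two binary events A and B (numbers sum to 1)
 joint <- matrix(c(0.10, 0.20,
                   0.30, 0.40),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(A = c("A", "not A"), B = c("B", "not B")))
 p_A         <- sum(joint["A", ])        # marginal:    P(A) = P(A, B) + P(A, not B) = 0.30
 p_B         <- sum(joint[, "B"])        # marginal:    P(B) = 0.40
 p_A_given_B <- joint["A", "B"] / p_B    # conditional: P(A|B) = 0.10 / 0.40 = 0.25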

Error bars

null hypothesis

The null hypothesis typically corresponds to a general or default position. The practice of science involves formulating and testing hypotheses, assertions that are capable of being proven false using a test of observed data. The null hypothesis can never be proven. A set of data can only reject a null hypothesis or fail to reject it.

binomial probability

binomial distribution

  • p = probability that event will occur
  • q = probability that event won't occur
  • p and q are complementary = p + q = 1
  • n = number of trials
  • k = number of successes
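
For example, the probability of exactly k successes in n trials is C(n, k) p^k q^(n-k); a quick check in R (arbitrary numbers):

 n <- 10; p <- 0.3; k <- 4
 choose(n, k) * p^k * (1 - p)^(n - k)   # binomial pmf computed by hand
 dbinom(k, size = n, prob = p)          # same value from the built-in pmf
 pbinom(k, size = n, prob = p)          # P(X <= k), the cumulative probability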

Binomial approximation

  • standard score = how many standard deviations an observation is above or below the mean.
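
In R (illustrative values):

 x <- c(12, 15, 9, 20, 14)
 (x - mean(x)) / sd(x)   # standard (z-) scores: deviations from the mean in sd units
 scale(x)                # built-in equivalent (returns a matrix with centering/scaling attributes)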

Tests for Categorical Data

  • Goodness of fit for single categorical variable
    • compare observed counts to expected counts; each category's "contribution term" is (observed - expected)^2 / expected
    • the test statistic sums these contributions, a relative measure of how far the observed counts are from the expected counts
    • Get p-values from a chi-squared distribution with k - 1 degrees of freedom, where k = number of categories (i.e., classes)
    • If the null hypothesis is true, observed counts are close to expected counts
    • Test statistic has a chi-squared distribution.
 proc freq data=<whatevs>;
    tables var1 / chisq;                  /* chi-squared goodness-of-fit test */
    tables var2 / chisq testp=(values);   /* testp= hypothesized percents; testf=(values) for frequencies */
 run;
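
The same goodness-of-fit test in base R (made-up counts and hypothesized proportions):

 observed <- c(a = 30, b = 50, c = 20)           # observed counts per category
 chisq.test(observed, p = c(0.25, 0.50, 0.25))   # compare with the hypothesized proportions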

Tests for two-way variables

  • test for homogeneity - the distribution of proportions is the same across the populations
  • test of independence - whether the row and column variables are associated or independent
 proc freq data=<whatevs>;
    tables var1*var2 / chisq fisher;      /* chi-squared test plus Fisher's exact test */
    tables var1*var2 / chisq cellchi2;    /* cellchi2 prints each cell's contribution */
 run;
  • Use Fisher's exact test if the sample size is small (e.g., expected cell counts below 5).
    • R: fisher.test(table)
  • cellchi2 gives each cell's contribution - how far the observed count is from the expected count on a per-cell basis
  • the weight statement names the variable holding the cell counts (used when the data are already aggregated)
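
The corresponding tests in R, given a two-way table of counts (made-up numbers):

 tab <- matrix(c(12,  5,
                  8, 15),
               nrow = 2, byrow = TRUE,
               dimnames = list(group = c("g1", "g2"), outcome = c("yes", "no")))
 chisq.test(tab)    # chi-squared test of independence
 fisher.test(tab)   # Fisher's exact test, preferred when cell counts are small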

T-Test

  • "Student's t-distribution"
  • When data are normally distributed
  • Can test hypotheses about the mean/center of the distribution

One-sample t-test

  • Test whether the mean is greater than / less than / equal to some hypothesized value
  • SAS proc means
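
In R (made-up data; the hypothesized mean of 5 is arbitrary):

 x <- c(5.1, 4.9, 6.2, 5.5, 4.7, 5.8)
 t.test(x, mu = 5)                            # two-sided test of H0: mean = 5
 t.test(x, mu = 5, alternative = "greater")   # one-sided version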

Two-Sample t-test

  • Test whether two population means are equal.
  • Unpaired or independent samples t-test: Are the variances the same?
    • If no, it's called a "two-sample t-test", "unequal variances t-test", or "Welch's t-test"
    • If yes it's called a "pooled t-test" or "Student's t-test"
    • F-statistic tests whether the variances are equal
  • Paired or repeated-measurements t-test - each subject's before and after observations are subtracted; is the mean difference different from zero?
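
In R (x and y are made-up samples):

 x <- c(20, 22, 19, 24, 25, 21)
 y <- c(23, 27, 25, 26, 24, 28)
 t.test(x, y)                     # Welch's / unequal-variances t-test (the default)
 t.test(x, y, var.equal = TRUE)   # pooled / Student's t-test
 var.test(x, y)                   # F-test that the two variances are equal
 t.test(x, y, paired = TRUE)      # paired t-test (x and y measured on the same subjects)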

Nonparametric tests

  • Hypothesis testing when you can't assume data comes from normal distribution
  • a lot of non-parametric approaches are based on ranks
  • do not depend on normality
  • Whereas the other tests are really tests for means, nonparametric tests are actually tests for medians

One-sample tests

  • SAS proc univariate for these
  • Sign test
    • The sign test is necessarily one-sample, so if you pass the function two samples it assumes they are paired
    • PAIRED observations with test x > y, x = y, or x < y.
    • Test for consistent differences between pairs, such as the weight of subjects pre and post treatment.
    • Does one member of the pair tend to be greater than the other?
    • Does NOT assume symmetric distribution of the differences around the median
    • Does NOT use the magnitude of the difference
  • Wilcoxon Signed Ranks Test
    • A quantitative Sign Test
    • DOES use magnitude of difference of paired observations
  • Confidence interval based on signed rank test
    • the set of parameter values for which you would not have rejected the null hypothesis
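
In R (made-up before/after values; base R has no dedicated sign-test function, so binom.test on the count of positive differences serves as the sign test):

 before <- c(80, 75, 90, 68, 72, 85, 77)
 after  <- c(78, 70, 88, 69, 65, 80, 76)
 d <- after - before
 binom.test(sum(d > 0), sum(d != 0))                          # sign test: uses only the signs
 wilcox.test(after, before, paired = TRUE)                    # Wilcoxon signed-rank: uses magnitudes
 wilcox.test(after, before, paired = TRUE, conf.int = TRUE)   # confidence interval from signed ranks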

Two or more sample nonparametric tests

  • Compare the centers of two or more distributions that are continuous, but not normal/gaussian.
  • use deviations from the median and use the signed ranks
  • SAS: proc npar1way wilcoxon
    • Class variable is used for the two or more groups
    • Otherwise use proc npar1way anova
  • Wilcoxon Rank Sum Test/Mann-Whitney U statistic
    • Null hypothesis: it is equally likely that a randomly selected value from one sample will be less than or greater than a randomly selected value from a second sample.
    • Equivalent of unequal variances t-test
    • R: wilcox.test
    • Intermix the observations, sort, and rank all observations. Then take mean rank for both populations.
    • Can also do confidence interval
  • Kruskal-Wallis
    • Non-parametric method for testing whether samples originate from the same distribution
    • equivalent to One-way ANOVA
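
In R (made-up samples and groups):

 x <- c(3.1, 2.8, 4.0, 3.6, 2.9)
 y <- c(4.5, 5.1, 4.8, 3.9, 5.4)
 wilcox.test(x, y, conf.int = TRUE)   # Wilcoxon rank-sum / Mann-Whitney U test, with CI

 values <- c(x, y, 6.0, 5.8, 6.3, 5.5, 6.1)
 groups <- factor(rep(c("a", "b", "c"), each = 5))
 kruskal.test(values ~ groups)        # Kruskal-Wallis test across three groups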

Goodness of fit for continuous distributions

one sample

  • empirical cumulative distribution function, compare to theoretical
    • R: ecdf(data)
  • Kolmogorov-Smirnov
    • Not quite as good, because the statistic is just the maximum distance between the empirical and theoretical CDFs
  • Do not estimate the distribution's parameters from the same data being tested (that invalidates the p-values)
  • R: ks.test(x, y = "pnorm"), where y names the theoretical CDF

Two-Sample

  • Could have two distributions with the same mean but different shapes.
  • R: ks.test(X, Y)
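
A quick R illustration (simulated data, so exact results depend on the seed):

 set.seed(1)
 x <- rnorm(100)
 plot(ecdf(x))                               # empirical CDF of the sample
 ks.test(x, y = "pnorm", mean = 0, sd = 1)   # one-sample KS test against a fully specified normal
 y <- rnorm(100, mean = 0.5)
 ks.test(x, y)                               # two-sample KS test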

Estimating Parameter Values

  • R: MASS package, fitdistr(data, densfun="exponential")
    • obtain maximum likelihood estimate
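
For example (simulated data; MASS ships with R):

 library(MASS)
 set.seed(1)
 x <- rexp(200, rate = 2)
 fitdistr(x, densfun = "exponential")   # ML estimate of the rate (true value here is 2)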

Kernel Smoothing Density Function

  • Matlab function
  • [f,xi,u] = ksdensity(x)
  • Computes a probability density estimate of the sample in the vector x
  • f is the vector of density values evaluated at the points xi.
  • u is the width of the kernel-smoothing window (the bandwidth), which is calculated automatically.
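
The analogous call in R is the built-in density() function:

 x <- rnorm(200)   # simulated sample
 d <- density(x)   # kernel density estimate; the bandwidth is chosen automatically
 d$bw              # the selected bandwidth (plays the role of u above)
 plot(d)           # plot the estimated density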

Linear Discriminant Analysis

Receiver operating characteristic curve

Linear Regression

  • Linear Regression - Wikipedia article
  • y = X beta + epsilon
  • y = the regressand, dependent variable.
  • X = the design matrix. x sub i are regressors
  • each x sub i has a corresponding coefficient beta sub i; the coefficient on the constant regressor is the intercept
  • beta = a p-dimensional parameter vector of regression coefficients. In the case of a line, beta_1 is the slope and beta_0 is the y-intercept
  • DISTURBANCE TERM - epsilon - an unobserved random variable that adds noise to the linear relationship between the dependent variable and regressors.
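
A minimal fit in R (simulated data, so the true intercept and slope are known):

 set.seed(1)
 x <- runif(100, 0, 10)
 y <- 2 + 3 * x + rnorm(100)   # true intercept 2, slope 3, plus noise (the disturbance term)
 fit <- lm(y ~ x)              # least-squares estimates of beta_0 and beta_1
 coef(fit)                     # fitted intercept and slope
 summary(fit)                  # coefficients, standard errors, R-squared, etc.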

Regression Diagnostics

  • multicollinearity -> VIF
  • heteroscedasticity -> Scale-Location or Residual vs fitted
  • Outliers -> Residuals vs Leverage or Leverage vs Cook's D
  • Non-linearity -> Residual vs fitted
  • Residual distribution -> Q-Q Plot
  • Understanding Regression Diagnostic Plots
  • R: Use ggfortify::autoplot
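
Base R draws the four standard diagnostic plots directly from a fitted model (the simulated regression above is re-created here so the snippet stands alone); for multicollinearity, car::vif() applies when there are two or more predictors:

 set.seed(1)
 x <- runif(100, 0, 10); y <- 2 + 3 * x + rnorm(100)
 fit <- lm(y ~ x)
 par(mfrow = c(2, 2))   # 2 x 2 grid for the four plots
 plot(fit)              # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage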

Eigen vector & Eigen Value

Eigen values and eigenvectors
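
In R (a small symmetric matrix chosen for the example):

 A <- matrix(c(2, 1,
               1, 2), nrow = 2)
 e <- eigen(A)
 e$values    # eigenvalues (3 and 1 for this matrix)
 e$vectors   # corresponding eigenvectors, one per column
 A %*% e$vectors[, 1] - e$values[1] * e$vectors[, 1]   # ~ zero vector, since A v = lambda v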


Maximum Likelihood Estimate

Mixed and Multilevel Models

Set theory symbols

  • Set theory symbols
  • \varnothing, empty set
  • \mid, satisfies the condition ("such that")
  • \cup, union
  • \cap, intersection
  • \setminus, set difference
  • \triangle, symmetric difference
  • \in - left-side element is in the right-side set
  • \cdot, dot product, vector and matrix multiplication, scalar result
  • \times, cross product of vectors
  • \otimes, Kronecker (outer) product of a tensor (matrix)


Factor Analysis

  • number of variables too large
  • deviations or variation that is of most interest
  • reduce number of variables
  • consider linear combinations of the variables
  • keep the combos with large variance
  • discard the ones with small variance
  • latent variables explain the correlation between outcome variables
  • interpretability of factors is sometimes suspect
  • Used for exploratory data analysis
  • >10 obs per variable
  • Group variables into factors such that the variables are highly correlated
  • Use PCA to examine latent common factors (1st method)

Principal Component Analysis

  • Replace original observed random variables with uncorrelated linear combinations, resulting in minimum loss of information.
  • factor loadings represent how strongly each original variable contributes to (correlates with) each component
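
The corresponding PCA sketch in R (same mtcars columns, scaled so high-variance variables don't dominate):

 vars <- mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]
 pc <- prcomp(vars, scale. = TRUE)
 summary(pc)    # proportion of variance explained by each principal component
 pc$rotation    # loadings: the weight of each original variable in each component
 pc$x[, 1:2]    # the data projected onto the first two components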