# Statistics

## General

• Test MathJax on this page: $y=\sum_{i=1}^n x \delta \bar{Xfc}$
• Exploratory Data Analysis - branch of statistics emphasizing visuals, developed by John Tukey
• Given the equation y = mx + b, the dependent variable is y, and the independent variable is x.
• Statistical unit = one member of entities being studied. One person in population study, one image in classification problem.
• Conditional probability - the probability of some event A, given the occurrence of some other event B. Conditional probability is written P(A|B), and is read "the (conditional) probability of A, given B"
• Joint probability is the probability of both events happening together. The joint probability of A and B is written $P(A \cap B)$, P(AB) or P(A, B)
• Marginal probability is the unconditional probability of an event, obtained by summing the joint probability over the outcomes of the other variable. For example, if there are two possible outcomes for X with corresponding events B and B', this means that $P(A) = P(A \cap B) + P(A \cap B')$.
• column rank and row rank
• degrees of freedom = the number of values in the final calculation that are free to vary.
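• residuals = for each observation, the residual is the difference between that observation and its fitted value. When the fitted value is simply the average of all the observations, the residual is the deviation from the mean.
• the sum of these residuals is necessarily 0 (this also holds for a least-squares fit that includes an intercept).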
• probability mass function = pmf is for DISCRETE random variables
• principle of indifference, which assigns equal probabilities to all possibilities.

## null hypothesis

The null hypothesis typically corresponds to a general or default position. The practice of science involves formulating and testing hypotheses, assertions that are capable of being proven false using a test of observed data. The null hypothesis can never be proven. A set of data can only reject a null hypothesis or fail to reject it.

## binomial probability

### binomial distribution

• p = probability that event will occur
• q = probability that event won't occur
• p and q are complementary: p + q = 1
• n = number of trials
• k = number of successes
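
The quantities above combine into the binomial probability mass function $P(X = k) = \binom{n}{k} p^k q^{n-k}$. A minimal sketch in Python using only the standard library; the function name `binomial_pmf` and the coin-flip numbers are illustrative assumptions, not part of the original notes.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    q = 1.0 - p  # probability that the event does not occur
    return comb(n, k) * p**k * q**(n - k)

# Illustrative example: exactly 3 heads in 5 fair coin flips.
prob = binomial_pmf(3, 5, 0.5)
print(prob)  # 0.3125

# Sanity check: the pmf sums to 1 over k = 0..n.
total = sum(binomial_pmf(k, 5, 0.5) for k in range(6))
```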

### Binomial approximation

• standard score = how many standard deviations an observation is above or below the mean.
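
For a binomial count the mean is $np$ and the standard deviation is $\sqrt{npq}$, so the standard score of an observed count k is $z = (k - np)/\sqrt{npq}$. The sketch below assumes this normal approximation is adequate (n reasonably large, p not extreme); the function name and the numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def binomial_z_score(k: float, n: int, p: float) -> float:
    """Standard score of k successes under a Binomial(n, p) model."""
    mean = n * p
    sd = sqrt(n * p * (1 - p))
    return (k - mean) / sd

# Illustrative example: 60 successes in 100 trials with p = 0.5.
z = binomial_z_score(60, 100, 0.5)
print(z)  # 2.0, i.e. two standard deviations above the mean

# Normal approximation to P(X <= 60), using a continuity correction of 0.5.
approx = NormalDist().cdf(binomial_z_score(60.5, 100, 0.5))
```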

## Tests for Categorical Data

• Goodness of fit for single categorical variable
• compare observed counts to the expected counts; each category has a "contribution term" $(O - E)^2 / E$
• The contribution terms measure the relative distance of the observed counts from the expected counts
• Get p-values from the chi-squared distribution with k-1 degrees of freedom, where k = number of categories (i.e., classes)
• If the null hypothesis is true, observed is close to expected
• The test statistic (the sum of the contribution terms) has a chi-squared distribution.
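 proc freq data=<whatevs>;
 tables var1 / chisq;                 /* equal expected proportions */
 tables var2 / chisq testp=(values);  /* or testf=(values) to give expected frequencies instead of proportions */
 run;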
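
The goodness-of-fit calculation described above can be sketched in plain Python. The function name and the die-roll counts are hypothetical, and only the statistic and degrees of freedom are computed here; a p-value would need a chi-squared distribution function from a statistics library.

```python
def chi_square_gof(observed, expected_props):
    """Chi-squared goodness-of-fit statistic for one categorical variable."""
    total = sum(observed)
    expected = [total * p for p in expected_props]
    # Per-category contribution terms: (observed - expected)^2 / expected
    contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
    df = len(observed) - 1  # k - 1 degrees of freedom
    return sum(contributions), df, contributions

# Hypothetical counts from 60 rolls of a die, tested against a fair die.
observed = [8, 12, 9, 11, 6, 14]
stat, df, contributions = chi_square_gof(observed, [1 / 6] * 6)
print(stat, df)  # statistic is about 4.2 with 5 degrees of freedom
```

A statistic of about 4.2 is well below 11.07, the 5% critical value for 5 degrees of freedom, so these made-up counts would not reject the fair-die hypothesis.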

### Tests for two-way variables

• test for homogeneity - the distribution of proportions is the same across the populations
• test of independence - whether two categorical variables measured on the same population are associated
 proc freq data=<whatevs>;
 tables var1*var2 / chisq fisher;    /* fisher (or exact) requests Fisher's exact test */
 tables var1*var2 / chisq cellchi2;  /* testf=(values) applies to one-way tables only */
 run;
• Use Fisher's exact test if the sample size is small.
• R: fisher.test(table)
• cellchi2 is the cell contribution: how far the observed count is from the expected count on a per-cell basis
• the weight statement names the variable that holds the cell counts when the data are already tabulated
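
For a two-way table the expected count in each cell under independence is (row total × column total) / grand total, and the per-cell contributions are what `cellchi2` reports. A minimal sketch with a hypothetical 2×2 table; the function name is an assumption and no p-value is computed.

```python
def chi_square_independence(table):
    """Chi-squared statistic and per-cell contributions for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    cell_chi2 = []
    for row, r in zip(table, row_totals):
        contrib_row = []
        for obs, c in zip(row, col_totals):
            expected = r * c / n  # expected count under independence
            contrib_row.append((obs - expected) ** 2 / expected)
        cell_chi2.append(contrib_row)
    stat = sum(sum(row) for row in cell_chi2)
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df, cell_chi2

# Hypothetical counts: rows are two groups, columns are two outcomes.
table = [[20, 30], [30, 20]]
stat, df, cell_chi2 = chi_square_independence(table)
print(stat, df)  # 4.0 1
```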

## T-Test

• "Student's t-distribution"
• When data are normally distributed
• Can test hypotheses about the mean/center of the distribution

### One-sample t-test

• Tests whether the mean is greater than, less than, or equal to some value
• SAS proc means

### Two-Sample t-test

• Test whether two population means are equal.
• Unpaired or independent samples t-test: are the variances the same?
• If no, it's called an "unequal variances t-test" or "Welch's t-test"
• If yes, it's called a "pooled t-test" or "Student's t-test"
• An F-statistic tests whether the variances are equal
• Paired or repeated measurements t-test: subtract the before and after observations within each pair, then test whether the mean difference differs from zero.
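
The three variants above differ only in how the standard error and degrees of freedom are formed. The sketch below computes the t statistics with the Python standard library; function names and data are hypothetical, and turning a statistic into a p-value would need a t-distribution function from a statistics library.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(x, y):
    """Student's (pooled) t statistic, assuming equal variances."""
    nx, ny = len(x), len(y)
    sp2 = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    t = (mean(x) - mean(y)) / sqrt(sp2 * (1 / nx + 1 / ny))
    return t, nx + ny - 2

def welch_t(x, y):
    """Welch's t statistic with Welch-Satterthwaite degrees of freedom."""
    nx, ny = len(x), len(y)
    vx, vy = variance(x) / nx, variance(y) / ny
    t = (mean(x) - mean(y)) / sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx**2 / (nx - 1) + vy**2 / (ny - 1))
    return t, df

def paired_t(before, after):
    """Paired t statistic: one-sample test on the within-pair differences."""
    d = [a - b for b, a in zip(before, after)]
    n = len(d)
    return mean(d) / sqrt(variance(d) / n), n - 1

# Hypothetical data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]
t_pooled, df_pooled = pooled_t(x, y)
t_welch, df_welch = welch_t(x, y)
t_paired, df_paired = paired_t(x, y)
```

With equal sample sizes the pooled and Welch statistics coincide; only the degrees of freedom differ.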

## Nonparametric tests

• Hypothesis testing when you can't assume data comes from normal distribution
• a lot of non-parametric approaches are based on ranks
• do not depend on normality
• Whereas the other tests are really tests for means, nonparametric tests are actually tests for medians

### One-sample tests

• SAS proc univariate for these
• Sign test
• Sign test is necessarily one-sample, so if you give the function call two samples, it will assume it's a paired dataset
• PAIRED observations with test x > y, x = y, or x < y.
• Test for consistent differences between pairs, such as the weight of subjects pre and post treatment.
• Does one member of the pair tend to be greater than the other?
• Does NOT assume symmetric distribution of the differences around the median
• Does NOT use the magnitude of the difference
• Wilcoxon Signed Ranks Test
• A quantitative Sign Test
• DOES use magnitude of difference of paired observations
• Confidence interval based on signed rank test
• the set of values for which you wouldn't have rejected the null hypothesis
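
The sign test reduces to a binomial calculation: under the null hypothesis each nonzero difference is positive with probability 1/2. A minimal sketch of the two-sided exact test, using hypothetical pre- and post-treatment weights; tied pairs are dropped, which is a common convention assumed here.

```python
from math import comb

def sign_test(x, y):
    """Two-sided exact sign test for paired observations."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop tied pairs
    n = len(diffs)
    n_pos = sum(d > 0 for d in diffs)  # only the sign is used, not the magnitude
    k = min(n_pos, n - n_pos)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    p_value = min(1.0, 2 * tail)
    return n_pos, n, p_value

# Hypothetical weights of eight subjects before and after treatment.
pre = [82, 75, 90, 68, 77, 85, 79, 88]
post = [80, 73, 91, 65, 75, 82, 79, 84]
n_pos, n, p_value = sign_test(pre, post)
print(n_pos, n, p_value)  # 6 7 0.125
```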

### Two or more sample nonparametric tests

• Compare the centers of two or more distributions that are continuous, but not normal/gaussian.
• use deviations from the median and use the signed ranks
• SAS: proc npar1way wilcoxon
• Class variable is used for the two or more groups
• Otherwise use proc npar1way anova
• Wilcoxon Rank Sum Test/Mann-Whitney U statistic
• Null hypothesis: it is equally likely that a randomly selected value from one sample will be less than or greater than a randomly selected value from a second sample.
• Nonparametric counterpart of the unpaired two-sample t-test
• R: wilcox.test
• Intermix the observations, sort, and rank all observations. Then take mean rank for both populations.
• Can also do confidence interval
• Kruskal-Wallis
• Non-parametric method for testing whether samples originate from the same distribution
• nonparametric counterpart of one-way ANOVA
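
The rank-sum procedure described above (intermix, sort, rank, then compare mean ranks) can be sketched as follows. Function names and data are hypothetical; ties receive the average of the ranks they span, and no p-value is computed.

```python
def midranks(values):
    """1-based ranks of values, with ties given the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def rank_sum(x, y):
    """Wilcoxon rank sum W for x, Mann-Whitney U, and mean rank per sample."""
    ranks = midranks(list(x) + list(y))  # intermix and rank all observations
    w = sum(ranks[: len(x)])
    u = w - len(x) * (len(x) + 1) / 2
    mean_rank_x = w / len(x)
    mean_rank_y = (sum(ranks) - w) / len(y)
    return w, u, mean_rank_x, mean_rank_y

# Hypothetical samples.
x = [1.1, 2.3, 3.0]
y = [2.9, 4.2, 5.5, 6.1]
w, u, mean_rank_x, mean_rank_y = rank_sum(x, y)
print(w, u)  # 7.0 1.0
```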

## Goodness of fit for continuous distributions

### one sample

• empirical cumulative distribution function, compare to theoretical
• R: ecdf(data)
• Kolmogorov-Smirnov
• Not quite as good, because the statistic uses only the maximum distance between the empirical and theoretical CDFs
• Do not estimate parameters from the data
• R: ks.test(x, y="name")
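
A minimal sketch of the empirical CDF and the one-sample Kolmogorov-Smirnov statistic, the largest vertical gap between the empirical and theoretical CDFs. The data and the Uniform(0, 1) reference distribution are hypothetical; the reference parameters are fixed in advance rather than estimated from the data, and no p-value is computed.

```python
def ecdf(data):
    """Return the empirical CDF of data as a function."""
    xs = sorted(data)
    n = len(xs)
    return lambda t: sum(x <= t for x in xs) / n

def ks_statistic(data, cdf):
    """Maximum distance between the empirical CDF and a theoretical CDF."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The ECDF jumps from i/n to (i+1)/n at x; check both sides of the jump.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

def uniform_cdf(t):
    """CDF of the Uniform(0, 1) distribution."""
    return min(max(t, 0.0), 1.0)

data = [0.1, 0.4, 0.7]
F = ecdf(data)
d = ks_statistic(data, uniform_cdf)
print(d)  # about 0.3
```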

### Two-Sample

• Could have two distributions with the same mean but different shapes.
• R: ks.test(X, Y)

## Estimating Parameter Values

• R: MASS package, fitdistr(data, densfun="exponential")
• obtain maximum likelihood estimate
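
For the exponential distribution the maximum likelihood estimate has a closed form: the rate is the reciprocal of the sample mean. The sketch below uses hypothetical data and function names and checks that the estimate gives a higher log-likelihood than nearby rates.

```python
from math import log
from statistics import mean

def exponential_mle(data):
    """Maximum likelihood estimate of the exponential rate parameter."""
    return 1.0 / mean(data)

def exponential_loglik(rate, data):
    """Log-likelihood of an exponential model: n*log(rate) - rate*sum(x)."""
    return len(data) * log(rate) - rate * sum(data)

data = [0.5, 1.5, 2.0, 4.0]  # hypothetical waiting times
rate = exponential_mle(data)
print(rate)  # 0.5

best = exponential_loglik(rate, data)
lower = exponential_loglik(0.4, data)
higher = exponential_loglik(0.6, data)
```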

## Kernel Smoothing Density Function

• Matlab function
• [f,xi,u] = ksdensity(x)
• Computes a probability density estimate of the sample in the vector x
• f is the vector of density values evaluated at the points xi.
• u is the width of the kernel-smoothing window, which is calculated.
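
The idea behind a kernel density estimate can be sketched in a few lines of Python: place a Gaussian bump of a given width on every sample point and average them. This is an illustration of the technique, not a port of `ksdensity`; in particular the bandwidth is supplied by hand here rather than calculated, and all names and numbers are hypothetical.

```python
from math import exp, pi, sqrt

def gaussian_kde(sample, bandwidth):
    """Return a function that evaluates a Gaussian kernel density estimate."""
    n = len(sample)
    norm = 1.0 / (n * bandwidth * sqrt(2 * pi))

    def density(t):
        return norm * sum(exp(-0.5 * ((t - x) / bandwidth) ** 2) for x in sample)

    return density

sample = [1.0, 2.0, 2.5, 4.0]
density = gaussian_kde(sample, bandwidth=0.5)

# Evaluate on a grid of points xi, analogous to the f and xi outputs above.
step = 0.1
xi = [-2.0 + i * step for i in range(81)]  # grid from -2 to 6
f = [density(t) for t in xi]

# A density should integrate to about 1 (simple Riemann sum over the grid).
area = sum(f) * step
```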

## Linear Regression

• Linear Regression - Wikipedia article
• $y = X\beta + \varepsilon$
• y = the regressand, dependent variable.
• X = the design matrix. The $x_i$ are regressors
• each $x_i$ has a corresponding $\beta_i$ called its regression coefficient
• $\beta$ = a p-dimensional parameter vector of regression coefficients. In the case of a line, $\beta_1$ is the slope and $\beta_0$ is the y-intercept
• DISTURBANCE TERM - $\varepsilon$ - an unobserved random variable that adds noise to the linear relationship between the dependent variable and regressors.
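
For a single regressor, the least-squares estimates have a closed form: the slope is $S_{xy}/S_{xx}$ and the intercept is $\bar{y} - \beta_1 \bar{x}$. A minimal sketch with hypothetical data; the function name `fit_line` is an assumption.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for y = beta0 + beta1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta1 = sxy / sxx              # slope
    beta0 = y_bar - beta1 * x_bar  # y-intercept
    return beta0, beta1

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
beta0, beta1 = fit_line(x, y)

# Residuals: observed minus fitted; they estimate the disturbance term.
residuals = [yi - (beta0 + beta1 * xi) for xi, yi in zip(x, y)]
print(beta0, beta1)     # about 0.05 and 1.99
print(sum(residuals))   # about 0, as expected for a fit with an intercept
```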

### Regression Diagnostics

• multicollinearity -> VIF
• heteroscedasticity -> Scale-Location or Residual vs fitted
• Outliers -> Residuals vs Leverage or Leverage vs Cook's D
• Non-linearity -> Residual vs fitted
• Residual distribution -> Q-Q Plot
• Understanding Regression Diagnostic Plots
• R: Use ggfortify::autoplot

## Set theory symbols

• Set theory symbols
• $\varnothing$: \varnothing, empty set
• $\mid$: \mid, satisfies the condition
• $\cup$: \cup, union
• $\cap$: \cap, intersection
• $\setminus$: \setminus, set difference
• $\triangle$: \triangle, symmetric difference
• $\in$: \in - left side element is in right side set
• $\cdot$: \cdot, dot product, vector and matrix multiplication, scalar result
• $\times$: \times, cross product of vectors
• $\otimes$: \otimes, Kronecker (outer) product of tensors (matrices)

## Bayes

• $p(C_k \mid \mathbf{x}) = \frac{p(C_k) \ p(\mathbf{x} \mid C_k)}{p(\mathbf{x})} = \text{posterior} = \frac{\text{prior} \times \text{likelihood}}{\text{evidence}}$
• In practice, only the numerator of that fraction is of interest, because the denominator does not depend on C and the values of the features $x_i$ are given, so the denominator is effectively constant.
• The numerator is equivalent to the joint probability model
• If we assume each feature is conditionally independent of every other, then the joint model can be expressed as

\begin{align} p(C_k \mid x_1, \dots, x_n) & \varpropto p(C_k, x_1, \dots, x_n) \\ & = p(C_k) \ p(x_1 \mid C_k) \ p(x_2\mid C_k) \ p(x_3\mid C_k) \ \cdots \\ & = p(C_k) \prod_{i=1}^n p(x_i \mid C_k)\,, \end{align}

• Classifier combines probability model with a decision rule, i.e. maximum a posteriori
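
A minimal sketch of the naive Bayes calculation above for two binary features, with the maximum a posteriori decision rule. The class names, priors and likelihoods are made-up numbers for illustration only.

```python
from math import prod

# Hypothetical priors p(C_k) and per-feature likelihoods p(x_i = present | C_k).
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"offer": 0.8, "meeting": 0.1},
    "ham": {"offer": 0.2, "meeting": 0.5},
}

def posterior(features):
    """Posterior p(C_k | x) for a dict mapping feature name -> present or absent."""
    joint = {}
    for c in priors:
        # Naive assumption: features are conditionally independent given the class.
        terms = [
            likelihoods[c][name] if present else 1 - likelihoods[c][name]
            for name, present in features.items()
        ]
        joint[c] = priors[c] * prod(terms)  # prior x likelihood (the numerator)
    evidence = sum(joint.values())          # the denominator, same for every class
    return {c: v / evidence for c, v in joint.items()}

post = posterior({"offer": True, "meeting": False})
map_class = max(post, key=post.get)  # maximum a posteriori decision rule
print(map_class)  # spam
```

Because the evidence is the same for every class, the MAP class could equally be chosen from the unnormalized numerators.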

### Conditional probability

• What is the probability that a given observation D belongs to a given class C, $p(C \mid D)$
• "The probability of A under the condition B" $p(A \mid B)$
• There need not be a causal relationship
• Compare with UNconditional probability $p(A)$
• If $p(A \mid B) = p(A)$, then the events are independent: knowledge about either event does not give information on the other. Equivalently, $P(A \cap B) = P(A) P(B)$.
• Don't falsely equate $p(A \mid B)$ and $p(B \mid A)$
• Defined as the quotient of the joint probability of events A and B and the probability of B: $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$, where the numerator is the probability that both events A and B occur.
• Joint probability $P(A \cap B) = P(A \mid B)P(B)$
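
A small worked example of these definitions, starting from a hypothetical joint distribution over two events. It shows that $p(A \mid B)$ and $p(B \mid A)$ generally differ, and checks independence by comparing the joint probability with the product of the marginals.

```python
# Hypothetical joint probabilities for events A and B (they sum to 1).
joint = {
    ("A", "B"): 0.2,
    ("A", "not B"): 0.2,
    ("not A", "B"): 0.1,
    ("not A", "not B"): 0.5,
}

# Marginal probabilities: sum the joint over the other event.
p_a = joint[("A", "B")] + joint[("A", "not B")]
p_b = joint[("A", "B")] + joint[("not A", "B")]

# Conditional probabilities: joint divided by the probability of the condition.
p_a_given_b = joint[("A", "B")] / p_b
p_b_given_a = joint[("A", "B")] / p_a

# Independence would require P(A and B) == P(A) * P(B).
independent = abs(joint[("A", "B")] - p_a * p_b) < 1e-12

print(p_a_given_b, p_b_given_a, independent)
```

Here $p(A \mid B)$ is about 0.667 while $p(B \mid A)$ is 0.5, and the events are not independent because 0.2 differs from 0.4 × 0.3.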

### General

• Compare vs. Frequentist
• Pros:
• Easy and fast to predict a class of test dataset
• When the independence assumption holds, a naive Bayes classifier can perform better than other models
• Performs well in the case of categorical input variables compared to numerical variables
• Cons
• zero frequency: a category never seen in training gets probability zero (solved by smoothing techniques like Laplace estimation, i.e. adding 1 to each count so that no probability is exactly zero)
• Bad estimator - its probability estimates should not be taken too seriously
• Assumption of independent predictors, which is almost never the case.
• Applications
• Credit scoring
• Medical
• Real time prediction
• Multi-class predictions
• Text classification, spam filtering, sentiment analysis
• recommendation filtering
• Gaussian naive bayes: assume continuous data has Gaussian distribution
• The multinomial naive Bayes classifier becomes a linear classifier when expressed in log-space

## Factor Analysis

• number of variables too large
• deviations or variation that is of most interest
• reduce number of variables
• consider linear combinations of the variables
• keep the combos with large variance
• discard the ones with small variance
• latent variables explain the correlation between outcome variables
• interpretability of factors is sometimes suspect
• Used for exploratory data analysis
• >10 obs per variable
• Group variables into factors such that the variables are highly correlated
• Use PCA to examine latent common factors (1st method)

## Principal Component Analysis

• Replace the original observed random variables with uncorrelated linear combinations that result in minimum loss of information.