Statistics

==Bayes==
* <math>p(C_k \mid \mathbf{x}) = \frac{p(C_k) \ p(\mathbf{x} \mid C_k)}{p(\mathbf{x})}</math>, i.e. <math>\text{posterior} = \frac{\text{prior} \times \text{likelihood}}{\text{evidence}}</math>
* [[Bayesian Data Analysis]]
* In practice, only the numerator of that fraction is of interest: the denominator does not depend on <math>C</math>, and the feature values <math>x_i</math> are given, so the denominator is effectively constant.
 
* The numerator is equivalent to the joint probability model <math>p(C_k, x_1, \dots, x_n)</math>
 
* If we assume each feature is conditionally independent of every other feature given the class, then the joint model can be expressed as
 
 
 
<math>
\begin{align}
p(C_k \mid x_1, \dots, x_n) & \varpropto p(C_k, x_1, \dots, x_n) \\
                            & = p(C_k) \ p(x_1 \mid C_k) \ p(x_2 \mid C_k) \ p(x_3 \mid C_k) \cdots \\
                            & = p(C_k) \prod_{i=1}^n p(x_i \mid C_k)
\end{align}
</math>
 
* Classifier combines the probability model with a decision rule, e.g. pick the most probable class (maximum a posteriori, MAP)
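
For concreteness, a minimal Gaussian naive Bayes sketch in R (the iris data set bundled with R is used purely for illustration, and predict_nb is an ad-hoc helper name):

 ## Gaussian naive Bayes by hand on the built-in iris data:
 ## estimate per-class priors and per-feature normal densities, then apply the MAP rule.
 train  <- iris[, 1:4]
 labels <- iris$Species
 means  <- aggregate(train, by = list(class = labels), FUN = mean)
 sds    <- aggregate(train, by = list(class = labels), FUN = sd)
 
 predict_nb <- function(x) {
   scores <- sapply(levels(labels), function(k) {
     mu    <- unlist(means[means$class == k, -1])   # per-feature means for class k
     sigma <- unlist(sds[sds$class == k, -1])       # per-feature standard deviations for class k
     prior <- mean(labels == k)                     # p(C_k)
     # log p(C_k) + sum_i log p(x_i | C_k): the log of the numerator above
     log(prior) + sum(dnorm(unlist(x), mu, sigma, log = TRUE))
   })
   names(which.max(scores))                         # MAP rule: return the most probable class
 }
 
 predict_nb(iris[1, 1:4])   # expected: "setosa"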
 
===Conditional probability===
 
* What is the probability that a given observation D belongs to a given class C, <math>p(C \mid D)</math>
 
* "The probability of A under the condition B" <math>p(A \mid B)</math>
 
* There need not be a causal relationship
 
* Compare with UNconditional probability <math>p(A)</math>
 
* If <math>p(A \mid B) = p( A )</math>, then the events are independent: knowledge about either event gives no information about the other. Equivalently, <math>P(A \cap B) = P(A)\,P(B).</math>
 
* Don't falsely equate <math>p(A \mid B)</math> and <math>p(B \mid A)</math>
 
* Defined as the quotient of the joint probability of events A and B and the probability of B: <math>P(A \mid B) = \frac{P(A \cap B)}{P(B)},</math> where the numerator is the probability that both events A and B occur.
 
* Joint probability <math>P(A \cap B) = P(A \mid B)P(B)</math>
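
A worked example: roll a fair six-sided die and let A = "the roll is even" and B = "the roll is at most 3". Then

<math>
P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{P(\{2\})}{P(\{1,2,3\})} = \frac{1/6}{1/2} = \frac{1}{3},
</math>

which differs from <math>P(A) = 1/2</math>, so A and B are not independent.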
 
===General===
 
* Compare vs. Frequentist
 
* [https://www.youtube.com/watch?v=CPqOCI0ahss Naive Bayes youtube vid]
 
* Pros:

** Easy and fast to predict the class of a test data set

** Performs well relative to more complex models when the conditional-independence assumption approximately holds

** Performs well with categorical input variables compared to numerical variables
 
* Cons:

** Zero frequency: a feature value never observed with a class in training gets an estimated probability of zero, which wipes out the whole product (solved by smoothing techniques such as Laplace / add-one smoothing)

** Bad estimator: the predicted class probabilities should not be taken too seriously

** Assumes independent predictors, which is almost never the case in practice
 
* Applications
 
** Credit scoring
 
** Medical
 
** Real time prediction
 
** Multi-class predictions
 
** Text classification, spam filtering, sentiment analysis
 
** Recommendation systems
 
* Gaussian naive Bayes: assumes each continuous feature follows a Gaussian distribution within each class
 
* The multinomial naive Bayes classifier becomes a linear classifier when expressed in log space (see the derivation below)
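
Sketch of the derivation: for multinomial count features <math>x_i</math>, <math>p(\mathbf{x} \mid C_k) \propto \prod_i p_{ki}^{x_i}</math> with <math>p_{ki} = p(\text{feature } i \mid C_k)</math>, up to a factor that does not depend on the class, so the log of the posterior numerator is

<math>
\log p(C_k) + \sum_{i=1}^n x_i \log p_{ki} = b_k + \mathbf{w}_k^\top \mathbf{x},
</math>

which is linear in <math>\mathbf{x}</math>, with intercept <math>b_k = \log p(C_k)</math> and weights <math>w_{ki} = \log p_{ki}</math>.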
 
  
 


General

  • Test MathJax on this page:
  • Exploratory Data Analysis - branch of statistics emphasizing visuals, developed by John Tukey
  • Given eqn y = mx + b, the dependent variable is y, and the independent variable is x.
  • Statistical unit = one member of entities being studied. One person in population study, one image in classification problem.
  • Conditional probability - the probability of some event A, given the occurrence of some other event B. Conditional probability is written P(A|B), and is read "the (conditional) probability of A, given B"
  • Joint probability is the probability of both events happening together. The joint probability of A and B is written P(A ∩ B), P(AB) or P(A, B)
  • Marginal probability is the unconditional probability of an event, obtained by summing the joint probability over the outcomes of the other variable. For example, if there are two possible outcomes for X with corresponding events B and B', then P(A) = P(A ∩ B) + P(A ∩ B') (see the short R example after this list).
  • column rank and row rank
  • degrees of freedom = the number of values in the final calculation that are free to vary.
  • residuals = for each observation, the difference between the observed value and the value the model predicts (in the simplest case, the average of all the observations).
    • the sum of the residuals is necessarily 0 when the model includes a mean/intercept term.
  • probability mass function = pmf, used for DISCRETE random variables
  • principle of indifference, which assigns equal probabilities to all possibilities.
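
A quick R illustration of the joint / marginal / conditional definitions above (the table values are made up for the example):

 ## Joint distribution of two binary events A and B (numbers sum to 1)
 joint <- matrix(c(0.10, 0.20,
                   0.30, 0.40),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(A = c("A", "not A"), B = c("B", "not B")))
 p_A         <- sum(joint["A", ])        # marginal:    P(A) = P(A, B) + P(A, not B) = 0.30
 p_B         <- sum(joint[, "B"])        # marginal:    P(B) = 0.40
 p_A_given_B <- joint["A", "B"] / p_B    # conditional: P(A|B) = 0.10 / 0.40 = 0.25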

Error bars

null hypothesis

The null hypothesis typically corresponds to a general or default position. The practice of science involves formulating and testing hypotheses, assertions that are capable of being proven false using a test of observed data. The null hypothesis can never be proven. A set of data can only reject a null hypothesis or fail to reject it.

binomial probability

binomial distribution

  • p = probability that event will occur
  • q = probability that event won't occur
  • p and q are complementary = p + q = 1
  • n = number of trials
  • k = number of successes
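
For example, the probability of exactly k successes in n trials is C(n, k) p^k q^(n-k); a quick check in R (arbitrary numbers):

 n <- 10; p <- 0.3; k <- 4
 choose(n, k) * p^k * (1 - p)^(n - k)   # binomial pmf computed by hand
 dbinom(k, size = n, prob = p)          # same value from the built-in pmf
 pbinom(k, size = n, prob = p)          # P(X <= k), the cumulative probability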

Binomial approximation

  • standard score = how many standard deviations an observation is above or below the mean.
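
In R (illustrative values):

 x <- c(12, 15, 9, 20, 14)
 (x - mean(x)) / sd(x)   # standard (z-) scores: deviations from the mean in sd units
 scale(x)                # built-in equivalent (returns a matrix with centering/scaling attributes)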

Tests for Categorical Data

  • Goodness of fit for single categorical variable
    • compare observed counts to expected counts; each category's "contribution term" is (observed - expected)^2 / expected
    • the test statistic sums these contributions, a relative measure of how far the observed counts are from the expected counts
    • Get p-values from a chi-squared distribution with k - 1 degrees of freedom, where k = number of categories (i.e., classes)
    • If the null hypothesis is true, observed counts are close to expected counts
    • Test statistic has a chi-squared distribution.
 proc freq data=<whatevs>;
    tables var1 / chisq;                  /* chi-squared goodness-of-fit test */
    tables var2 / chisq testp=(values);   /* testp= hypothesized percents; testf=(values) for frequencies */
 run;
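
The same goodness-of-fit test in base R (made-up counts and hypothesized proportions):

 observed <- c(a = 30, b = 50, c = 20)           # observed counts per category
 chisq.test(observed, p = c(0.25, 0.50, 0.25))   # compare with the hypothesized proportions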

Tests for two-way variables

  • test for homogeneity - the distribution of proportions is the same across the populations
  • test of independence - whether the row and column variables are associated or independent
 proc freq data=<whatevs>;
    tables var1*var2 / chisq fisher;      /* chi-squared test plus Fisher's exact test */
    tables var1*var2 / chisq cellchi2;    /* cellchi2 prints each cell's contribution */
 run;
  • Use Fisher's exact test if the sample size is small (e.g., expected cell counts below 5).
    • R: fisher.test(table)
  • cellchi2 gives each cell's contribution - how far the observed count is from the expected count on a per-cell basis
  • the weight statement names the variable holding the cell counts (used when the data are already aggregated)
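
The corresponding tests in R, given a two-way table of counts (made-up numbers):

 tab <- matrix(c(12,  5,
                  8, 15),
               nrow = 2, byrow = TRUE,
               dimnames = list(group = c("g1", "g2"), outcome = c("yes", "no")))
 chisq.test(tab)    # chi-squared test of independence
 fisher.test(tab)   # Fisher's exact test, preferred when cell counts are small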

T-Test

  • "Student's t-distribution"
  • When data are normally distributed
  • Can test hypotheses about the mean/center of the distribution

One-sample t-test

  • Test whether the mean is greater than / less than / equal to some hypothesized value
  • SAS proc means
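
In R (made-up data; the hypothesized mean of 5 is arbitrary):

 x <- c(5.1, 4.9, 6.2, 5.5, 4.7, 5.8)
 t.test(x, mu = 5)                            # two-sided test of H0: mean = 5
 t.test(x, mu = 5, alternative = "greater")   # one-sided version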

Two-Sample t-test

  • Test whether two population means are equal.
  • Unpaired or independent samples t-test: Are the variances the same?
    • If no, it's called a "two-sample t-test", "unequal variances t-test", or "Welch's t-test"
    • If yes it's called a "pooled t-test" or "Student's t-test"
    • F-statistic tests whether the variances are equal
  • Paired or repeated-measurements t-test - each subject's before and after observations are subtracted; is the mean difference different from zero?
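
In R (x and y are made-up samples):

 x <- c(20, 22, 19, 24, 25, 21)
 y <- c(23, 27, 25, 26, 24, 28)
 t.test(x, y)                     # Welch's / unequal-variances t-test (the default)
 t.test(x, y, var.equal = TRUE)   # pooled / Student's t-test
 var.test(x, y)                   # F-test that the two variances are equal
 t.test(x, y, paired = TRUE)      # paired t-test (x and y measured on the same subjects)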

Nonparametric tests

  • Hypothesis testing when you can't assume data comes from normal distribution
  • a lot of non-parametric approaches are based on ranks
  • do not depend on normality
  • Whereas the other tests are really tests for means, nonparametric tests are actually tests for medians

One-sample tests

  • SAS proc univariate for these
  • Sign test
    • The sign test is necessarily one-sample, so if you pass the function two samples it assumes they are paired
    • PAIRED observations with test x > y, x = y, or x < y.
    • Test for consistent differences between pairs, such as the weight of subjects pre and post treatment.
    • Does one member of the pair tend to be greater than the other?
    • Does NOT assume symmetric distribution of the differences around the median
    • Does NOT use the magnitude of the difference
  • Wilcoxon Signed Ranks Test
    • A quantitative Sign Test
    • DOES use magnitude of difference of paired observations
  • Confidence interval based on signed rank test
    • the set of parameter values for which you would not have rejected the null hypothesis
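
In R (made-up before/after values; base R has no dedicated sign-test function, so binom.test on the count of positive differences serves as the sign test):

 before <- c(80, 75, 90, 68, 72, 85, 77)
 after  <- c(78, 70, 88, 69, 65, 80, 76)
 d <- after - before
 binom.test(sum(d > 0), sum(d != 0))                          # sign test: uses only the signs
 wilcox.test(after, before, paired = TRUE)                    # Wilcoxon signed-rank: uses magnitudes
 wilcox.test(after, before, paired = TRUE, conf.int = TRUE)   # confidence interval from signed ranks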

Two or more sample nonparametric tests

  • Compare the centers of two or more distributions that are continuous, but not normal/gaussian.
  • use deviations from the median and use the signed ranks
  • SAS: proc npar1way wilcoxon
    • Class variable is used for the two or more groups
    • Otherwise use proc npar1way anova
  • Wilcoxon Rank Sum Test/Mann-Whitney U statistic
    • Null hypothesis: it is equally likely that a randomly selected value from one sample will be less than or greater than a randomly selected value from a second sample.
    • Equivalent of unequal variances t-test
    • R: wilcox.test
    • Intermix the observations, sort, and rank all observations. Then take mean rank for both populations.
    • Can also do confidence interval
  • Kruskal-Wallis
    • Non-parametric method for testing whether samples originate from the same distribution
    • equivalent to One-way ANOVA
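
In R (made-up samples and groups):

 x <- c(3.1, 2.8, 4.0, 3.6, 2.9)
 y <- c(4.5, 5.1, 4.8, 3.9, 5.4)
 wilcox.test(x, y, conf.int = TRUE)   # Wilcoxon rank-sum / Mann-Whitney U test, with CI

 values <- c(x, y, 6.0, 5.8, 6.3, 5.5, 6.1)
 groups <- factor(rep(c("a", "b", "c"), each = 5))
 kruskal.test(values ~ groups)        # Kruskal-Wallis test across three groups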

Goodness of fit for continuous distributions

one sample

  • empirical cumulative distribution function, compare to theoretical
    • R: ecdf(data)
  • Kolmogorov-Smirnov
    • Not quite as good, because the statistic is just the maximum distance between the empirical and theoretical CDFs
  • Do not estimate the distribution's parameters from the same data being tested (that invalidates the p-values)
  • R: ks.test(x, y = "pnorm"), where y names the theoretical CDF

Two-Sample

  • Could have two distributions with the same mean but different shapes.
  • R: ks.test(X, Y)
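
A quick R illustration (simulated data, so exact results depend on the seed):

 set.seed(1)
 x <- rnorm(100)
 plot(ecdf(x))                               # empirical CDF of the sample
 ks.test(x, y = "pnorm", mean = 0, sd = 1)   # one-sample KS test against a fully specified normal
 y <- rnorm(100, mean = 0.5)
 ks.test(x, y)                               # two-sample KS test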

Estimating Parameter Values

  • R: MASS package, fitdistr(data, densfun="exponential")
    • obtain maximum likelihood estimate
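
For example (simulated data; MASS ships with R):

 library(MASS)
 set.seed(1)
 x <- rexp(200, rate = 2)
 fitdistr(x, densfun = "exponential")   # ML estimate of the rate (true value here is 2)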

Kernel Smoothing Density Function

  • Matlab function
  • [f,xi,u] = ksdensity(x)
  • Computes a probability density estimate of the sample in the vector x
  • f is the vector of density values evaluated at the points xi.
  • u is the width of the kernel-smoothing window (the bandwidth), which is calculated automatically.
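
The analogous call in R is the built-in density() function:

 x <- rnorm(200)   # simulated sample
 d <- density(x)   # kernel density estimate; the bandwidth is chosen automatically
 d$bw              # the selected bandwidth (plays the role of u above)
 plot(d)           # plot the estimated density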

Linear Discriminant Analysis

Receiver operating characteristic curve

Linear Regression

  • Linear Regression - Wikipedia article
  • y = X beta + epsilon
  • y = the regressand, dependent variable.
  • X = the design matrix. x sub i are regressors
  • each x sub i has a corresponding coefficient beta sub i; the coefficient on the constant regressor is the intercept
  • beta = a p-dimensional parameter vector of regression coefficients. In the case of a line, beta_1 is the slope and beta_0 is the y-intercept
  • DISTURBANCE TERM - epsilon - an unobserved random variable that adds noise to the linear relationship between the dependent variable and regressors.
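
A minimal fit in R (simulated data, so the true intercept and slope are known):

 set.seed(1)
 x <- runif(100, 0, 10)
 y <- 2 + 3 * x + rnorm(100)   # true intercept 2, slope 3, plus noise (the disturbance term)
 fit <- lm(y ~ x)              # least-squares estimates of beta_0 and beta_1
 coef(fit)                     # fitted intercept and slope
 summary(fit)                  # coefficients, standard errors, R-squared, etc.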

Regression Diagnostics

  • multicollinearity -> VIF
  • heteroscedasticity -> Scale-Location or Residual vs fitted
  • Outliers -> Residuals vs Leverage or Leverage vs Cook's D
  • Non-linearity -> Residual vs fitted
  • Residual distribution -> Q-Q Plot
  • Understanding Regression Diagnostic Plots
  • R: Use ggfortify::autoplot
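
Base R draws the four standard diagnostic plots directly from a fitted model (the simulated regression above is re-created here so the snippet stands alone); for multicollinearity, car::vif() applies when there are two or more predictors:

 set.seed(1)
 x <- runif(100, 0, 10); y <- 2 + 3 * x + rnorm(100)
 fit <- lm(y ~ x)
 par(mfrow = c(2, 2))   # 2 x 2 grid for the four plots
 plot(fit)              # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage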

Eigen vector & Eigen Value

Eigen values and eigenvectors
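
In R (a small symmetric matrix chosen for the example):

 A <- matrix(c(2, 1,
               1, 2), nrow = 2)
 e <- eigen(A)
 e$values    # eigenvalues (3 and 1 for this matrix)
 e$vectors   # corresponding eigenvectors, one per column
 A %*% e$vectors[, 1] - e$values[1] * e$vectors[, 1]   # ~ zero vector, since A v = lambda v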


Maximum Likelihood Estimate

Mixed and Multilevel Models

Set theory symbols

  • Set theory symbols
  • \varnothing, empty set
  • \mid, satisfies the condition ("such that")
  • \cup, union
  • \cap, intersection
  • \setminus, set difference
  • \triangle, symmetric difference
  • \in - left-side element is in the right-side set
  • \cdot, dot product, vector and matrix multiplication, scalar result
  • \times, cross product of vectors
  • \otimes, Kronecker (outer) product of a tensor (matrix)


Factor Analysis

  • number of variables too large
  • deviations or variation that is of most interest
  • reduce number of variables
  • consider linear combinations of the variables
  • keep the combos with large variance
  • discard the ones with small variance
  • latent variables explain the correlation between outcome variables
  • interpretability of factors is sometimes suspect
  • Used for exploratory data analysis
  • >10 obs per variable
  • Group variables into factors such that the variables are highly correlated
  • Use PCA to examine latent common factors (1st method)

Principal Component Analysis

  • Replace original observed random variables with uncorrelated linear combinations, resulting in minimum loss of information.
  • factor loadings represent how strongly each original variable contributes to (correlates with) each component
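
The corresponding PCA sketch in R (same mtcars columns, scaled so high-variance variables don't dominate):

 vars <- mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]
 pc <- prcomp(vars, scale. = TRUE)
 summary(pc)    # proportion of variance explained by each principal component
 pc$rotation    # loadings: the weight of each original variable in each component
 pc$x[, 1:2]    # the data projected onto the first two components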