Statistics
Contents
- 1 General
- 2 Basic Probability
- 3 Error bars
- 4 Null hypothesis
- 5 Probability Distributions
- 6 Frequentist approach to Statistical Inference
- 7 Bayesian Inference
- 8 Tests for Categorical Data
- 9 T-Test
- 10 Nonparametric tests
- 11 Goodness of fit for continuous distributions
- 12 Estimating Parameter Values
- 13 Kernel Smoothing Density Function
- 14 Linear Discriminant Analysis
- 15 Receiver operating characteristic curve
- 16 Linear Regression
- 17 Eigenvectors & Eigenvalues
- 18 Mixed and Multilevel Models
- 19 Set theory symbols
- 20 Bayesian Statistics
- 21 Maximum Likelihood Estimation
- 22 Factor Analysis
- 23 Principal Component Analysis
General
- degrees of freedom = the number of values in the final calculation that are free to vary.
- residual = for each observation, the difference between that observation and the average of all the observations.
- the sum of the residuals is necessarily 0 (when the model includes a mean/intercept).
- probability mass function = PMF is for DISCRETE random variables
Basic Probability
Joint probability (intersection)
- The probability of both events happening together.
- The joint probability of A and B is written P(A ∩ B), P(AB), or P(A, B)
- LaTeX set intersection sign: \cap; Python set-intersection operator: & (ampersand)
- Joint event = depends on classes from two different variables
- Joint probability distribution for categorical variables - list out in a table, all numbers sum to 1. Marginal tallies is sum of joint probs, ignores one of the variables.
Marginal probability
- Essentially the opposite of conditional probability.
- If there are two possible outcomes for X with corresponding events B and B', this means that P(A) = P(A ∩ B) + P(A ∩ B').
- In a joint probability table, the marginals are the row and column totals
Union of two events
Conditional probability
- The probability of some event A, given the occurrence of some other event B.
- Conditional probability is written P(A | B)
- Joint probability divided by the marginal probability: P(A | B) = P(A ∩ B) / P(B)
- Conditional probability distribution - adds up to 1
- Compare with marginal probability distribution
Variable Independence
- If independent then P(A | B) = P(A)
- Imposing the condition B doesn't affect the probability of A at all.
- In a sample, we would expect the two probabilities to not match up exactly anyway
- If independent then the joint probability/intersection factorizes: P(A ∩ B) = P(A) P(B)
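These relationships can be checked numerically. A minimal Python sketch (numpy assumed available) with a hypothetical 2x2 table of counts for two binary variables:

```python
import numpy as np

# Hypothetical counts for two binary variables A and B
counts = np.array([[30, 10],   # A = 0: (B = 0, B = 1)
                   [20, 40]])  # A = 1: (B = 0, B = 1)
joint = counts / counts.sum()      # joint distribution P(A, B); sums to 1

p_a = joint.sum(axis=1)            # marginal P(A): sum out B
p_b = joint.sum(axis=0)            # marginal P(B): sum out A

# Conditional distribution P(A | B = 1): joint column / marginal; sums to 1
p_a_given_b1 = joint[:, 1] / p_b[1]

# Independence would mean the joint equals the product of the marginals
independent = np.allclose(joint, np.outer(p_a, p_b))
```

Here independence fails: P(A = 1 | B = 1) = 0.8, while the marginal P(A = 1) is only 0.6, so conditioning on B changes the probability of A.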
Error bars
Null hypothesis
The null hypothesis typically corresponds to a general or default position. The practice of science involves formulating and testing hypotheses, assertions that are capable of being proven false using a test of observed data. The null hypothesis can never be proven. A set of data can only reject a null hypothesis or fail to reject it.
Probability Distributions
- Notation convention
- Capital roman letter - a random variable
- Lowercase roman letter - a possible value it might take
- Unknown values represented with greek letters rather than roman
- Probability mass function - gives the probability of different outcomes of the random variable
- PMF for discrete and probability density function (PDF) if continuous. Can view everything as a density, though.
Discrete Distributions
Bernoulli
- Only 2 possible outcomes, success or failure
- X ~ Bernoulli(p): P(X = 1) = p, P(X = 0) = 1 - p
- f(x) = p^x (1 - p)^(1 - x) for x in {0, 1}
- Expected value E[X] = p; variance Var(X) = p(1 - p)
Binomial
- Binomial = generalization of Bernoulli where you have n repeated trials. The sum of n independent Bernoullis
- X ~ Binomial(n, p)
- n = number of trials
- p = probability of success on each trial
- f(x) = (n choose x) p^x (1 - p)^(n - x), for x = 0, 1, ..., n
- "n choose x" is the combinatorial coefficient n! / (x! (n - x)!)
- Expected value E[X] = np
- Variance Var(X) = np(1 - p)
- Binomial approximation standard score z = (x - np) / sqrt(np(1 - p)) - how many standard deviations an observation is above or below the mean.
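The binomial formulas above can be verified numerically; a small Python sketch (scipy assumed available, with an arbitrary n, p, and x):

```python
from math import comb, sqrt
from scipy.stats import binom

n, p, x = 10, 0.3, 4

# pmf by hand: "n choose x" * p^x * (1 - p)^(n - x)
manual_pmf = comb(n, x) * p**x * (1 - p)**(n - x)
scipy_pmf = binom.pmf(x, n, p)

# E[X] = np and Var(X) = np(1 - p)
mean, var = binom.stats(n, p, moments='mv')

# Standard score under the normal approximation to the binomial
z = (x - n * p) / sqrt(n * p * (1 - p))
```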
Geometric
- The number of trials to observe a success
Multinomial
- Generalize bernoulli and binomial to more than one possible outcome
Poisson
- Used for counts
- parameter is the rate at which we expect to observe the thing we are counting
Continuous Distributions
- integral from -inf to inf of f(x) dx = 1: "The probability that something happens = 1"
- f(x) >= 0: densities are non-negative for all possible values of x
- E[X] = integral from -inf to inf of x f(x) dx
- Analogous to the sum we have for discrete variables
Uniform
- X ~ Uniform(a, b)
- f(x) = 1/(b - a) * I(a <= x <= b), where I is the indicator function (condition: x is on the interval)
- E[X] = (a + b) / 2
- Integrate to ascertain probability between two given values
- the probability that X = any single given value is 0: integrating from x to x gives 0
Exponential
- E.g, a bus that comes every 10 minutes, the exponential is your waiting time
- Rate parameter
- Events that occur at a particular rate, and the exponential is the waiting time between events
- f(x) = lambda * e^(-lambda * x) for x >= 0
Normal
Standard Normal
Parameterized Normal with mu and sigma
t distribution
- Use if you don't know the true value of sigma. Replacing sigma with the sample standard deviation adds variability, giving the t heavier tails than the normal
- Its density involves the gamma function
Gamma
- Total waiting time for all events to occur, for more than in random variable with an exponential distribution.
Beta
- Used for random variables which take on values between 0 and 1. Commonly used to model probabilities.
Cumulative Distribution Function
- CDF exists for every distribution
- It's convenient for calculating probabilities of intervals, e.g., P( -1 < X < 1 )
- F(x) = P(X <= x)
- F(x) = sum over t <= x of f(t) - for PMF discrete distributions
- F(x) = integral from -inf to x of f(t) dt, where f(t) is the probability density function
R functions
Normal distribution
dnorm(x, mean, sd)
- Evaluate the PDF at x
pnorm(q, mean, sd)
- Evaluate the CDF at q
qnorm(p, mean, sd)
- Evaluate the quantile function at p
rnorm(n, mean, sd)
- Draw n pseudo-random samples from the normal distribution.
Various distributions
dbinom(x, size, prob)
dpois(x, lambda)
dexp(x, rate)
dgamma(x, shape, rate)
dunif(x, min, max)
dbeta(x, shape1, shape2)
dnorm(x, mean, sd)
dt(x, df)
where df is the degrees of freedom
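R's d/p/q/r naming convention maps directly onto scipy's pdf/cdf/ppf/rvs methods; a Python sketch for the normal distribution (scipy assumed available):

```python
from scipy.stats import norm

# R: dnorm / pnorm / qnorm / rnorm  ->  scipy: pdf / cdf / ppf / rvs
density    = norm.pdf(0, loc=0, scale=1)        # dnorm(0)
cumulative = norm.cdf(1.96, loc=0, scale=1)     # pnorm(1.96)
quantile   = norm.ppf(0.975, loc=0, scale=1)    # qnorm(0.975)
samples    = norm.rvs(loc=0, scale=1, size=5, random_state=0)  # rnorm(5)
```

The other distributions listed above follow the same pattern, e.g. `scipy.stats.binom`, `poisson`, `expon`, `gamma`, `uniform`, `beta`, and `t`.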
Frequentist approach to Statistical Inference
- View the data as a RANDOM sample from some larger population
- reference population - the larger group that you're trying to generalize to based on the sample
Confidence Intervals
- Central Limit Theorem says the sum of all the xi's will follow approximately a Gaussian distribution
- 95% of the time we get a result within 1.96 standard deviations of the mean.
- Use mean and standard deviation to define confidence intervals for a given level of confidence
- We're 95% confident that the true population wide mean is on that interval
- E.g., "It's plausible (supported by the data) that the coin is fair because 0.5 is on the interval."
- Frequentist interpretation: in a hypothetical infinite sequence of repetitions of this trial, each time creating a confidence interval based on the data we observe, on average 95% of the intervals we make will contain the true value of p.
- But what about THIS interval? Does this interval contain the true value of p? What's the probability that this interval contains the true p? From the frequentist perspective, the probability that p is on the interval is either 0 or 1
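A minimal sketch of the 95% normal-approximation interval for a proportion, with a hypothetical coin-flip sample:

```python
from math import sqrt

# Hypothetical data: 44 heads in 100 flips
n, heads = 100, 44
p_hat = heads / n
se = sqrt(p_hat * (1 - p_hat) / n)           # standard error of the proportion
lower, upper = p_hat - 1.96 * se, p_hat + 1.96 * se

# 0.5 falls inside the interval, so a fair coin is plausible here
contains_half = lower <= 0.5 <= upper
```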
Bernoulli Example
- Two coins: a loaded coin with P(heads) = 0.7, and a fair coin
- What is the probability that the coin is loaded after you observe 5 trials?
- Which coin do you think it is, and how sure are you about that?
- X = number of heads in 5 flips; X ~ Binomial(5, ?)
- Theta is the unknown parameter, theta in { fair, loaded }
- P(heads) = 0.5 if theta = fair, 0.7 if theta = loaded
- The data is 5 flips and the question is what's that probability.
- f(x | theta) = (5 choose x) (0.5)^5 * I(theta = fair) + (5 choose x) (0.7)^x (0.3)^(5-x) * I(theta = loaded)
- This is the likelihood function using indicator (step) function notation
- Say we observe two heads, X = 2; what's our likelihood (using f as notation instead of l)?
- Plug in X = 2: get 0.3125 if theta = fair and 0.1323 if theta = loaded
- Having observed two heads, we can say that the likelihood is higher for theta = fair than theta = loaded.
- MLE: theta-hat = fair - given the data, it's most likely that the coin is fair
- This is a point estimate, but how to answer the question "how sure are you?"
- Another question: what is P(theta = fair | X = 2)?
- In the frequentist paradigm, the coin is a physical quantity; it's a fixed coin, and therefore has a fixed probability of coming up heads
- In this interpretation theta is fixed, and the probability P(theta = fair | X = 2) is either 0 or 1
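The two likelihood values quoted above can be reproduced directly from the binomial pmf:

```python
from math import comb

def likelihood(x, n, p):
    # Binomial likelihood: (n choose x) p^x (1 - p)^(n - x)
    return comb(n, x) * p**x * (1 - p)**(n - x)

lik_fair   = likelihood(2, 5, 0.5)   # theta = fair
lik_loaded = likelihood(2, 5, 0.7)   # theta = loaded
mle = 'fair' if lik_fair > lik_loaded else 'loaded'
```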
Bayesian Inference
- An advantage of the Bayesian approach is that it allows you to easily incorporate prior information, when you know something in advance of looking at the data.
- Bayesian approach is inherently subjective
Bernoulli Example
- What is the probability that the coin is loaded before and after you observe 5 trials?
- Prior: P(theta = fair) = 0.4, P(theta = loaded) = 0.6
- f(theta | x) = f(x | theta) f(theta) / sum over theta of f(x | theta) f(theta)
- f(theta | x) is the updated posterior
- Here we state the likelihood function as f(x | theta) rather than l(theta)
- Denominator is the sum over all possibilities of theta, of which there are only two: theta = fair and theta = loaded. The denominator is the normalizing constant so that the terms in the numerator sum to 1.
- Numerator: multiply the likelihood under theta = fair times the prior probability that the coin is fair, i.e., 0.4, and the likelihood under theta = loaded times the prior probability that the coin is loaded, i.e., 0.6.
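Carrying the arithmetic through in Python with the stated prior (0.4 fair, 0.6 loaded) and, as an assumption, the X = 2 heads observed in the earlier frequentist example:

```python
from math import comb

def likelihood(x, n, p):
    # Binomial likelihood: (n choose x) p^x (1 - p)^(n - x)
    return comb(n, x) * p**x * (1 - p)**(n - x)

x, n = 2, 5                          # assumed data: 2 heads in 5 flips
prior_fair, prior_loaded = 0.4, 0.6

num_fair   = likelihood(x, n, 0.5) * prior_fair
num_loaded = likelihood(x, n, 0.7) * prior_loaded
norm_const = num_fair + num_loaded   # the denominator / normalizing constant

post_fair   = num_fair / norm_const
post_loaded = num_loaded / norm_const
```

With these numbers the posterior probability that the coin is loaded drops from the prior 0.6 to about 0.39, and unlike the frequentist paradigm this is a direct probability statement about theta.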
Tests for Categorical Data
- Goodness of fit for single categorical variable
- compare observed counts to the expected counts; the "contribution terms" are (O - E)^2 / E for each category
- test statistic X^2 = sum of (O - E)^2 / E - the relative distance the observed are from the expected
- Get p-values from the chi-squared distribution with k - 1 deg freedom, where k = number of categories (i.e., classes)
- If the null hypothesis is true, observed is close to expected
- Test statistic has a chi-squared distribution.
proc freq data=<whatevs>; tables var1 / chisq; tables var2 / chisq testp=(values); * or testf=(values);
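The same goodness-of-fit test can be sketched in Python (scipy assumed available; hypothetical counts for k = 3 categories):

```python
from scipy.stats import chisquare

observed = [30, 45, 25]
expected = [33, 33, 34]   # must sum to the same total as the observed counts

# stat = sum over categories of (O - E)^2 / E; df = k - 1 = 2
stat, p_value = chisquare(observed, f_exp=expected)
```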
Tests for two-way variables
- test for homogeneity - distribution of proportions are the same across the populations
- test of independence - are the two variables independent of each other within a single population?
proc freq data=<whatevs>; tables var1 / chisq fisher; tables var2 / chisq cellchi2;
- Use Fisher's exact test if the sample size (expected cell counts) is small.
- R:
fisher.test(table)
- cellchi2 is the cell contribution - how far the observed is from the expected on a per-cell basis
- the weight statement names the count variable when the data are already tabulated
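A Python sketch of Fisher's exact test (scipy assumed available) on a hypothetical 2x2 table with counts small enough that the chi-squared approximation would be unreliable:

```python
from scipy.stats import fisher_exact

table = [[3, 9],
         [8, 2]]
odds_ratio, p_value = fisher_exact(table, alternative='two-sided')
```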
T-Test
- "Student's t-distribution"
- When data are normally distributed
- Can test hypotheses about the mean/center of the distribution
One-sample t-test
- Test is mean greater than/less than/equal to some value
- SAS proc means
Two-Sample t-test
- Test whether two population means are equal.
- Unpaired or independent samples t-test: Are the variances the same?
- If no, it's called an "unequal variances t-test" or "Welch's t-test"
- If yes it's called a "pooled t-test" or "Student's t-test"
- F-statistic tests whether the variances are equal
- Paired or repeated-measures t-test - the before and after observations are subtracted; is the difference different from zero?
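The three variants above can be sketched in Python (scipy assumed available; the measurements are hypothetical):

```python
from scipy.stats import ttest_ind, ttest_rel

# Hypothetical measurements from two groups
a = [5.1, 4.9, 5.4, 5.0, 5.2, 4.8]
b = [5.6, 5.8, 5.5, 5.9, 5.7, 5.4]

pooled = ttest_ind(a, b, equal_var=True)    # pooled / Student's t-test
welch  = ttest_ind(a, b, equal_var=False)   # unequal variances / Welch's t-test
paired = ttest_rel(a, b)                    # paired: is the mean difference 0?
```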
Nonparametric tests
- Hypothesis testing when you can't assume data comes from normal distribution
- a lot of non-parametric approaches are based on ranks
- do not depend on normality
- Whereas the other tests are really tests for means, nonparametric tests are actually for medians
One-sample tests
- SAS proc univariate for these
- Sign test
- The sign test is necessarily one-sample, so if you give the function call two samples, it will assume it's a paired dataset
- PAIRED observations with test x > y, x = y, or x < y.
- Test for consistent differences between pairs, such as the weight of subjects pre and post treatment.
- Does one member of the pair tend to be greater than the other?
- Does NOT assume symmetric distribution of the differences around the median
- Does NOT use the magnitude of the difference
- Wilcoxon Signed Ranks Test
- A quantitative Sign Test
- DOES use magnitude of difference of paired observations
- Confidence interval based on signed rank test
- what are the set of values for which you wouldn't have rejected the null hypothesis
Two or more sample nonparametric tests
- Compare the centers of two or more distributions that are continuous, but not normal/gaussian.
- use deviations from the median and use the signed ranks
- SAS:
proc npar1way wilcoxon
- Class variable is used for the two or more groups
- Otherwise use
proc npar1way anova
- Wilcoxon Rank Sum Test/Mann-Whitney U statistic
- Null hypothesis: it is equally likely that a randomly selected value from one sample will be less than or greater than a randomly selected value from a second sample.
- Nonparametric analogue of the two-sample t-test
- R:
wilcox.test
- Intermix the observations, sort, and rank all observations. Then take mean rank for both populations.
- Can also do confidence interval
- Kruskal-Wallis
- Non-parametric method for testing whether samples originate from the same distribution
- equivalent to One-way ANOVA
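Both rank-based tests are available in Python as well (scipy assumed available; hypothetical samples where the y values tend to be larger):

```python
from scipy.stats import mannwhitneyu, kruskal

x = [1.2, 3.4, 2.2, 4.1, 2.9]
y = [5.5, 6.1, 4.9, 7.2, 5.8]

u_stat, u_p = mannwhitneyu(x, y, alternative='two-sided')  # Wilcoxon rank sum
h_stat, h_p = kruskal(x, y)                                # Kruskal-Wallis
```

Because every x is smaller than every y, the samples separate completely in the ranks and both tests reject.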
Goodness of fit for continuous distributions
one sample
- empirical cumulative distribution function, compare to theoretical
- R:
ecdf(data)
- R:
- Kolmogorov-Smirnov
- Not quite as good, because it only uses the maximum distance between the empirical and theoretical CDFs (the D statistic)
- Do not estimate the parameters from the same data - doing so invalidates the test's p-values
- R:
ks.test(x, "pnorm")
Two-Sample
- Could have two distributions with the same mean but different shapes.
- R:
ks.test(X, Y)
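Both forms have Python equivalents (scipy assumed available); here the samples are simulated so one test should fail to reject and the other should reject:

```python
from scipy.stats import kstest, ks_2samp, norm

x = norm.rvs(loc=0, scale=1, size=200, random_state=1)
y = norm.rvs(loc=2, scale=1, size=200, random_state=2)

one_sample = kstest(x, 'norm')   # H0: x comes from N(0, 1)
two_sample = ks_2samp(x, y)      # H0: x and y come from the same distribution
```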
Estimating Parameter Values
- R: MASS package,
fitdistr(data, densfun="exponential")
- obtain maximum likelihood estimate
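A Python analogue of `fitdistr(data, densfun="exponential")` using scipy (assumed available); for the exponential, the MLE of the rate is 1 / sample mean:

```python
from scipy.stats import expon

data = [0.3, 1.1, 0.7, 2.5, 0.9, 1.6, 0.4, 1.2]  # hypothetical waiting times
loc, scale = expon.fit(data, floc=0)   # fix the location at 0
rate_mle = 1 / scale                   # scale is the sample mean
```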
Kernel Smoothing Density Function
- Matlab function
- [f,xi,u] = ksdensity(x)
- Computes a probability density estimate of the sample in the vector x
- f is the vector of density values evaluated at the points xi.
- u is the width of the kernel-smoothing window, which is calculated automatically.
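A Python analogue of Matlab's ksdensity, using scipy's Gaussian kernel density estimator (hypothetical sample; bandwidth chosen automatically by Scott's rule):

```python
import numpy as np
from scipy.stats import gaussian_kde

x = np.array([1.0, 1.2, 2.3, 2.1, 3.5, 2.8, 1.9, 2.2])  # sample data
kde = gaussian_kde(x)            # bandwidth computed automatically
xi = np.linspace(0, 5, 100)      # evaluation points
f = kde(xi)                      # density estimate at each point in xi
```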
Linear Discriminant Analysis
Receiver operating characteristic curve
Linear Regression
- Linear Regression - Wikipedia article
- y = X beta + epsilon
- y = the regressand, dependent variable.
- X = the design matrix. x sub i are regressors
- each x sub i has a corresponding coefficient beta sub i; beta sub 0 is called the intercept
- beta = a p-dimensional parameter vector called regression coefficients. In case of line, beta1 is slope and beta0 is y-intercept
- DISTURBANCE TERM - epsilon - an unobserved random variable that adds noise to the linear relationship between the dependent variable and regressors.
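A minimal least-squares sketch of y = X beta + epsilon with numpy, recovering a known slope and intercept from simulated noisy data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 3.0 * x + rng.normal(0, 0.5, size=50)  # true intercept 2, slope 3

# Design matrix: a column of ones (intercept) plus the regressor
X = np.column_stack([np.ones_like(x), x])
beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)

residuals = y - X @ beta   # sum to ~0 because the model includes an intercept
```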
Regression Diagnostics
- multicollinearity -> VIF
- heteroscedasticity -> Scale-Location or Residual vs fitted
- Outliers -> Residuals vs Leverage or Leverage vs Cook's D
- Non-linearity -> Residual vs fitted
- Residual distribution -> Q-Q Plot
- Understanding Regression Diagnostic Plots
- R: Use ggfortify::autoplot()
Eigen vector & Eigen Value
Mixed and Multilevel Models
Set theory symbols
- Set theory symbols
- ∅: \varnothing, empty set
- |: \mid, satisfies the condition
- ∪: \cup, union
- ∩: \cap, intersection
- ∖: \setminus, set difference
- △: \triangle, symmetric difference
- ∈: \in - left side element is in right side set
- ⋅: \cdot, dot product, vector and matrix multiplication, scalar result
- ×: \times, cross product of vectors
- ⊗: \otimes, Kronecker (outer) product of tensors (matrices)
Bayesian Statistics
Maximum Likelihood Estimation
Factor Analysis
- number of variables too large
- deviations or variation that is of most interest
- reduce number of variables
- consider linear combinations of the variables
- keep the combos with large variance
- discard the ones with small variance
- latent variables explain the correlation between outcome variables
- interpretability of factors is sometimes suspect
- Used for exploratory data analysis
- >10 obs per variable
- Group variables into factors such that the variables are highly correlated
- Use PCA to examine latent common factors (1st method)
Principal Component Analysis
- Replace the original observed random variables with uncorrelated linear combinations, resulting in minimum loss of information.
- factor loadings represent the weight of each original variable in each component
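A numpy sketch of the idea: the principal components are the eigenvectors of the covariance matrix, the eigenvalues are the component variances, and the resulting scores are uncorrelated (simulated pair of correlated variables):

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(0, 1, 200)
x2 = 0.9 * x1 + rng.normal(0, 0.3, 200)   # strongly correlated with x1
data = np.column_stack([x1, x2])

centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)               # ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # largest variance first

scores = centered @ eigvecs              # the uncorrelated linear combinations
explained = eigvals / eigvals.sum()      # proportion of variance per component
```

Because the two inputs are highly correlated, the first component captures most of the variance and the second can be discarded with little loss of information.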