Statistics
Contents
- 1 General
- 2 Basic Probability
- 3 Error bars
- 4 Null hypothesis
- 5 Probability Distributions
- 6 Frequentist approach to Statistical Inference
- 7 Bayesian Inference
- 8 Tests for Categorical Data
- 9 T-Test
- 10 Nonparametric tests
- 11 Goodness of fit for continuous distributions
- 12 Estimating Parameter Values
- 13 Kernel Smoothing Density Function
- 14 Linear Discriminant Analysis
- 15 Receiver operating characteristic curve
- 16 Linear Regression
- 17 Eigenvectors & Eigenvalues
- 18 Mixed and Multilevel Models
- 19 Set theory symbols
- 20 Bayesian Statistics
- 21 Maximum Likelihood Estimation
- 22 Factor Analysis
- 23 Principal Component Analysis
General
- degrees of freedom = the number of values in the final calculation that are free to vary.
- residual = for each observation, the difference between that observation and the average of all the observations.
- the sum of the residuals is necessarily 0 (when the model includes a mean/intercept).
- probability mass function = PMF is for DISCRETE random variables
Basic Probability
Joint probability (intersection)
- The probability of both events happening together.
- The joint probability of A and B is written P(A ∩ B), P(AB), or P(A, B)
- LaTeX set intersection sign: \cap; Python set-intersection operator: & (ampersand)
- Joint event = depends on classes from two different variables
- Joint probability distribution for categorical variables - list out in a table, all numbers sum to 1. Marginal tallies is sum of joint probs, ignores one of the variables.
Marginal probability
- Essentially the opposite of conditional probability.
- If there are two possible outcomes for X with corresponding events B and B', this means that P(A) = P(A ∩ B) + P(A ∩ B').
- In a joint probability table, the marginals are the row and column totals
Union of two events
Conditional probability
- The probability of some event A, given the occurrence of some other event B.
- Conditional probability is written P(A | B)
- Joint probability divided by the marginal probability: P(A | B) = P(A ∩ B) / P(B)
- Conditional probability distribution - adds up to 1
- Compare with marginal probability distribution
Variable Independence
- If independent then P(A | B) = P(A)
- Imposing the condition B doesn't affect the probability of A at all.
- In a sample, we would expect the two probabilities to not match up exactly anyway
- If independent then the joint probability/intersection factorizes: P(A ∩ B) = P(A) P(B)
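These relationships can be checked numerically. A minimal Python sketch (numpy assumed available) with a hypothetical 2x2 table of counts for two binary variables:

```python
import numpy as np

# Hypothetical counts for two binary variables A and B
counts = np.array([[30, 10],   # A = 0: (B = 0, B = 1)
                   [20, 40]])  # A = 1: (B = 0, B = 1)
joint = counts / counts.sum()      # joint distribution P(A, B); sums to 1

p_a = joint.sum(axis=1)            # marginal P(A): sum out B
p_b = joint.sum(axis=0)            # marginal P(B): sum out A

# Conditional distribution P(A | B = 1): joint column / marginal; sums to 1
p_a_given_b1 = joint[:, 1] / p_b[1]

# Independence would mean the joint equals the product of the marginals
independent = np.allclose(joint, np.outer(p_a, p_b))
```

Here independence fails: P(A = 1 | B = 1) = 0.8, while the marginal P(A = 1) is only 0.6, so conditioning on B changes the probability of A.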
Error bars
Null hypothesis
The null hypothesis typically corresponds to a general or default position. The practice of science involves formulating and testing hypotheses, assertions that are capable of being proven false using a test of observed data. The null hypothesis can never be proven. A set of data can only reject a null hypothesis or fail to reject it.
Probability Distributions
- Notation convention
- Capital roman letter - a random variable
- Lowercase roman letter - a possible value it might take
- Unknown values represented with greek letters rather than roman
- Probability mass function - gives the probability of different outcomes of the random variable
- PMF for discrete and probability density function (PDF) if continuous. Can view everything as a density, though.
Discrete Distributions
Bernoulli
- Only 2 possible outcomes, success or failure
- X ~ Bernoulli(p): P(X = 1) = p, P(X = 0) = 1 - p
- f(x) = p^x (1 - p)^(1 - x) for x in {0, 1}
- Expected value E[X] = p; variance Var(X) = p(1 - p)
Binomial
- Binomial = generalization of Bernoulli where you have n repeated trials. The sum of n independent Bernoullis
- X ~ Binomial(n, p)
- n = number of trials
- p = probability of success on each trial
- f(x) = (n choose x) p^x (1 - p)^(n - x), for x = 0, 1, ..., n
- "n choose x" is the combinatorial coefficient n! / (x! (n - x)!)
- Expected value E[X] = np
- Variance Var(X) = np(1 - p)
- Binomial approximation standard score z = (x - np) / sqrt(np(1 - p)) - how many standard deviations an observation is above or below the mean.
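The binomial formulas above can be verified numerically; a small Python sketch (scipy assumed available, with an arbitrary n, p, and x):

```python
from math import comb, sqrt
from scipy.stats import binom

n, p, x = 10, 0.3, 4

# pmf by hand: "n choose x" * p^x * (1 - p)^(n - x)
manual_pmf = comb(n, x) * p**x * (1 - p)**(n - x)
scipy_pmf = binom.pmf(x, n, p)

# E[X] = np and Var(X) = np(1 - p)
mean, var = binom.stats(n, p, moments='mv')

# Standard score under the normal approximation to the binomial
z = (x - n * p) / sqrt(n * p * (1 - p))
```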
Geometric
- The number of trials to observe a success
Multinomial
- Generalize bernoulli and binomial to more than one possible outcome
Poisson
- Used for counts
- parameter is the rate at which we expect to observe the thing we are counting
Continuous Distributions
- integral from -inf to inf of f(x) dx = 1: "The probability that something happens = 1"
- f(x) >= 0: densities are non-negative for all possible values of x
- E[X] = integral from -inf to inf of x f(x) dx
- Analogous to the sum we have for discrete variables
Uniform
- X ~ Uniform(a, b)
- f(x) = 1/(b - a) * I(a <= x <= b), where I is the indicator function (condition: x is on the interval)
- E[X] = (a + b) / 2
- Integrate to ascertain probability between two given values
- the probability that X = any single given value is 0: integrating from x to x gives 0
Exponential
- E.g, a bus that comes every 10 minutes, the exponential is your waiting time
- Rate parameter
- Events that occur at a particular rate, and the exponential is the waiting time between events
- f(x) = lambda * e^(-lambda * x) for x >= 0
Normal
Standard Normal
Parameterized Normal with mu and sigma
t distribution
- Use if you don't know the true value of sigma. Replacing sigma with the sample standard deviation adds variability, giving the t heavier tails than the normal
- Its density involves the gamma function
Gamma
- Total waiting time for all events to occur, for more than in random variable with an exponential distribution.
Beta
- Used for random variables which take on values between 0 and 1. Commonly used to model probabilities.
Cumulative Distribution Function
- CDF exists for every distribution
- It's convenient for calculating probabilities of intervals, e.g., P( -1 < X < 1 )
- F(x) = P(X <= x)
- F(x) = sum over t <= x of f(t) - for PMF discrete distributions
- F(x) = integral from -inf to x of f(t) dt, where f(t) is the probability density function
R functions
Normal distribution
dnorm(x, mean, sd)
- Evaluate the PDF at x
pnorm(q, mean, sd)
- Evaluate the CDF at q
qnorm(p, mean, sd)
- Evaluate the quantile function at p
rnorm(n, mean, sd)
- Draw n pseudo-random samples from the normal distribution.
Various distributions
dbinom(x, size, prob)
dpois(x, lambda)
dexp(x, rate)
dgamma(x, shape, rate)
dunif(x, min, max)
dbeta(x, shape1, shape2)
dnorm(x, mean, sd)
dt(x, df)
where df is the degrees of freedom
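R's d/p/q/r naming convention maps directly onto scipy's pdf/cdf/ppf/rvs methods; a Python sketch for the normal distribution (scipy assumed available):

```python
from scipy.stats import norm

# R: dnorm / pnorm / qnorm / rnorm  ->  scipy: pdf / cdf / ppf / rvs
density    = norm.pdf(0, loc=0, scale=1)        # dnorm(0)
cumulative = norm.cdf(1.96, loc=0, scale=1)     # pnorm(1.96)
quantile   = norm.ppf(0.975, loc=0, scale=1)    # qnorm(0.975)
samples    = norm.rvs(loc=0, scale=1, size=5, random_state=0)  # rnorm(5)
```

The other distributions listed above follow the same pattern, e.g. `scipy.stats.binom`, `poisson`, `expon`, `gamma`, `uniform`, `beta`, and `t`.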
Frequentist approach to Statistical Inference
- View the data as a RANDOM sample from some larger population
- reference population - the larger group that you're trying to generalize to based on the sample
Confidence Intervals
- Central Limit Theorem says the sum of all the xi's will follow approximately a Gaussian distribution
- 95% of the time we get a result within 1.96 standard deviations of the mean.
- Use mean and standard deviation to define confidence intervals for a given level of confidence
- We're 95% confident that the true population wide mean is on that interval
- E.g., "It's plausible (supported by the data) that the coin is fair because 0.5 is on the interval."
- Frequentist interpretation: in a hypothetical infinite sequence of repetitions of this trial, each time creating a confidence interval based on the data we observe, on average 95% of the intervals we make will contain the true value of p.
- But what about THIS interval? Does this interval contain the true value of p? What's the probability that this interval contains the true p? From the frequentist perspective, the probability that p is on the interval is either 0 or 1
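A minimal sketch of the 95% normal-approximation interval for a proportion, with a hypothetical coin-flip sample:

```python
from math import sqrt

# Hypothetical data: 44 heads in 100 flips
n, heads = 100, 44
p_hat = heads / n
se = sqrt(p_hat * (1 - p_hat) / n)           # standard error of the proportion
lower, upper = p_hat - 1.96 * se, p_hat + 1.96 * se

# 0.5 falls inside the interval, so a fair coin is plausible here
contains_half = lower <= 0.5 <= upper
```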
Bernoulli Example
- Two coins: a loaded coin with P(heads) = 0.7, and a fair coin
- What is the probability that the coin is loaded after you observe 5 trials?
- Which coin do you think it is, and how sure are you about that?
- X = number of heads in 5 flips; X ~ Binomial(5, ?)
- Theta is the unknown parameter, theta in { fair, loaded }
- P(heads) = 0.5 if theta = fair, 0.7 if theta = loaded
- The data is 5 flips and the question is what's that probability.
- f(x | theta) = (5 choose x) (0.5)^5 * I(theta = fair) + (5 choose x) (0.7)^x (0.3)^(5-x) * I(theta = loaded)
- This is the likelihood function using indicator (step) function notation
- Say we observe two heads, X = 2; what's our likelihood (using f as notation instead of l)?
- Plug in X = 2: get 0.3125 if theta = fair and 0.1323 if theta = loaded
- Having observed two heads, we can say that the likelihood is higher for theta = fair than theta = loaded.
- MLE: theta-hat = fair - given the data, it's most likely that the coin is fair
- This is a point estimate, but how to answer the question "how sure are you?"
- Another question: what is P(theta = fair | X = 2)?
- In the frequentist paradigm, the coin is a physical quantity; it's a fixed coin, and therefore has a fixed probability of coming up heads
- In this interpretation theta is fixed, and the probability P(theta = fair | X = 2) is either 0 or 1
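The two likelihood values quoted above can be reproduced directly from the binomial pmf:

```python
from math import comb

def likelihood(x, n, p):
    # Binomial likelihood: (n choose x) p^x (1 - p)^(n - x)
    return comb(n, x) * p**x * (1 - p)**(n - x)

lik_fair   = likelihood(2, 5, 0.5)   # theta = fair
lik_loaded = likelihood(2, 5, 0.7)   # theta = loaded
mle = 'fair' if lik_fair > lik_loaded else 'loaded'
```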
Bayesian Inference
- An advantage of the Bayesian approach is that it allows you to easily incorporate prior information, when you know something in advance of looking at the data.
- Bayesian approach is inherently subjective
Bernoulli Example
- What is the probability that the coin is loaded before and after you observe 5 trials?
- Prior: P(theta = fair) = 0.4, P(theta = loaded) = 0.6
- f(theta | x) = f(x | theta) f(theta) / sum over theta of f(x | theta) f(theta)
- f(theta | x) is the updated posterior
- Here we state the likelihood function as f(x | theta) rather than l(theta)
- Denominator is the sum over all possibilities of theta, of which there are only two: theta = fair and theta = loaded. The denominator is the normalizing constant so that the terms in the numerator sum to 1.
- Numerator: multiply the likelihood under theta = fair times the prior probability that the coin is fair, i.e., 0.4, and the likelihood under theta = loaded times the prior probability that the coin is loaded, i.e., 0.6.
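Carrying the arithmetic through in Python with the stated prior (0.4 fair, 0.6 loaded) and, as an assumption, the X = 2 heads observed in the earlier frequentist example:

```python
from math import comb

def likelihood(x, n, p):
    # Binomial likelihood: (n choose x) p^x (1 - p)^(n - x)
    return comb(n, x) * p**x * (1 - p)**(n - x)

x, n = 2, 5                          # assumed data: 2 heads in 5 flips
prior_fair, prior_loaded = 0.4, 0.6

num_fair   = likelihood(x, n, 0.5) * prior_fair
num_loaded = likelihood(x, n, 0.7) * prior_loaded
norm_const = num_fair + num_loaded   # the denominator / normalizing constant

post_fair   = num_fair / norm_const
post_loaded = num_loaded / norm_const
```

With these numbers the posterior probability that the coin is loaded drops from the prior 0.6 to about 0.39, and unlike the frequentist paradigm this is a direct probability statement about theta.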
Tests for Categorical Data
- Goodness of fit for single categorical variable
- compare observed counts to the expected counts; the "contribution terms" are (O - E)^2 / E for each category
- test statistic X^2 = sum of (O - E)^2 / E - the relative distance the observed are from the expected
- Get p-values from the chi-squared distribution with k - 1 deg freedom, where k = number of categories (i.e., classes)
- If the null hypothesis is true, observed is close to expected
- Test statistic has a chi-squared distribution.
proc freq data=<whatevs>; tables var1 / chisq; tables var2 / chisq testp=(values); * or testf=(values);
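The same goodness-of-fit test can be sketched in Python (scipy assumed available; hypothetical counts for k = 3 categories):

```python
from scipy.stats import chisquare

observed = [30, 45, 25]
expected = [33, 33, 34]   # must sum to the same total as the observed counts

# stat = sum over categories of (O - E)^2 / E; df = k - 1 = 2
stat, p_value = chisquare(observed, f_exp=expected)
```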
Tests for two-way variables
- test for homogeneity - distribution of proportions are the same across the populations
- test of independence - are the two variables independent of each other within a single population?
proc freq data=<whatevs>; tables var1 / chisq fisher; tables var2 / chisq cellchi2;
- Use Fisher's exact test if the sample size (expected cell counts) is small.
- R:
fisher.test(table)
- cellchi2 is the cell contribution - how far the observed is from the expected on a per-cell basis
- the weight statement names the count variable when the data are already tabulated
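A Python sketch of Fisher's exact test (scipy assumed available) on a hypothetical 2x2 table with counts small enough that the chi-squared approximation would be unreliable:

```python
from scipy.stats import fisher_exact

table = [[3, 9],
         [8, 2]]
odds_ratio, p_value = fisher_exact(table, alternative='two-sided')
```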
T-Test
- "Student's t-distribution"
- When data are normally distributed
- Can test hypotheses about the mean/center of the distribution
One-sample t-test
- Test is mean greater than/less than/equal to some value
- SAS proc means
Two-Sample t-test
- Test whether two population means are equal.
- Unpaired or independent samples t-test: Are the variances the same?
- If no, it's called an "unequal variances t-test" or "Welch's t-test"
- If yes it's called a "pooled t-test" or "Student's t-test"
- F-statistic tests whether the variances are equal
- Paired or repeated-measures t-test - the before and after observations are subtracted; is the difference different from zero?
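The three variants above can be sketched in Python (scipy assumed available; the measurements are hypothetical):

```python
from scipy.stats import ttest_ind, ttest_rel

# Hypothetical measurements from two groups
a = [5.1, 4.9, 5.4, 5.0, 5.2, 4.8]
b = [5.6, 5.8, 5.5, 5.9, 5.7, 5.4]

pooled = ttest_ind(a, b, equal_var=True)    # pooled / Student's t-test
welch  = ttest_ind(a, b, equal_var=False)   # unequal variances / Welch's t-test
paired = ttest_rel(a, b)                    # paired: is the mean difference 0?
```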
Nonparametric tests
- Hypothesis testing when you can't assume data comes from normal distribution
- a lot of non-parametric approaches are based on ranks
- do not depend on normality
- Whereas the other tests are really tests for means, nonparametric tests are actually for medians
One-sample tests
- SAS proc univariate for these
- Sign test
- The sign test is necessarily one-sample, so if you give the function call two samples, it will assume it's a paired dataset
- PAIRED observations with test x > y, x = y, or x < y.
- Test for consistent differences between pairs, such as the weight of subjects pre and post treatment.
- Does one member of the pair tend to be greater than the other?
- Does NOT assume symmetric distribution of the differences around the median
- Does NOT use the magnitude of the difference
- Wilcoxon Signed Ranks Test
- A quantitative Sign Test
- DOES use magnitude of difference of paired observations
- Confidence interval based on signed rank test
- what are the set of values for which you wouldn't have rejected the null hypothesis
Two or more sample nonparametric tests
- Compare the centers of two or more distributions that are continuous, but not normal/gaussian.
- use deviations from the median and use the signed ranks
- SAS:
proc npar1way wilcoxon
- Class variable is used for the two or more groups
- Otherwise use
proc npar1way anova
- Wilcoxon Rank Sum Test/Mann-Whitney U statistic
- Null hypothesis: it is equally likely that a randomly selected value from one sample will be less than or greater than a randomly selected value from a second sample.
- Nonparametric analogue of the two-sample t-test
- R:
wilcox.test
- Intermix the observations, sort, and rank all observations. Then take mean rank for both populations.
- Can also do confidence interval
- Kruskal-Wallis
- Non-parametric method for testing whether samples originate from the same distribution
- equivalent to One-way ANOVA
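Both rank-based tests are available in Python as well (scipy assumed available; hypothetical samples where the y values tend to be larger):

```python
from scipy.stats import mannwhitneyu, kruskal

x = [1.2, 3.4, 2.2, 4.1, 2.9]
y = [5.5, 6.1, 4.9, 7.2, 5.8]

u_stat, u_p = mannwhitneyu(x, y, alternative='two-sided')  # Wilcoxon rank sum
h_stat, h_p = kruskal(x, y)                                # Kruskal-Wallis
```

Because every x is smaller than every y, the samples separate completely in the ranks and both tests reject.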
Goodness of fit for continuous distributions
one sample
- empirical cumulative distribution function, compare to theoretical
- R:
ecdf(data)
- R:
- Kolmogorov-Smirnov
- Not quite as good, because it only uses the maximum distance between the empirical and theoretical CDFs (the D statistic)
- Do not estimate the parameters from the same data - doing so invalidates the test's p-values
- R:
ks.test(x, "pnorm")
Two-Sample
- Could have two distributions with the same mean but different shapes.
- R:
ks.test(X, Y)
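Both forms have Python equivalents (scipy assumed available); here the samples are simulated so one test should fail to reject and the other should reject:

```python
from scipy.stats import kstest, ks_2samp, norm

x = norm.rvs(loc=0, scale=1, size=200, random_state=1)
y = norm.rvs(loc=2, scale=1, size=200, random_state=2)

one_sample = kstest(x, 'norm')   # H0: x comes from N(0, 1)
two_sample = ks_2samp(x, y)      # H0: x and y come from the same distribution
```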
Estimating Parameter Values
- R: MASS package,
fitdistr(data, densfun="exponential")
- obtain maximum likelihood estimate
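A Python analogue of `fitdistr(data, densfun="exponential")` using scipy (assumed available); for the exponential, the MLE of the rate is 1 / sample mean:

```python
from scipy.stats import expon

data = [0.3, 1.1, 0.7, 2.5, 0.9, 1.6, 0.4, 1.2]  # hypothetical waiting times
loc, scale = expon.fit(data, floc=0)   # fix the location at 0
rate_mle = 1 / scale                   # scale is the sample mean
```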
Kernel Smoothing Density Function
- Matlab function
- [f,xi,u] = ksdensity(x)
- Computes a probability density estimate of the sample in the vector x
- f is the vector of density values evaluated at the points xi.
- u is the width of the kernel-smoothing window, which is calculated automatically.
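A Python analogue of Matlab's ksdensity, using scipy's Gaussian kernel density estimator (hypothetical sample; bandwidth chosen automatically by Scott's rule):

```python
import numpy as np
from scipy.stats import gaussian_kde

x = np.array([1.0, 1.2, 2.3, 2.1, 3.5, 2.8, 1.9, 2.2])  # sample data
kde = gaussian_kde(x)            # bandwidth computed automatically
xi = np.linspace(0, 5, 100)      # evaluation points
f = kde(xi)                      # density estimate at each point in xi
```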
Linear Discriminant Analysis
Receiver operating characteristic curve
Linear Regression
- Linear Regression - Wikipedia article
- y = X beta + epsilon
- y = the regressand, dependent variable.
- X = the design matrix. x sub i are regressors
- each x sub i has a corresponding coefficient beta sub i; beta sub 0 is called the intercept
- beta = a p-dimensional parameter vector called regression coefficients. In case of line, beta1 is slope and beta0 is y-intercept
- DISTURBANCE TERM - epsilon - an unobserved random variable that adds noise to the linear relationship between the dependent variable and regressors.
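A minimal least-squares sketch of y = X beta + epsilon with numpy, recovering a known slope and intercept from simulated noisy data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 3.0 * x + rng.normal(0, 0.5, size=50)  # true intercept 2, slope 3

# Design matrix: a column of ones (intercept) plus the regressor
X = np.column_stack([np.ones_like(x), x])
beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)

residuals = y - X @ beta   # sum to ~0 because the model includes an intercept
```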
Regression Diagnostics
- multicollinearity -> VIF
- heteroscedasticity -> Scale-Location or Residual vs fitted
- Outliers -> Residuals vs Leverage or Leverage vs Cook's D
- Non-linearity -> Residual vs fitted
- Residual distribution -> Q-Q Plot
- Understanding Regression Diagnostic Plots
- R: Use ggfortify::autoplot()
Eigen vector & Eigen Value
Mixed and Multilevel Models
Set theory symbols
- Set theory symbols
- ∅: \varnothing, empty set
- |: \mid, satisfies the condition
- ∪: \cup, union
- ∩: \cap, intersection
- ∖: \setminus, set difference
- △: \triangle, symmetric difference
- ∈: \in - left side element is in right side set
- ⋅: \cdot, dot product, vector and matrix multiplication, scalar result
- ×: \times, cross product of vectors
- ⊗: \otimes, Kronecker (outer) product of tensors (matrices)
Bayesian Statistics
Maximum Likelihood Estimation
Factor Analysis
- number of variables too large
- deviations or variation that is of most interest
- reduce number of variables
- consider linear combinations of the variables
- keep the combos with large variance
- discard the ones with small variance
- latent variables explain the correlation between outcome variables
- interpretability of factors is sometimes suspect
- Used for exploratory data analysis
- >10 obs per variable
- Group variables into factors such that the variables are highly correlated
- Use PCA to examine latent common factors (1st method)
Principal Component Analysis
- Replace the original observed random variables with uncorrelated linear combinations, resulting in minimum loss of information.
- factor loadings represent the weight of each original variable in each component
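A numpy sketch of the idea: the principal components are the eigenvectors of the covariance matrix, the eigenvalues are the component variances, and the resulting scores are uncorrelated (simulated pair of correlated variables):

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(0, 1, 200)
x2 = 0.9 * x1 + rng.normal(0, 0.3, 200)   # strongly correlated with x1
data = np.column_stack([x1, x2])

centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)               # ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # largest variance first

scores = centered @ eigvecs              # the uncorrelated linear combinations
explained = eigvals / eigvals.sum()      # proportion of variance per component
```

Because the two inputs are highly correlated, the first component captures most of the variance and the second can be discarded with little loss of information.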