Statistics
Contents
- 1 General
- 2 Basic Probability
- 3 Error bars
- 4 Null hypothesis
- 5 Probability Distributions
- 6 Tests for Categorical Data
- 7 T-Test
- 8 Nonparametric tests
- 9 Goodness of fit for continuous distributions
- 10 Estimating Parameter Values
- 11 Kernel Smoothing Density Function
- 12 Linear Discriminant Analysis
- 13 Receiver operating characteristic curve
- 14 Linear Regression
- 15 Eigenvector & Eigenvalue
- 16 Maximum Likelihood Estimate
- 17 Set theory symbols
- 18 Bayes
- 19 Factor Analysis
- 20 Principal Component Analysis
General
- degrees of freedom = the number of values in the final calculation that are free to vary.
- residuals = for each observation, the residual is the difference between that observation and the average of all the observations.
- the sum of the residuals is necessarily 0.
- probability mass function = PMF is for DISCRETE random variables
Basic Probability
Joint probability (intersection)
- The probability of both events happening together.
- The joint probability of A and B is written P(A ∩ B), P(AB), or P(A, B)
- LaTeX set intersection sign: \cap - Python operator: & (ampersand)
- Joint event = depends on classes from two different variables
- Joint probability distribution for categorical variables - list out in a table, all numbers sum to 1. Marginal tallies is sum of joint probs, ignores one of the variables.
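A minimal sketch in R (the vectors x and y are made-up data): tabulate two categorical variables into a joint probability table, then read the marginals off the row and column sums.
# two categorical variables (hypothetical data)
x <- c("a", "a", "b", "b", "b", "a")
y <- c("u", "v", "u", "u", "v", "v")
joint <- prop.table(table(x, y))  # joint probabilities; all cells sum to 1
rowSums(joint)                    # marginal distribution of x
colSums(joint)                    # marginal distribution of y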
Marginal probability
- Essentially the opposite of conditional probability: sum the joint probabilities over the other variable, ignoring it.
- If there are two possible outcomes for X with corresponding events B and B', this means that P(A) = P(A ∩ B) + P(A ∩ B').
- The row and column totals of the joint probability table give the marginal probabilities.
Union of two events
- P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
Conditional probability
- The probability of some event A, given the occurrence of some other event B.
- Conditional probability is written P(A | B)
- Joint probability divided by the marginal probability: P(A | B) = P(A ∩ B) / P(B)
- Conditional probability distribution - adds up to 1
- Compare with marginal probability distribution
Variable Independence
- If independent then P(A | B) = P(A)
- Imposing the condition B doesn't affect the probability of A at all.
- In a sample, we would expect the two probabilities to differ slightly even under independence
- If independent then the joint probability/intersection/upside-down U: P(A ∩ B) = P(A) * P(B)
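Continuing the sketch above, independence can be checked informally by comparing the joint table to the product of the marginals; in a sample the two will rarely match exactly.
indep <- outer(rowSums(joint), colSums(joint))  # P(A) * P(B) under independence
round(joint - indep, 3)                         # entries near zero suggest independence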
Error bars
Null hypothesis
The null hypothesis typically corresponds to a general or default position. The practice of science involves formulating and testing hypotheses, assertions that are capable of being proven false using a test of observed data. The null hypothesis can never be proven. A set of data can only reject a null hypothesis or fail to reject it.
Probability Distributions
- Notation convention
- Capital roman letter - a random variable
- Lowercase roman letter - a possible value it might take
- Unknown values represented with greek letters rather than roman
- Probability mass function - gives the probability of different outcomes of the random variable
- PMF for discrete and probability density function (PDF) if continuous. Can view everything as a density, though.
Discrete Distributions
Bernoulli
- Only 2 possible outcomes, success or failure
- X ~ Bernoulli(p); f(x | p) = p^x * (1-p)^(1-x) for x in {0, 1}
- Expected value E[X] = p
- Var(X)=p(1-p)
Binomial
- Binomial = Generalization of Bernoulli where you have N repeated trials. The sum of N independent Bernoullis
- X ~ Bin( n, p )
- n = number of trials
- P( X=x | p ) = f(x|p) = (n choose x) * p^x * (1-p) ^ (n-x)
- N choose x is combinatorial = n! / ( x!*(n-x)!)
- Expected value E[X] = np
- Var(X)=np(1-p)
- Binomial approximation standard score z = (x - np) / sqrt(np(1-p)) - how many standard deviations an observation is above or below the mean.
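A sketch in R (the numbers are arbitrary): the binomial pmf and the normal-approximation standard score.
dbinom(3, size = 10, prob = 0.5)     # P(X = 3) for X ~ Bin(10, 0.5)
pbinom(3, size = 10, prob = 0.5)     # P(X <= 3)
n <- 10; p <- 0.5; x <- 8
(x - n * p) / sqrt(n * p * (1 - p))  # standard deviations above/below the mean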
Geometric
- The number of trials to observe a success
Multinomial
- Generalizes Bernoulli and binomial to more than two possible outcomes
Poisson
- Used for counts
- parameter lambda > 0 is the rate at which we expect to observe the thing we are counting
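A sketch in R with arbitrary parameters; note that R's dgeom counts failures before the first success, so the trial count is shifted by one.
dpois(2, lambda = 4)       # P(X = 2) for a Poisson count with rate 4
dgeom(3 - 1, prob = 0.5)   # P(first success on trial 3) when p = 0.5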
Continuous Distributions
- Integral from -inf to inf = 1: "The probability that something happens = 1"
- f(x) >= 0: Densities are non-negative for all possible values of x
- E[X] = Integral from -inf to inf x f(x) dx
- Analogous to the sum we have in discrete variables
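Both properties can be checked numerically in R, here for the standard normal density (a sketch):
integrate(dnorm, -Inf, Inf)                     # total probability = 1
integrate(function(x) x * dnorm(x), -Inf, Inf)  # E[X] = 0 for the standard normal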
Uniform
- X ~ Uniform(a, b)
- f(x | a, b) = 1/(b - a) times the indicator function, where the condition is that x is on the interval [a, b]
- E[X] = (a + b) / 2
- Integrate to ascertain probability between two given values
- probability that X equals any single given value is vanishingly small: integrating from x to x gives 0
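A sketch in R for X ~ Uniform(0, 1):
punif(0.5) - punif(0.2)  # P(0.2 <= X <= 0.5) = 0.3
punif(0.4) - punif(0.4)  # P(X = 0.4) = 0: integrating from x to x gives 0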
Exponential
- E.g., a bus that comes on average every 10 minutes: the exponential is your waiting time
- Rate parameter lambda
- Events that occur at a particular rate, and the exponential is the waiting time between events
- f(x | lambda) = lambda * exp(-lambda * x) for x >= 0
- E[X] = 1 / lambda
- Var[X] = 1 / lambda^2
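For the bus example, a sketch in R: a bus every 10 minutes on average means rate lambda = 1/10 per minute.
pexp(15, rate = 1/10, lower.tail = FALSE)  # P(wait > 15 minutes)
1 / (1 / 10)                               # E[X] = 1/lambda = 10 minutes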
Normal
Standard Normal
- Z ~ N(0, 1)
- f(z) = (1 / sqrt(2 * pi)) * exp(-z^2 / 2)
- E[Z] = 0
- Var[Z] = 1
Parameterized Normal with mu and sigma
- X ~ N(mu, sigma^2)
- f(x | mu, sigma^2) = (1 / sqrt(2 * pi * sigma^2)) * exp(-(x - mu)^2 / (2 * sigma^2))
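A sketch in R with arbitrary parameter values; note that R's dnorm/pnorm take the standard deviation sigma, not the variance.
dnorm(6, mean = 5, sd = 2)  # density of N(5, 4) at x = 6
pnorm(1.2)                  # P(Z <= 1.2) for the standard normal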
t distribution
- Use if you don't know the true value of sigma. Replacing sigma with the sample standard deviation adds extra variability, giving the t distribution heavier tails than the normal.
- Derivation uses the gamma (chi-squared) distribution for the sample variance
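The heavier tails show up in the critical values, e.g. in R:
qt(0.975, df = 5)  # about 2.57
qnorm(0.975)       # about 1.96; the t converges to this as df grows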
Gamma
- Total waiting time for several events to occur: the sum of n independent exponential(lambda) waiting times has a Gamma(n, lambda) distribution.
Beta
- Used for random variables which take on values between 0 and 1. Commonly used to model probabilities.
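A sketch in R with arbitrary parameters for both distributions:
pgamma(45, shape = 3, rate = 0.1)   # P(total wait for 3 events at rate 0.1 is <= 45)
dbeta(0.7, shape1 = 2, shape2 = 5)  # Beta density at the probability value 0.7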
Tests for Categorical Data
- Goodness of fit for single categorical variable
- compare observed counts to the expected counts; the "contribution terms" are (observed - expected)^2 / expected for each category
- Get the relative distance of the observed counts from the expected counts
- Get p-values from chi squared distribution with k-1 deg freedom, where k = number of categories (i.e., classes)
- If null hypothesis is true, observed is close to expected
- Test statistic has chi-squared distribution.
proc freq data=<whatevs>; tables var1 / chisq testp=(values); tables var2 / chisq testf=(values); run;
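The same goodness-of-fit test in R, with made-up counts and hypothesized proportions:
observed <- c(30, 50, 20)
chisq.test(observed, p = c(0.25, 0.50, 0.25))  # k - 1 = 2 degrees of freedom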
Tests for two-way variables
- test for homogeneity - distribution of proportions are the same across the populations
- test of independence - are the two categorical variables associated within a single population?
proc freq data=<whatevs>; tables var1*var2 / chisq fisher cellchi2; run;
- Use Fisher's exact test if the sample size is small.
- R:
fisher.test(table)
- cellchi2 is cell contribution - how far the observed from the expected on a per cell basis
- The SAS weight statement indicates the count variable when the data are already tabulated
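A sketch in R with a hypothetical 2x2 table of counts:
tab <- matrix(c(12, 5, 9, 14), nrow = 2)
chisq.test(tab)   # chi-squared test of independence
fisher.test(tab)  # exact test, preferred when counts are small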
T-Test
- "Student's t-distribution"
- When data are normally distributed
- Can test hypotheses about the mean/center of the distribution
One-sample t-test
- Test whether the mean is greater than/less than/equal to some value
- SAS proc means
Two-Sample t-test
- Test whether two population means are equal.
- Unpaired or independent samples t-test: Are the variances the same?
- If no, it's called a "two-sample t-test", "unequal variances t-test", or "Welch's t-test"
- If yes it's called a "pooled t-test" or "Student's t-test"
- F-statistic tests whether the variances are equal
- Paired or repeated measurements t-test - observations before and after are subtracted; is the difference different from zero?
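The corresponding calls in R (x, y, before, and after are hypothetical numeric vectors):
t.test(x, mu = 10)                    # one-sample: is the mean 10?
t.test(x, y)                          # Welch's / unequal variances t-test (the default)
t.test(x, y, var.equal = TRUE)        # pooled / Student's t-test
t.test(before, after, paired = TRUE)  # paired t-test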
Nonparametric tests
- Hypothesis testing when you can't assume data comes from normal distribution
- a lot of non-parametric approaches are based on ranks
- do not depend on normality
- Whereas the other tests are really tests for means, nonparametric tests are actually for medians
One-sample tests
- SAS proc univariate for these
- Sign test
- Sign test is necessarily one sample, so if you give the function call two samples, it will assume it's a paired dataset
- PAIRED observations with test x > y, x = y, or x < y.
- Test for consistent differences between pairs, such as the weight of subjects pre and post treatment.
- Does one member of the pair tend to be greater than the other?
- Does NOT assume symmetric distribution of the differences around the median
- Does NOT use the magnitude of the difference
- Wilcoxon Signed Ranks Test
- A quantitative Sign Test
- DOES use magnitude of difference of paired observations
- Confidence interval based on signed rank test
- What is the set of values for which you wouldn't have rejected the null hypothesis?
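A sketch in R (before and after are hypothetical paired measurements); the sign test reduces to a binomial test on the signs of the differences.
diffs <- after - before
binom.test(sum(diffs > 0), sum(diffs != 0))  # sign test, ignoring ties
wilcox.test(diffs, conf.int = TRUE)          # signed rank test with confidence interval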
Two or more sample nonparametric tests
- Compare the centers of two or more distributions that are continuous, but not normal/gaussian.
- use deviations from the median and use the signed ranks
- SAS:
proc npar1way wilcoxon
- Class variable is used for the two or more groups
- Otherwise use
proc npar1way anova
- Wilcoxon Rank Sum Test/Mann-Whitney U statistic
- Null hypothesis: it is equally likely that a randomly selected value from one sample will be less than or greater than a randomly selected value from a second sample.
- Equivalent of unequal variances t-test
- R:
wilcox.test
- Intermix the observations, sort, and rank all observations. Then take mean rank for both populations.
- Can also do confidence interval
- Kruskal-Wallis
- Non-parametric method for testing whether samples originate from the same distribution
- equivalent to One-way ANOVA
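The corresponding R calls (x, y, and a data frame df with columns value and group are hypothetical):
wilcox.test(x, y, conf.int = TRUE)      # rank sum / Mann-Whitney test
kruskal.test(value ~ group, data = df)  # two or more groups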
Goodness of fit for continuous distributions
one sample
- empirical cumulative distribution function, compare to theoretical
- R:
ecdf(data)
- Kolmogorov-Smirnov
- Not quite as good, because the test statistic is just the maximum deviation between the empirical and theoretical CDFs
- Do not estimate parameters from the data
- R:
ks.test(x, y = "pnorm")  # y names the theoretical CDF
Two-Sample
- Could have two distributions with the same mean but different shapes.
- R:
ks.test(X, Y)
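A sketch in R using simulated data for both the one- and two-sample versions:
z <- rnorm(100)
plot(ecdf(z)); curve(pnorm(x), add = TRUE, col = "red")  # empirical vs theoretical CDF
ks.test(z, "pnorm")                 # one-sample, fully specified null distribution
ks.test(z, rnorm(100, mean = 0.5))  # two-sample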
Estimating Parameter Values
- R: MASS package,
fitdistr(data, densfun="exponential")
- obtain maximum likelihood estimate
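For example, fitting simulated exponential waiting times (a sketch):
library(MASS)
waits <- rexp(200, rate = 0.1)
fitdistr(waits, densfun = "exponential")  # MLE of the rate, should be near 0.1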
Kernel Smoothing Density Function
- Matlab function
- [f,xi,u] = ksdensity(x)
- Computes a probability density estimate of the sample in the vector x
- f is the vector of density values evaluated at the points xi.
- u is the width of the kernel-smoothing window, which is calculated automatically.
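R's analogous function is density() (a sketch, not the Matlab call above; x is a hypothetical numeric vector):
d <- density(x)  # kernel density estimate
d$bw             # the computed kernel bandwidth (Matlab's u)
plot(d)          # the density values f evaluated at the points d$x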
Linear Discriminant Analysis
Receiver operating characteristic curve
Linear Regression
- Linear Regression - Wikipedia article
- y = X beta + epsilon
- y = the regressand, dependent variable.
- X = the design matrix. x sub i are regressors
- each x sub i has a corresponding coefficient beta sub i; the constant regressor 1 corresponds to the intercept
- beta = a p-dimensional parameter vector called regression coefficients. In the case of a line, beta1 is the slope and beta0 is the y-intercept
- DISTURBANCE TERM - epsilon - an unobserved random variable that adds noise to the linear relationship between the dependent variable and regressors.
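A sketch in R (df is a hypothetical data frame with columns y and x):
fit <- lm(y ~ x, data = df)
coef(fit)     # beta0 (intercept) and beta1 (slope)
summary(fit)  # coefficients, residual standard error, R-squared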
Regression Diagnostics
- multicollinearity -> VIF
- heteroscedasticity -> Scale-Location or Residual vs fitted
- Outliers -> Residuals vs Leverage or Leverage vs Cook's D
- Non-linearity -> Residual vs fitted
- Residual distribution -> Q-Q Plot
- Understanding Regression Diagnostic Plots
- R: Use ggfortify::autoplot
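Continuing the sketch above; the car package is one common choice for VIF (x1 and x2 are hypothetical regressors):
plot(fit)      # residuals vs fitted, Q-Q, scale-location, residuals vs leverage
library(ggfortify)
autoplot(fit)  # ggplot2 versions of the same diagnostic plots
library(car)
vif(lm(y ~ x1 + x2, data = df))  # variance inflation factors for multicollinearity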
Eigenvector & Eigenvalue
Maximum Likelihood Estimate
Set theory symbols
- Set theory symbols
- ∅ : \varnothing, empty set
- | : \mid, satisfies the condition
- ∪ : \cup, union
- ∩ : \cap, intersection
- ∖ : \setminus, set difference
- △ : \triangle, symmetric difference
- ∈ : \in - left side element is in right side set
- ⋅ : \cdot, dot product, vector and matrix multiplication, scalar result
- × : \times, cross product of vectors
- ⊗ : \otimes, Kronecker (outer) product of a tensor (matrix)
Bayes
Factor Analysis
- number of variables too large
- deviations or variation that is of most interest
- reduce number of variables
- consider linear combinations of the variables
- keep the combos with large variance
- discard the ones with small variance
- latent variables explain the correlation between outcome variables
- interpretability of factors is sometimes suspect
- Used for exploratory data analysis
- >10 obs per variable
- Group variables into factors such that the variables are highly correlated
- Use PCA to examine latent common factors (1st method)
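A sketch in R using the built-in maximum-likelihood factor analysis (df is a hypothetical numeric data frame):
fa <- factanal(df, factors = 2)  # extract 2 latent factors
fa$loadings                      # correlations between observed variables and factors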
Principal Component Analysis
- Replace the original observed random variables with uncorrelated linear combinations that result in minimum loss of information.
- factor loadings, which represent the weight of each original variable in a component
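A sketch in R (df as above):
pc <- prcomp(df, scale. = TRUE)  # standardize, then rotate to uncorrelated components
summary(pc)                      # variance explained: keep the large, drop the small
pc$rotation                      # loadings: weights of the original variables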