Statistics
Revision as of 23:57, 1 October 2019
Contents
- 1 General
- 2 Basic Probability
- 3 Error bars
- 4 null hypothesis
- 5 binomial probability
- 6 Tests for Categorical Data
- 7 T-Test
- 8 Nonparametric tests
- 9 Goodness of fit for continuous distributions
- 10 Estimating Parameter Values
- 11 Kernel Smoothing Density Function
- 12 Linear Discriminant Analysis
- 13 Receiver operating characteristic curve
- 14 Linear Regression
- 15 Eigenvector & Eigenvalue
- 16 Maximum Likelihood Estimate
- 17 Set theory symbols
- 18 Bayes
- 19 Factor Analysis
- 20 Principal Component Analysis
General
- degrees of freedom = the number of values in the final calculation that are free to vary.
- residuals = for each observation, the residual is the difference between that observation and the mean of all the observations.
- the sum of the residuals is necessarily 0.
- probability mass function = the pmf is for DISCRETE random variables
Basic Probability
Conditional probability
- The probability of some event A, given the occurrence of some other event B.
- Conditional probability is written P(A|B), and is read "the (conditional) probability of A, given B"
Joint probability
- The probability of both events happening together.
- The joint probability of A and B is written P(A ∩ B), P(AB), or P(A, B)
- LaTeX set intersection sign: \cap
- Joint event = depends on classes from two different variables
- Joint probability distribution for categorical variables - list out in a table, all numbers sum to 1. Marginal tallies is sum of joint probs, ignores one of the variables.
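As a sketch of the table idea above (the numbers are made up for illustration; any joint table whose entries sum to 1 works), a joint distribution and its marginal tallies in Python:

```python
# Joint probability table for two hypothetical categorical variables
# (weather x traffic); entries must sum to 1.
joint = {
    ("rain", "heavy"): 0.30, ("rain", "light"): 0.10,
    ("dry",  "heavy"): 0.15, ("dry",  "light"): 0.45,
}

def marginal(joint, index, value):
    """Marginal tally: sum joint probabilities, ignoring the other variable."""
    return sum(p for key, p in joint.items() if key[index] == value)

p_rain = marginal(joint, 0, "rain")    # 0.30 + 0.10 = 0.40
p_heavy = marginal(joint, 1, "heavy")  # 0.30 + 0.15 = 0.45
```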
Marginal probability
- Essentially the opposite of conditional probability: the probability of an event ignoring (summing over) the other variable.
- If there are two possible outcomes for X with corresponding events B and B′, this means that P(A) = P(A ∩ B) + P(A ∩ B′).
- column rank and row rank
Union of two events
- P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
Error bars
null hypothesis
The null hypothesis typically corresponds to a general or default position. The practice of science involves formulating and testing hypotheses, assertions that are capable of being proven false using a test of observed data. The null hypothesis can never be proven. A set of data can only reject a null hypothesis or fail to reject it.
binomial probability
binomial distribution
- p = probability that event will occur
- q = probability that event won't occur
- p and q are complementary = p + q = 1
- n = number of trials
- k = number of successes
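Using the notation above, the binomial pmf can be computed directly; a minimal Python sketch (the coin example is illustrative):

```python
from math import comb

def binom_pmf(k, n, p):
    """P(exactly k successes in n trials); q = 1 - p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# e.g. probability of exactly 2 heads in 4 fair-coin flips
prob = binom_pmf(2, 4, 0.5)  # 6 * 0.25 * 0.25 = 0.375
```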
Binomial approximation
- standard score = how many standard deviations an observation is above or below the mean.
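A sketch of the standard score, applied to the normal approximation of a binomial (mean n·p, standard deviation √(n·p·q)); the numbers are illustrative:

```python
from math import sqrt

def z_score(x, mean, sd):
    """Standard score: how many standard deviations x is above/below the mean."""
    return (x - mean) / sd

# Normal approximation to Binomial(n=100, p=0.5): mean = 50, sd = 5
n, p = 100, 0.5
q = 1 - p
z = z_score(60, n * p, sqrt(n * p * q))  # (60 - 50) / 5 = 2.0
```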
Tests for Categorical Data
- Goodness of fit for single categorical variable
- compare observed counts to the expected counts; the "contribution term" for each category is (observed − expected)² / expected
- summing the contributions gives the relative distance of the observed from the expected
- Get p-values from the chi-squared distribution with k−1 degrees of freedom, where k = number of categories (i.e., classes)
- If the null hypothesis is true, observed is close to expected
- The test statistic has a chi-squared distribution.
proc freq data=<whatevs>; table var1 / chisq; table var2 / chisq testp=(values); /* or testf=(values) */ run;
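The page does this in SAS; as a pure-Python sketch of the arithmetic only (made-up counts, and 3.841 is the usual chi-squared critical value for 1 degree of freedom at alpha = 0.05):

```python
observed = [44, 56]   # counts in k = 2 categories (illustrative)
expected = [50, 50]   # counts implied by the null hypothesis

# per-category contribution terms: (O - E)^2 / E
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi2 = sum(contributions)  # 36/50 + 36/50 = 1.44

# compare against the chi-squared critical value with k - 1 = 1 df at 0.05
reject = chi2 > 3.841      # here: fail to reject
```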
Tests for two-way variables
- test for homogeneity - distribution of proportions are the same across the populations
- test of independence - are the two variables independent of each other?
proc freq data=<whatevs>; table var1 / chisq fisher; table var2 / chisq cellchi2; run;
- Use Fisher's exact test if the sample size is small.
- R:
fisher.test(table)
- cellchi2 gives the cell contribution - how far the observed count is from the expected on a per-cell basis
- the weight statement indicates the variable that holds the cell counts for the table
T-Test
- "Student's t-distribution"
- When data are normally distributed
- Can test hypotheses about the mean/center of the distribution
One-sample t-test
- Test is mean greater than/less than/equal to some value
- SAS proc means
Two-Sample t-test
- Test whether two population means are equal.
- Unpaired or independent samples t-test: Are the variances the same?
- If no, it's called a "two-sample t-test", "unequal variances t-test", or "Welch's t-test"
- If yes it's called a "pooled t-test" or "Student's t-test"
- F-statistic tests whether the variances are equal
- Paired or repeated measurements t-test - the observations before and after are subtracted; is the mean difference different from zero?
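The paired t statistic is just a one-sample t-test on the differences; a pure-Python sketch (the before/after measurements are invented for illustration):

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(before, after):
    """t statistic for paired observations: is the mean difference zero?"""
    diffs = [b - a for b, a in zip(before, after)]
    n = len(diffs)
    # t = mean(d) / (sd(d) / sqrt(n)), with the sample standard deviation
    return mean(diffs) / (stdev(diffs) / sqrt(n))

before = [120, 118, 125, 130, 122]
after  = [115, 117, 120, 126, 119]
t = paired_t(before, after)   # roughly 4.81 for these numbers
```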
Nonparametric tests
- Hypothesis testing when you can't assume data comes from normal distribution
- a lot of non-parametric approaches are based on ranks
- do not depend on normality
- Whereas the other tests are really tests for means, nonparametric tests are actually tests for medians
One-sample tests
- SAS proc univariate for these
- Sign test
- The sign test is necessarily one-sample, so if you give the function call two samples, it will assume it's a paired dataset
- PAIRED observations with test x > y, x = y, or x < y.
- Test for consistent differences between pairs, such as the weight of subjects pre and post treatment.
- Does one member of the pair tend to be greater than the other?
- Does NOT assume symmetric distribution of the differences around the median
- Does NOT use the magnitude of the difference
- Wilcoxon Signed Ranks Test
- A quantitative Sign Test
- DOES use magnitude of difference of paired observations
- Confidence interval based on signed rank test
- what are the set of values for which you wouldn't have rejected the null hypothesis
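The sign test described above only uses the signs of the paired differences, so it reduces to a binomial calculation; a pure-Python sketch (the paired data are invented):

```python
from math import comb

def sign_test_p(before, after):
    """Two-sided sign test on paired data: under H0, each nonzero
    difference is positive with probability 1/2 (binomial)."""
    diffs = [b - a for b, a in zip(before, after) if b != a]  # drop ties
    n = len(diffs)
    k = sum(d > 0 for d in diffs)
    # double the tail probability of the rarer sign
    tail = min(k, n - k)
    p = 2 * sum(comb(n, i) * 0.5 ** n for i in range(tail + 1))
    return min(p, 1.0)

before = [3, 5, 2, 8, 6, 9, 4, 7]
after  = [1, 4, 1, 6, 5, 8, 5, 6]   # 7 positive diffs, 1 negative
p = sign_test_p(before, after)      # 2 * (C(8,0) + C(8,1)) / 2^8
```

The Wilcoxon signed-rank test extends this by also ranking the magnitudes of the differences.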
Two or more sample nonparametric tests
- Compare the centers of two or more distributions that are continuous, but not normal/gaussian.
- use deviations from the median and use the signed ranks
- SAS:
proc npar1way wilcoxon
- Class variable is used for the two or more groups
- Otherwise use
proc npar1way anova
- Wilcoxon Rank Sum Test/Mann-Whitney U statistic
- Null hypothesis: it is equally likely that a randomly selected value from one sample will be less than or greater than a randomly selected value from a second sample.
- Nonparametric equivalent of the unequal-variances t-test
- R:
wilcox.test
- Intermix the observations, sort, and rank all observations. Then take mean rank for both populations.
- Can also do confidence interval
- Kruskal-Wallis
- Non-parametric method for testing whether samples originate from the same distribution
- equivalent to One-way ANOVA
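The rank-sum procedure described above (intermix, sort, rank, sum) in a pure-Python sketch; it assumes no ties, and the samples are illustrative:

```python
def rank_sum(x, y):
    """Pool the samples, sort, rank, and sum the ranks of sample x
    (assumes no tied values, for simplicity)."""
    pooled = sorted(list(x) + list(y))
    ranks = {v: i + 1 for i, v in enumerate(pooled)}
    return sum(ranks[v] for v in x)

x = [1.1, 2.5, 3.0]
y = [2.0, 4.2, 5.7]
W = rank_sum(x, y)                    # ranks of x are 1, 3, 4 -> W = 8
U = W - len(x) * (len(x) + 1) // 2    # Mann-Whitney U from the rank sum
```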
Goodness of fit for continuous distributions
One-sample
- empirical cumulative distribution function, compare to theoretical
- R:
ecdf(data)
- Kolmogorov-Smirnov
- Not quite as good, because the test statistic only uses the maximum deviation between the empirical and theoretical CDFs
- Do not estimate the parameters of the theoretical distribution from the same data (doing so invalidates the test's p-values)
- R:
ks.test(x, y="name")
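The page uses R's ks.test; the D statistic itself (the maximum gap between the empirical and theoretical CDFs) is easy to sketch in pure Python. The sample and the Uniform(0, 1) CDF are illustrative:

```python
def ks_statistic(data, cdf):
    """One-sample Kolmogorov-Smirnov D statistic."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        # the ECDF jumps from i/n to (i+1)/n at x; check both sides
        d = max(d, abs((i + 1) / n - cdf(x)), abs(cdf(x) - i / n))
    return d

# compare a small sample against the Uniform(0, 1) CDF, F(x) = x
D = ks_statistic([0.1, 0.2, 0.5, 0.9], lambda x: x)   # 0.3 here
```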
Two-Sample
- Could have two distributions with the same mean but different shapes.
- R:
ks.test(X, Y)
Estimating Parameter Values
- R: MASS package,
fitdistr(data, densfun="exponential")
- obtain maximum likelihood estimate
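For the exponential case named above, the maximum likelihood estimate has a closed form (rate = 1 / sample mean), so fitdistr's answer can be sketched directly; the data are illustrative:

```python
from statistics import mean

def exp_mle_rate(data):
    """MLE of the exponential rate lambda: 1 over the sample mean."""
    return 1.0 / mean(data)

rate = exp_mle_rate([0.5, 1.0, 1.5, 2.0])  # mean 1.25 -> rate 0.8
```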
Kernel Smoothing Density Function
- Matlab function
- [f,xi,u] = ksdensity(x)
- Computes a probability density estimate of the sample in the vector x
- f is the vector of density values evaluated at the points xi.
- u is the width of the kernel-smoothing window, which is calculated automatically.
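ksdensity is a Matlab function; the underlying idea (average a Gaussian kernel centered at each data point) can be sketched in pure Python with a fixed, user-supplied bandwidth (ksdensity chooses one automatically):

```python
from math import exp, pi, sqrt

def kde(x_eval, data, bandwidth):
    """Gaussian kernel density estimate evaluated at the points x_eval."""
    n = len(data)
    def gauss(u):
        return exp(-0.5 * u * u) / sqrt(2 * pi)
    return [sum(gauss((x - d) / bandwidth) for d in data) / (n * bandwidth)
            for x in x_eval]

# illustrative sample and evaluation points
f = kde([0.0, 1.0], data=[-1.0, 0.0, 1.0], bandwidth=1.0)
```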
Linear Discriminant Analysis
Receiver operating characteristic curve
Linear Regression
- Linear Regression - Wikipedia article
- y = X beta + epsilon
- y = the regressand, dependent variable.
- X = the design matrix. x sub i are regressors
- each x sub i has a corresponding coefficient beta sub i; the constant term beta 0 is the intercept
- beta = a p-dimensional parameter vector of regression coefficients. In the case of a line, beta1 is the slope and beta0 is the y-intercept
- DISTURBANCE TERM - epsilon - an unobserved random variable that adds noise to the linear relationship between the dependent variable and regressors.
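For the one-regressor case (a line), the least-squares solution of y = Xβ + ε has the familiar closed form; a pure-Python sketch with noiseless illustrative data:

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = beta0 + beta1 * x:
    beta1 = cov(x, y) / var(x), beta0 = ybar - beta1 * xbar."""
    xbar, ybar = mean(xs), mean(ys)
    beta1 = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
             / sum((x - xbar) ** 2 for x in xs))
    beta0 = ybar - beta1 * xbar
    return beta0, beta1

# noiseless data on the line y = 1 + 2x recovers the coefficients
b0, b1 = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```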
Regression Diagnostics
- multicollinearity -> VIF
- heteroscedasticity -> Scale-Location or Residual vs fitted
- Outliers -> Residuals vs Leverage or Leverage vs Cook's D
- Non-linearity -> Residual vs fitted
- Residual distribution -> Q-Q Plot
- Understanding Regression Diagnostic Plots
- R: Use ggfortify::autoplot
Eigenvector & Eigenvalue
Maximum Likelihood Estimate
Set theory symbols
- Set theory symbols
- ∅ : \varnothing, empty set
- ∣ : \mid, satisfies the condition
- ∪ : \cup, union
- ∩ : \cap, intersection
- ∖ : \setminus, set difference
- △ : \triangle, symmetric difference
- ∈ : \in - left side element is in right side set
- ⋅ : \cdot, dot product, vector and matrix multiplication, scalar result
- × : \times, cross product of vectors
- ⊗ : \otimes, Kronecker (outer) product of tensor (matrix)
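Most of the set operations above map directly onto Python's built-in set operators, which makes a quick reference easy to check:

```python
A, B = {1, 2, 3}, {3, 4}

union        = A | B    # A ∪ B -> {1, 2, 3, 4}
intersection = A & B    # A ∩ B -> {3}
difference   = A - B    # A ∖ B -> {1, 2}
sym_diff     = A ^ B    # A △ B -> {1, 2, 4}
member       = 2 in A   # 2 ∈ A -> True
```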
Bayes
Factor Analysis
- number of variables too large
- deviations or variation that is of most interest
- reduce number of variables
- consider linear combinations of the variables
- keep the combos with large variance
- discard the ones with small variance
- latent variables explain the correlation between outcome variables
- interpretability of factors is sometimes suspect
- Used for exploratory data analysis
- >10 obs per variable
- Group variables into factors such that the variables are highly correlated
- Use PCA to examine latent common factors (1st method)
Principal Component Analysis
- Replace the original observed random variables with uncorrelated linear combinations, resulting in minimum loss of information.
- factor loadings, which represent the weight of each original variable in each component
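For two variables, PCA reduces to the eigenvalues of a 2x2 covariance matrix, which the quadratic formula gives by hand; the eigenvalues are the variances of the uncorrelated components. A pure-Python sketch with illustrative, perfectly correlated data:

```python
from math import sqrt
from statistics import mean

def pca_2d(xs, ys):
    """Eigenvalues of the 2x2 sample covariance matrix, largest first."""
    n = len(xs)
    xbar, ybar = mean(xs), mean(ys)
    sxx = sum((x - xbar) ** 2 for x in xs) / (n - 1)
    syy = sum((y - ybar) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (n - 1)
    # eigenvalues of [[sxx, sxy], [sxy, syy]] via the quadratic formula
    tr, det = sxx + syy, sxx * syy - sxy * sxy
    disc = sqrt(tr * tr / 4 - det)
    return tr / 2 + disc, tr / 2 - disc

# y = 2x exactly, so all the variance lies on the first component
lam1, lam2 = pca_2d([1, 2, 3, 4], [2, 4, 6, 8])
```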