Statistics

General

  • degrees of freedom = the number of values in the final calculation that are free to vary.
  • residuals = for each observation, the residual is the difference between that observation and the mean of all the observations.
    • the sum of the residuals is necessarily 0.
  • probability mass function = PMF is for DISCRETE random variables


Basic Probability

Joint probability (intersection)

  • The probability of both events happening together.
  • The joint probability of A and B is written P(A ∩ B), P(AB), or P(A, B)
  • LaTeX set intersection sign \cap - Python set operator & (ampersand)
  • Joint event = depends on classes from two different variables
  • Joint probability distribution for categorical variables - list out in a table; all numbers sum to 1. Marginal tallies are sums of the joint probabilities, ignoring one of the variables.
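
A minimal R sketch of such a table (counts and the sex/smoker variable names are made up):

 counts <- matrix(c(32, 12, 18, 38), nrow = 2,
                  dimnames = list(sex = c("F", "M"), smoker = c("yes", "no")))
 joint <- prop.table(counts)   # joint probabilities; all four cells sum to 1
 addmargins(joint)             # marginal tallies = row/column sums of the joint probs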

Marginal probability

  • The unconditional probability of an event - essentially the opposite of conditional probability.
  • If there are two possible outcomes for X with corresponding events B and B', this means that P(A) = P(A ∩ B) + P(A ∩ B').
  • In the joint probability table, the marginal distributions are the column and row totals

Union of two events

  • P(A ∪ B) = P(A) + P(B) - P(A ∩ B)

Conditional probability

  • The probability of some event A, given the occurrence of some other event B.
  • Conditional probability is written P(A|B) = P(A ∩ B) / P(B)
  • Joint probability divided by the marginal probability
  • Conditional probability distribution - adds up to 1
    • Compare with marginal probability distribution
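
A one-line R illustration (hypothetical joint counts): prop.table with a margin argument turns a count table into conditional distributions:

 counts <- matrix(c(32, 12, 18, 38), nrow = 2)   # hypothetical joint counts
 prop.table(counts, margin = 1)                  # P(column | row): each row sums to 1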

Variable Independence

  • If independent, then P(A|B) = P(A)
  • Imposing the condition B doesn't affect the probability of A at all.
  • In a sample, we would expect the two probabilities to differ slightly even under independence
  • If independent, then the joint probability/intersection P(A ∩ B) = P(A)P(B)

Error bars

null hypothesis

The null hypothesis typically corresponds to a general or default position. The practice of science involves formulating and testing hypotheses, assertions that are capable of being proven false using a test of observed data. The null hypothesis can never be proven. A set of data can only reject a null hypothesis or fail to reject it.

Probability Distributions

  • Notation convention
    • Capital roman letter - a random variable
    • Lowercase roman letter - a possible value it might take
    • Unknown values (parameters) are represented with Greek letters rather than Roman
  • Probability mass function - gives the probability of different outcomes of the random variable
    • PMF for discrete and probability density function (PDF) if continuous. Can view everything as a density, though.

Discrete

Bernoulli

  • Only 2 possible outcomes, success or failure
  • X ~ Bernoulli(p): P(X=1) = p, P(X=0) = 1-p
  • f(X=x|p) = f(x|p) = p^x * (1-p)^(1-x) for x in {0, 1}
  • Expected value E[X] = sum over x of x*P(X=x) = (1)p + (0)(1-p) = p
  • Var(X)=p(1-p)
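
A quick R simulation check of the mean and variance (p = 0.3 is an arbitrary choice):

 p <- 0.3
 x <- rbinom(1e5, size = 1, prob = p)   # Bernoulli = binomial with n = 1
 mean(x)   # close to p
 var(x)    # close to p * (1 - p)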

Binomial

  • Binomial = Generalization of Bernoulli where you have N repeated trials. The sum of N independent Bernoullis
  • X ~ Bin( n, p )
  • n = number of trials
  • P( X=x | p ) = f(x|p) = (n choose x) * p^x * (1-p) ^ (n-x)
    • n choose x is the number of combinations = n! / (x! * (n-x)!)
  • Expected value E[X] = np
  • Var(X)=np(1-p)
  • Binomial normal approximation uses the standard score - how many standard deviations an observation is above or below the mean: z = (x - np) / sqrt(np(1-p))
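
A short R comparison of the exact binomial probability against its normal approximation (n and p arbitrary):

 n <- 100; p <- 0.5
 pbinom(60, n, p)                                # exact P(X <= 60)
 pnorm(60.5, mean = n*p, sd = sqrt(n*p*(1-p)))   # normal approx. with continuity correction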

Continuous

  • Integral from -inf to inf = 1: "The probability that something happens = 1"
  • f(x) >= 0: Densities are non-negative for all possible values of x
  • E[X] = Integral from -inf to inf x f(x) dx
    • Analogous to the sum we have in discrete variables
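
R's integrate can verify both properties numerically, e.g. for the standard normal density:

 integrate(function(x) x * dnorm(x), -Inf, Inf)   # E[X] = 0 for the standard normal
 integrate(dnorm, -Inf, Inf)                      # any density integrates to 1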

Uniform

  • f(x) = 1/(b-a) times the indicator function for the condition that x is on the interval [a, b]
  • Integrate to ascertain the probability between two given values
  • The probability that X equals any single given value is vanishingly small: integrating from x to x gives 0
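
For example, in R for X ~ Uniform(0, 1):

 punif(0.8) - punif(0.3)                      # P(0.3 < X < 0.8) = 0.5
 integrate(dunif, lower = 0.3, upper = 0.8)   # same answer by integrating the density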

Exponential

  • What a shitty misnomer
  • E.g, a bus that comes every 10 minutes, the exponential is your waiting time
  • Rate parameter λ
  • Events that occur at a particular rate, and the exponential is the waiting time between events
  • f(x|λ) = λe^(-λx) for x >= 0
  • E[X] = 1/λ
  • Var[X] = 1/λ^2
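
A quick R simulation of the bus example (rate chosen to match a 10-minute schedule):

 lambda <- 1/10                  # bus arrives at rate 1 per 10 minutes
 w <- rexp(1e5, rate = lambda)   # simulated waiting times
 mean(w)                         # close to 1/lambda = 10
 var(w)                          # close to 1/lambda^2 = 100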

Normal

  • f(x|μ, σ²) = (1/√(2πσ²)) e^(-(x-μ)²/(2σ²))
  • E[X] = μ
  • Var[X] = σ²

Gamma

pass

Beta

pass

Tests for Categorical Data

  • Goodness of fit for a single categorical variable
    • Compare observed counts to the expected counts; each cell's "contribution term" is (O - E)² / E
    • The test statistic sums the contributions - the relative distance of the observed from the expected
    • Get p-values from the chi-squared distribution with k-1 degrees of freedom, where k = number of categories (i.e., classes)
    • If the null hypothesis is true, the observed counts are close to the expected counts
    • Test statistic has a chi-squared distribution.
 proc freq data=<whatevs>;
   tables var1 / chisq;                  /* equal expected proportions by default */
   tables var2 / chisq testp=(values);   /* or testf=(values) for expected frequencies */
 run;
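
A rough R analogue of the SAS test above (counts and proportions are made up):

 observed <- c(45, 30, 25)                    # hypothetical counts in k = 3 classes
 chisq.test(observed, p = c(0.5, 0.3, 0.2))   # chi-squared GOF, k - 1 = 2 df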

Tests for two-way variables

  • test for homogeneity - whether the distribution of proportions is the same across the populations
  • test of independence - whether the two categorical variables are independent in a single population
 proc freq data=<whatevs>;
   tables var1*var2 / chisq cellchi2;   /* cellchi2 prints per-cell contributions */
   exact fisher;                        /* Fisher's exact test */
 run;
  • Use Fisher's exact test if the sample size is small.
    • R: fisher.test(table)
  • cellchi2 is the cell contribution - how far the observed is from the expected on a per-cell basis
  • The weight statement names the count variable when the data in the table are already tallied
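
A rough R analogue (hypothetical 2x2 counts):

 tab <- matrix(c(12, 5, 7, 16), nrow = 2)   # hypothetical 2x2 counts
 chisq.test(tab)                            # Pearson test of independence
 chisq.test(tab)$residuals^2                # per-cell contributions (cf. cellchi2)
 fisher.test(tab)                           # exact test for small samples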

T-Test

  • "Student's t-distribution"
  • When data are normally distributed
  • Can test hypotheses about the mean/center of the distribution

One-sample t-test

  • Test whether the mean is greater than/less than/equal to some value
  • SAS proc means

Two-Sample t-test

  • Test whether two population means are equal.
  • Unpaired or independent samples t-test: are the variances the same?
    • If no, it's called an "unequal variances t-test" or "Welch's t-test"
    • If yes, it's called a "pooled t-test" or "Student's t-test"
    • The F-statistic tests whether the variances are equal
  • Paired or repeated measurements t-test - the before and after observations are subtracted; is the mean difference different from zero?
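
A sketch of all the variants in R (x and y are simulated, not real data):

 x <- rnorm(30, mean = 5); y <- rnorm(30, mean = 5.5)   # hypothetical samples
 t.test(x, mu = 5)                # one-sample: is the mean 5?
 t.test(x, y)                     # Welch / unequal variances (R's default)
 t.test(x, y, var.equal = TRUE)   # pooled / Student's
 t.test(x, y, paired = TRUE)      # paired: is the mean difference zero?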

Nonparametric tests

  • Hypothesis testing when you can't assume data comes from normal distribution
  • a lot of non-parametric approaches are based on ranks
  • do not depend on normality
  • Whereas the other tests are really tests for means, nonparametric tests are actually tests for medians

One-sample tests

  • SAS proc univariate for these
  • Sign test
    • The sign test is necessarily one-sample, so if you give the function call two samples, it will assume it's a paired dataset
    • PAIRED observations with test x > y, x = y, or x < y.
    • Test for consistent differences between pairs, such as the weight of subjects pre and post treatment.
    • Does one member of the pair tend to be greater than the other?
    • Does NOT assume symmetric distribution of the differences around the median
    • Does NOT use the magnitude of the difference
  • Wilcoxon Signed Ranks Test
    • A quantitative Sign Test
    • DOES use magnitude of difference of paired observations
  • Confidence interval based on signed rank test
    • What is the set of values for which you wouldn't have rejected the null hypothesis?
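
A sketch in R (pre/post weights are made-up numbers); the sign test can be run through binom.test:

 pre  <- c(210, 195, 188, 220, 201, 199, 230, 215)        # hypothetical weights, before
 post <- c(200, 192, 190, 210, 195, 200, 221, 208)        # and after treatment
 d <- post - pre
 binom.test(sum(d > 0), sum(d != 0))                      # sign test: uses only the signs
 wilcox.test(post, pre, paired = TRUE, conf.int = TRUE)   # signed ranks, with CI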

Two or more sample nonparametric tests

  • Compare the centers of two or more distributions that are continuous, but not normal/gaussian.
  • Use deviations from the median and use the signed ranks
  • SAS: proc npar1way wilcoxon
    • Class variable is used for the two or more groups
    • Otherwise use proc npar1way anova
  • Wilcoxon Rank Sum Test/Mann-Whitney U statistic
    • Null hypothesis: it is equally likely that a randomly selected value from one sample will be less than or greater than a randomly selected value from a second sample.
    • The nonparametric counterpart of the unpaired (unequal variances) t-test
    • R: wilcox.test
    • Intermix the observations, sort, and rank all observations. Then take mean rank for both populations.
    • Can also do confidence interval
  • Kruskal-Wallis
    • Non-parametric method for testing whether samples originate from the same distribution
    • The nonparametric equivalent of one-way ANOVA
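
A sketch in R (simulated skewed samples):

 x <- rexp(25); y <- rexp(25, rate = 0.5)   # hypothetical non-normal samples
 wilcox.test(x, y, conf.int = TRUE)         # rank sum / Mann-Whitney, with CI
 g <- factor(rep(c("a", "b", "c"), each = 20))
 v <- rexp(60)
 kruskal.test(v ~ g)                        # two or more groups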

Goodness of fit for continuous distributions

one sample

  • empirical cumulative distribution function, compare to theoretical
    • R: ecdf(data)
  • Kolmogorov-Smirnov
    • Not quite as good, because it only uses the maximum deviation between the empirical and theoretical CDFs (the D statistic)
  • Do not estimate the parameters from the data - the test assumes the distribution is fully specified in advance
  • R: ks.test(x, y="name"), where y names the theoretical CDF (e.g. "pexp")
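
A sketch in R matching the bullets above (sample and rate are made up):

 x <- rexp(200, rate = 2)                            # hypothetical sample
 plot(ecdf(x))                                       # empirical CDF
 curve(pexp(x, rate = 2), add = TRUE, col = "red")   # theoretical CDF
 ks.test(x, "pexp", 2)                               # rate given in advance, not estimated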

Two-Sample

  • Could have two distributions with the same mean but different shapes.
  • R: ks.test(X, Y)
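
For example, in R, two simulated distributions with the same mean but different shapes:

 x <- rnorm(200)       # mean 0, symmetric
 y <- rexp(200) - 1    # mean 0, skewed
 ks.test(x, y)         # two-sample KS compares the whole distributions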

Estimating Parameter Values

  • R: MASS package, fitdistr(data, densfun="exponential")
    • obtain maximum likelihood estimate
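
A quick check in R (simulated data with a known rate):

 library(MASS)
 x <- rexp(1000, rate = 2)              # hypothetical data, true rate = 2
 fitdistr(x, densfun = "exponential")   # MLE of the rate (= 1 / mean(x))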

Kernel Smoothing Density Function

  • Matlab function
  • [f,xi,u] = ksdensity(x)
  • Computes a probability density estimate of the sample in the vector x
  • f is the vector of density values evaluated at the points xi.
  • u is the width of the kernel-smoothing window, which is calculated automatically.
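
For reference, R's closest analogue is stats::density (an R aside, not the Matlab call):

 x <- c(rnorm(100), rnorm(100, mean = 4))   # hypothetical bimodal sample
 d <- density(x)                            # d$x ~ xi, d$y ~ f, d$bw ~ u
 plot(d)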

Linear Discriminant Analysis

Receiver operating characteristic curve

Linear Regression

  • Linear Regression - Wikipedia article
  • y = X beta + epsilon
  • y = the regressand, dependent variable.
  • X = the design matrix. x sub i are regressors
  • each x sub i has a corresponding coefficient beta sub i; beta sub 0, the coefficient with no regressor, is the intercept
  • beta = a p-dimensional parameter vector called regression coefficients. In case of line, beta1 is slope and beta0 is y-intercept
  • DISTURBANCE TERM - epsilon - an unobserved random variable that adds noise to the linear relationship between the dependent variable and regressors.
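
A minimal R illustration (simulated data with known coefficients):

 x <- runif(50, 0, 10)
 y <- 2 + 3*x + rnorm(50)   # beta0 = 2 (intercept), beta1 = 3 (slope), plus noise
 fit <- lm(y ~ x)
 coef(fit)                  # estimated beta0 and beta1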

Regression Diagnostics

  • multicollinearity -> VIF
  • heteroscedasticity -> Scale-Location or Residual vs fitted
  • Outliers -> Residuals vs Leverage or Leverage vs Cook's D
  • Non-linearity -> Residual vs fitted
  • Residual distribution -> Q-Q Plot
  • Understanding Regression Diagnostic Plots
  • R: Use ggfortify::autoplot
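
For example, in R (using the built-in cars dataset):

 fit <- lm(dist ~ speed, data = cars)    # built-in dataset
 plot(fit)                               # base R: residuals vs fitted, Q-Q, scale-location, leverage
 # library(ggfortify); autoplot(fit)     # ggplot2 versions of the same plots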

Eigenvectors & Eigenvalues

Eigen values and eigenvectors


Maximum Likelihood Estimate

Mixed and Multilevel Models

Set theory symbols

  • Set theory symbols
  • ∅ : \varnothing, empty set
  • | : \mid, satisfies the condition
  • ∪ : \cup, union
  • ∩ : \cap, intersection
  • ∖ : \setminus, set difference
  • △ : \triangle, symmetric difference
  • ∈ : \in - left side element is in right side set
  • ⋅ : \cdot, dot product, vector and matrix multiplication, scalar result
  • × : \times, cross product of vectors
  • ⊗ : \otimes, Kronecker (outer) product of tensors (matrices)

Bayes

Factor Analysis

  • number of variables too large
  • deviations or variation that is of most interest
  • reduce number of variables
  • consider linear combinations of the variables
  • keep the combos with large variance
  • discard the ones with small variance
  • latent variables explain the correlation between outcome variables
  • interpretability of factors is sometimes suspect
  • Used for exploratory data analysis
  • >10 obs per variable
  • Group variables into factors such that the variables are highly correlated
  • Use PCA to examine latent common factors (1st method)
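
A minimal R sketch (simulated data with one latent factor; real data would be substituted here):

 f <- rnorm(200)                                  # one latent factor
 dat <- sapply(1:4, function(i) f + rnorm(200))   # four observed variables
 factanal(dat, factors = 1)                       # loadings on the single factor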

Principal Component Analysis

  • Replace the original observed random variables with uncorrelated linear combinations, resulting in minimum loss of information.
  • Factor loadings represent the weight of each original variable in each component.
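
A minimal R sketch (built-in USArrests dataset):

 pc <- prcomp(USArrests, scale. = TRUE)   # standardize, then extract components
 summary(pc)                              # proportion of variance per component
 pc$rotation                              # loadings: weight of each original variable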