Difference between revisions of "Statistics"

From Colettapedia

Revision as of 18:46, 16 May 2020

General

  • degrees of freedom = the number of values in the final calculation that are free to vary.
  • residuals = for each observation, the residual is the difference between that observation and the average of all the observations.
    • the sum of the residuals is necessarily 0.
  • probability mass function = PMF, used for DISCRETE random variables


Basic Probability

Joint probability (intersection)

  • The probability of both events happening together.
  • The joint probability of A and B is written $P(A \cap B)$, P(AB) or P(A, B)
  • latex set intersection sign \cap - Python operator & ampersand
  • Joint event = depends on classes from two different variables
  • Joint probability distribution for categorical variables - list out in a table; all numbers sum to 1. The marginal tallies (row and column totals) are sums of joint probabilities over one variable, ignoring the other.

Marginal probability

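  • The probability of an event regardless of the outcome of the other variable; essentially the opposite of conditional probability.
  • If there are two possible outcomes for X with corresponding events B and B', this means that $P(A) = P(A \cap B) + P(A \cap B')$.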
  • column rank and row rank

Union of two events

Conditional probability

  • The probability of some event A, given the occurrence of some other event B.
  • Conditional probability is written $P(A \mid B)$
  • Joint probability divided by the marginal probability: $P(A \mid B) = P(A \cap B) / P(B)$
  • Conditional probability distribution - adds up to 1
    • Compare with marginal probability distribution

Variable Independence

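  • If independent then $P(A \mid B) = P(A)$
  • Imposing the condition B doesn't affect the probability of A at all.
  • In a sample, we would expect the two estimated probabilities to differ slightly anyway, even when the variables are independent
  • If independent then the joint probability/intersection/upside-down U is the product of the marginals: $P(A \cap B) = P(A) P(B)$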
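
The joint, marginal, and conditional definitions above can be sketched in a few lines of Python. The table values and the variable names (`rain`, `late`, and so on) are made up for illustration and are not from these notes.

```python
# Hypothetical joint distribution over two categorical variables.
# Keys are (weather, arrival); all entries sum to 1.
joint = {
    ("rain", "late"): 0.15,
    ("rain", "on_time"): 0.15,
    ("dry", "late"): 0.07,
    ("dry", "on_time"): 0.63,
}

def marginal(table, index, value):
    """Sum the joint probabilities over the other variable."""
    return sum(p for key, p in table.items() if key[index] == value)

def conditional(table, arrival, weather):
    """P(arrival | weather) = joint probability / marginal probability."""
    return table[(weather, arrival)] / marginal(table, 0, weather)

p_rain = marginal(joint, 0, "rain")
p_late = marginal(joint, 1, "late")
p_late_given_rain = conditional(joint, "late", "rain")

# Independence would require P(rain and late) == P(rain) * P(late).
independent = abs(joint[("rain", "late")] - p_rain * p_late) < 1e-9

print(p_rain, p_late, p_late_given_rain, independent)
```

In this made-up table the conditional probability of being late given rain differs from the marginal probability of being late, so the two variables are not independent.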

Error bars

null hypothesis

The null hypothesis typically corresponds to a general or default position. The practice of science involves formulating and testing hypotheses, assertions that are capable of being proven false using a test of observed data. The null hypothesis can never be proven. A set of data can only reject a null hypothesis or fail to reject it.

Probability Distributions

  • Notation convention
    • Capital roman letter - a random variable
    • Lowercase roman letter - a possible value it might take
    • Unknown values represented with greek letters rather than roman
  • Probability mass function - gives the probability of different outcomes of the random variable
    • PMF for discrete and probability density function (PDF) if continuous. Can view everything as a density, though.

Discrete Distributions

Bernoulli

  • Only 2 possible outcomes, success or failure
  • , ,
    • Expected value

Binomial

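  • Binomial = generalization of Bernoulli to n repeated trials: the sum of n independent Bernoullis with the same p
    • $P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x}$ for x = 0, 1, ..., n
    • n = number of trials
      • "n choose x" is the binomial coefficient, $\binom{n}{x} = \frac{n!}{x!(n - x)!}$
    • Expected value $E[X] = np$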
  • Binomial approximation standard score - how many standard deviations an observation is above or below the mean.
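
As a quick check of the binomial formulas, here is a minimal Python sketch. The helper name `binom_pmf` and the choice n = 5, p = 0.5 are assumptions made for illustration.

```python
from math import comb

def binom_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

n, p = 5, 0.5  # illustrative values
pmf = [binom_pmf(x, n, p) for x in range(n + 1)]

total = sum(pmf)                              # a PMF sums to 1
mean = sum(x * f for x, f in enumerate(pmf))  # matches n * p

print(pmf, total, mean)
```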

Geometric

  • The number of trials needed to observe the first success

Multinomial

  • Generalizes Bernoulli and binomial to more than two possible outcomes

Poisson

  • Used for counts
  • parameter is the rate at which we expect to observe the thing we are counting

Continuous Distributions

  • $\int_{-\infty}^{\infty} f(x)\,dx = 1$: "The probability that something happens = 1"
  • $f(x) \ge 0$: Densities are non-negative for all possible values of x
  • $E[X] = \int_{-\infty}^{\infty} x f(x)\,dx$
    • Analogous to the sum we have in discrete variables

Uniform

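  • f(x) = indicator function whose condition is that x is on the interval (for the unit interval [0, 1]; for a general interval, divide by the interval length)
  • Integrate the density to get the probability between two given values
  • The probability that X equals any single given value is 0: integrating from x to x gives 0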

Exponential

  • E.g., if a bus comes on average every 10 minutes, your waiting time is exponential with rate 1/10 per minute
  • Rate parameter $\lambda$
  • Events that occur at a particular rate, and the exponential is the waiting time between events
  • $f(x \mid \lambda) = \lambda e^{-\lambda x}$ for x >= 0

Normal

Standard Normal
Parameterized Normal with mu and sigma

t distribution

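  • Use if you don't know the true value of sigma. Replacing it with the sample standard deviation makes the standardized mean follow a t distribution, which has heavier tails than the normal
  • Its density is written in terms of the gamma function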

Gamma

  • Total waiting time for several events to occur, i.e., the sum of more than one independent random variable with the same exponential distribution.

Beta

  • Used for random variables which take on values between 0 and 1. Commonly used to model probabilities.

Cumulative Distribution Function

  • CDF exists for every distribution
  • It's convenient for calculating probabilities of intervals, e.g., P( -1 < X < 1 )
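    • $F(x) = P(X \le x) = \sum_{t \le x} f(t)$ - for discrete distributions, where f(t) is the PMF
    • $F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt$ - for continuous distributions, where f(t) is the probability density function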

R functions

Normal distribution

  • dnorm( x, mean, sd ) - Evaluate the PDF at x for the given mean and sd
  • pnorm( q, mean, sd ) - Evaluate the CDF at q
  • qnorm( p, mean, sd ) - Evaluate the quantile function at p
  • rnorm( n, mean, sd ) - Generate n pseudo-random samples from the normal distribution.
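
For readers working outside R, Python's standard library has rough counterparts to these four functions in `statistics.NormalDist`. This is a sketch of the correspondence, not part of the original notes. The evaluation points and the seed are arbitrary.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

density_at_0 = std_normal.pdf(0.0)      # like dnorm(0, 0, 1)
prob_below = std_normal.cdf(1.96)       # like pnorm(1.96, 0, 1)
quantile = std_normal.inv_cdf(0.975)    # like qnorm(0.975, 0, 1)
draws = std_normal.samples(5, seed=42)  # like rnorm(5, 0, 1)

# The CDF makes interval probabilities easy: P(-1 < X < 1)
interval_prob = std_normal.cdf(1) - std_normal.cdf(-1)

print(density_at_0, prob_below, quantile, interval_prob)
```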

Various distributions

  • dbinom(x, size, prob)
  • dpois(x, lambda)
  • dexp(x, rate)
  • dgamma(x, shape, rate)
  • dunif(x, min, max)
  • dbeta(x, shape1, shape2)
  • dnorm(x, mean, sd)
  • dt(x, df) where df is the degrees of freedom


Frequentist approach to statistical inference (SI)

  • View the data as a RANDOM sample from some larger population
  • reference population - the larger supergroup of instances to which you're trying to generalize based on the sample

Confidence Intervals

  • Central Limit Theorem says the sum (and hence the mean) of all the xi's will follow approximately a Gaussian distribution when the sample is large
  • 95% of the time we get a result within 1.96 standard deviations of the mean.
  • Use mean and standard deviation to define confidence intervals for a given level of confidence
  • We're 95% confident that the true population-wide mean is on that interval
  • E.g., "It's plausible (supported by the data) that the coin is fair because 0.5 is on the interval."
  • Frequentist interpretation: imagine an infinite hypothetical sequence in which we repeat this trial, each time creating a confidence interval based on the data we observe. On average, 95% of the intervals we make will contain the true value of p.
  • But what about THIS interval? Does this interval contain the true value of p? What's the probability that this interval contains the true p? From the frequentist perspective, the probability that p is on the interval is either 0 or 1
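
A minimal sketch of the 95% interval for a coin's probability of heads, using the normal approximation described above. The data (44 heads in 100 flips) and the function name `proportion_ci` are hypothetical.

```python
from math import sqrt

def proportion_ci(successes, n, z=1.96):
    """Normal-approximation confidence interval for a proportion."""
    p_hat = successes / n
    se = sqrt(p_hat * (1 - p_hat) / n)  # standard error of p_hat
    return p_hat - z * se, p_hat + z * se

# Hypothetical data: 44 heads in 100 flips.
low, high = proportion_ci(44, 100)

# If 0.5 is on the interval, a fair coin is plausible given the data.
contains_fair = low <= 0.5 <= high

print(round(low, 3), round(high, 3), contains_fair)
```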

Bernoulli Example

  • Two coins, a loaded coin with P( heads ) = 0.7, and fair coin
  • What is the probability that the coin is loaded after you observe 5 trials?
  • Which coin do you think it is, and how sure are you about that?
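    • Theta is the unknown parameter: $\theta \in \{\text{fair}, \text{loaded}\}$
    • The data are 5 flips, $X \sim \text{Bin}(5, ?)$, and the question is what that probability of heads is.
    • $f(x \mid \theta) = \binom{5}{x} (0.5)^5 \, I_{\{\theta = \text{fair}\}} + \binom{5}{x} (0.7)^x (0.3)^{5 - x} \, I_{\{\theta = \text{loaded}\}}$ - this is the likelihood function using indicator (step) function notation
  • Say we observe two heads X=2, what's our likelihood (using f as notation instead of l)
    • plug in X=2, get 0.3125 if $\theta = \text{fair}$ and 0.1323 if $\theta = \text{loaded}$
    • Having observed two heads, we can say that the likelihood is higher for theta = fair than theta = loaded.
    • Maximum likelihood estimate $\hat{\theta} = \text{fair}$ - given the data, it's most likely that the coin is fair
  • This is a point estimate, but how do we answer the question "how sure are you?"
  • Another question: what is $P(\theta = \text{fair} \mid X = 2)$?
    • In the frequentist paradigm, the coin is a physical object: it's a fixed coin, and therefore has a fixed probability of coming up heads
    • In this interpretation $P(\theta = \text{fair} \mid X = 2) = P(\theta = \text{fair})$, and this probability is either 0 or 1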
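
The two likelihood values quoted in this example (0.3125 and 0.1323) can be reproduced with a short Python sketch. The dictionary and function names are illustrative only.

```python
from math import comb

# P(heads) under each hypothesis about the coin.
P_HEADS = {"fair": 0.5, "loaded": 0.7}

def likelihood(x, theta, n=5):
    """Binomial likelihood of x heads in n flips for the given coin."""
    p = P_HEADS[theta]
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# Observe X = 2 heads in 5 flips.
like = {theta: likelihood(2, theta) for theta in P_HEADS}

# The maximum likelihood estimate is the hypothesis with the larger likelihood.
mle = max(like, key=like.get)

print(like, mle)
```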

Bayesian Inference

  • An advantage of the Bayesian approach is that it allows you to easily incorporate prior information, when you know something in advance of looking at the data.
  • Under the Bayesian paradigm, we're explicit that this is a subjective and personal approach. But we can also be explicit about what all of our assumptions are, and then see how our answers depend on our assumptions.
  • Contrast with frequentist paradigm
    • The Frequentist approach has a number of buried assumptions.
    • You can't get a good confidence interval. What would it even mean to have a confidence interval for whether or not the coin is loaded?
    • The choice of our reference population is inherently subjective anyway.
    • The choice of likelihood is also subjective.
    • There's an attempt to pretend that everything is objective

Bernoulli Example

  • What is the probability that the coin is loaded before and after you observe 5 trials?
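  • Prior: $P(\text{loaded}) = 0.6$, so $P(\text{fair}) = 0.4$
    • $f(\theta \mid x) = \frac{f(x \mid \theta) f(\theta)}{\sum_{\theta} f(x \mid \theta) f(\theta)}$ is the updated posterior
    • Here we state likelihood function as f(x) rather than l(x)
    • Denominator is the sum over all possibilities of theta, of which there are only two: $\theta \in \{\text{fair}, \text{loaded}\}$. The denominator is the normalizing constant that makes the posterior probabilities sum to 1.
    • Multiply the first (fair) likelihood term by the prior probability that the coin is fair, i.e., 0.4, and the second (loaded) term by the prior probability that the coin is loaded, i.e., 0.6.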
  • Various posteriors given various priors:
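
A sketch of the posterior calculation for X = 2 heads in 5 flips, assuming the fair prior of 0.4 from the example (so 0.6 for loaded). The other two priors, 0.5 and 0.9, are arbitrary values chosen only to show how the answer depends on the prior. The function name is hypothetical.

```python
from math import comb

P_HEADS = {"fair": 0.5, "loaded": 0.7}

def posterior_loaded(x, prior_loaded, n=5):
    """P(theta = loaded | X = x) by Bayes' theorem over the two coins."""
    prior = {"fair": 1 - prior_loaded, "loaded": prior_loaded}
    # Numerator terms: likelihood times prior for each hypothesis.
    weighted = {
        theta: comb(n, x) * p**x * (1 - p) ** (n - x) * prior[theta]
        for theta, p in P_HEADS.items()
    }
    # The denominator normalizes so the posterior probabilities sum to 1.
    return weighted["loaded"] / sum(weighted.values())

# 0.6 follows the example; 0.5 and 0.9 are illustrative alternatives.
for prior_loaded in (0.5, 0.6, 0.9):
    print(prior_loaded, round(posterior_loaded(2, prior_loaded), 3))
```

With the 0.6 prior the posterior probability of a loaded coin comes out near 0.388, so the posterior probability of a fair coin is near 0.612.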

Tests for Categorical Data

  • Goodness of fit for single categorical variable
    • Compare observed counts to expected counts, with a "contribution term" $(O_i - E_i)^2 / E_i$ for each category
    • The contribution terms measure the relative distance of the observed counts from the expected counts
    • The test statistic $\chi^2 = \sum_i (O_i - E_i)^2 / E_i$ has a chi-squared distribution under the null hypothesis
    • Get p-values from the chi-squared distribution with k-1 degrees of freedom, where k = number of categories (i.e., classes)
    • If the null hypothesis is true, observed is close to expected
 proc freq data=<whatevs>;
   tables var1 / chisq;                 /* null: all categories equally likely */
   tables var2 / chisq testp=(values);  /* null proportions; use testf=(values) for null frequencies */
 run;
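
The goodness-of-fit statistic can also be computed by hand. The following Python sketch uses made-up counts for four categories and a null hypothesis of equal proportions. It computes only the statistic and degrees of freedom, since the standard library has no chi-squared CDF for the p-value.

```python
def chi_square_gof(observed, expected_props):
    """Return (statistic, per-category contributions, degrees of freedom)."""
    n = sum(observed)
    expected = [n * p for p in expected_props]
    # Contribution term for each category: (observed - expected)^2 / expected
    contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
    return sum(contributions), contributions, len(observed) - 1

# Hypothetical counts for k = 4 categories, null: equal proportions.
observed = [18, 22, 20, 40]
stat, contributions, df = chi_square_gof(observed, [0.25] * 4)

print(stat, contributions, df)
```

The statistic would then be compared with the chi-squared distribution with `df` degrees of freedom, for example with `pchisq(stat, df, lower.tail = FALSE)` in R.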

Tests for two-way variables

  • test for homogeneity - the distribution of proportions is the same across the populations
  • test of independence - the two categorical variables, measured on a single population, are unrelated
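 proc freq data=<whatevs>;
   tables var1*var2 / chisq exact;     /* exact (or fisher) requests Fisher's exact test */
   tables var1*var2 / chisq cellchi2;  /* cellchi2 prints each cell's chi-square contribution */
 run;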
  • Use Fisher's exact test if the sample size is small.
    • R: fisher.test(table)
  • cellchi2 is the cell contribution - how far the observed is from the expected on a per-cell basis
  • the weight statement names the variable that holds the cell counts when the data are already tabulated

T-Test

  • "Student's t-distribution"
  • When data are normally distributed
  • Can test hypotheses about the mean/center of the distribution

One-sample t-test

  • Tests whether the mean is greater than, less than, or equal to some value
  • SAS proc means

Two-Sample t-test

  • Test whether two population means are equal.
  • Unpaired or independent samples t-test: Are the variances the same?
    • If no, it's called a "two-sample t-test", an "unequal variances t-test", or "Welch's t-test"
    • If yes, it's called a "pooled t-test" or "Student's t-test"
    • An F-statistic tests whether the variances are equal
  • Paired or repeated measurements t-test - the observations before and after are subtracted; is the mean difference different from zero?
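
The three variants differ only in how the standard error and degrees of freedom are computed. A minimal Python sketch follows. The function names and sample data are made up, and p-values are left out because the standard library has no t distribution.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Unequal-variances (Welch) t statistic and approximate df."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va**2 / (len(a) - 1) + vb**2 / (len(b) - 1))
    return t, df

def pooled_t(a, b):
    """Pooled (Student) t statistic, assuming equal variances."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(sp2 * (1 / na + 1 / nb)), na + nb - 2

def paired_t(before, after):
    """Paired t statistic: one-sample test that the mean difference is 0."""
    d = [y - x for x, y in zip(before, after)]
    return mean(d) / sqrt(variance(d) / len(d)), len(d) - 1

# Hypothetical samples.
a = [5.1, 4.9, 5.6, 5.8, 6.0]
b = [4.1, 4.5, 4.0, 4.8, 4.3]
print(welch_t(a, b), pooled_t(a, b), paired_t(b, a))
```

With equal sample sizes the Welch and pooled statistics coincide and only the degrees of freedom differ.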

Nonparametric tests

  • Hypothesis testing when you can't assume the data come from a normal distribution
  • A lot of nonparametric approaches are based on ranks
  • They do not depend on normality
  • Whereas the other tests are really tests for means, nonparametric tests are typically tests for medians

One-sample tests

  • SAS proc univariate for these
  • Sign test
    • The sign test is necessarily a one-sample test, so if you give the function call two samples, it will assume they form a paired dataset
    • PAIRED observations with test x > y, x = y, or x < y.
    • Test for consistent differences between pairs, such as the weight of subjects pre and post treatment.
    • Does one member of the pair tend to be greater than the other?
    • Does NOT assume symmetric distribution of the differences around the median
    • Does NOT use the magnitude of the difference
  • Wilcoxon Signed Ranks Test
    • A quantitative Sign Test
    • DOES use magnitude of difference of paired observations
  • Confidence interval based on the signed rank test
    • The set of hypothesized values for which you wouldn't have rejected the null hypothesis
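
A sketch of the sign test on paired data: it counts how many differences are positive and compares that count with a Binomial(n, 0.5) distribution. The weights below are invented, and dropping tied pairs is a common convention assumed here rather than something stated in these notes.

```python
from math import comb

def sign_test(before, after):
    """Two-sided exact sign test on paired observations."""
    # Only the sign of each difference is used, not its magnitude.
    diffs = [a - b for b, a in zip(before, after) if a != b]  # ties dropped
    n = len(diffs)
    k = sum(d > 0 for d in diffs)
    # Probability of a count at least this extreme under Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(min(k, n - k) + 1)) / 2**n
    return k, n, min(1.0, 2 * tail)

# Hypothetical weights of 8 subjects pre and post treatment.
before = [70, 72, 68, 75, 80, 77, 74, 69]
after = [68, 70, 69, 71, 76, 75, 70, 66]

k, n, p_value = sign_test(before, after)
print(k, n, p_value)
```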

Two or more sample nonparametric tests

  • Compare the centers of two or more distributions that are continuous, but not normal/gaussian.
  • Use deviations from the median and use the signed ranks
  • SAS: proc npar1way wilcoxon
    • Class variable is used for the two or more groups
    • Otherwise use proc npar1way anova
  • Wilcoxon Rank Sum Test/Mann-Whitney U statistic
    • Null hypothesis: it is equally likely that a randomly selected value from one sample will be less than or greater than a randomly selected value from a second sample.
    • Nonparametric counterpart of the two-sample t-test
    • R: wilcox.test
    • Intermix the observations, sort, and rank all observations. Then take mean rank for both populations.
    • Can also do confidence interval
  • Kruskal-Wallis
    • Non-parametric method for testing whether samples originate from the same distribution
    • Nonparametric counterpart of one-way ANOVA
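
The rank-sum procedure described above (pool the observations, sort, rank, then summarize the ranks per group) can be sketched as follows. Giving tied values the average of their ranks is a standard convention assumed here. The data are made up.

```python
def rank_sum(x, y):
    """Return (rank sum of x, Mann-Whitney U for x, mean rank x, mean rank y)."""
    # Pool both samples, remembering which group each value came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = avg
        i = j + 1
    w_x = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    w_y = sum(r for r, (_, g) in zip(ranks, pooled) if g == 1)
    u_x = w_x - len(x) * (len(x) + 1) / 2
    return w_x, u_x, w_x / len(x), w_y / len(y)

# Hypothetical samples.
print(rank_sum([1.0, 3.0, 5.0], [2.0, 4.0, 6.0]))
```

The p-value would come from the null distribution of the statistic, which is what `wilcox.test` in R provides.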

Goodness of fit for continuous distributions

one sample

  • empirical cumulative distribution function, compare to theoretical
    • R: ecdf(data)
  • Kolmogorov-Smirnov
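    • Not quite as good, because it uses only the maximum distance between the empirical and theoretical CDFs
  • The parameters of the theoretical distribution must be specified in advance; do not estimate them from the same data
  • R: ks.test(x, y="name"), where "name" is a CDF function such as "pnorm"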
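
A minimal sketch of the empirical CDF and the one-sample Kolmogorov-Smirnov statistic, the largest vertical gap between the empirical and theoretical CDFs. The sample values are made up, the helper names are illustrative, and no p-value is computed.

```python
from statistics import NormalDist

def ecdf(data):
    """Return the empirical CDF of the data as a function of t."""
    xs = sorted(data)
    n = len(xs)
    return lambda t: sum(x <= t for x in xs) / n

def ks_statistic(data, cdf):
    """Largest gap between the empirical CDF and a fully specified CDF."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The ECDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

# Hypothetical sample compared with a standard normal.
sample = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.5]
print(ks_statistic(sample, NormalDist(0, 1).cdf))
print(ecdf(sample)(0.0))
```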

Two-Sample

  • Could have two distributions with the same mean but different shapes.
  • R: ks.test(X, Y)

Estimating Parameter Values

  • R: MASS package, fitdistr(data, densfun="exponential")
    • obtain maximum likelihood estimate

Kernel Smoothing Density Function

  • Matlab function
  • [f,xi,u] = ksdensity(x)
  • Computes a probability density estimate of the sample in the vector x
  • f is the vector of density values evaluated at the points xi.
  • u is the width of the kernel-smoothing window (the bandwidth), which is calculated from the data.

Linear Discriminant Analysis

Receiver operating characteristic curve

Linear Regression

  • Linear Regression - Wikipedia article
  • y = X beta + epsilon
  • y = the regressand, dependent variable.
  • X = the design matrix. x sub i are regressors
  • each x sub i has a corresponding beta sub i called its regression coefficient (the intercept is the coefficient beta0 of the constant term)
  • beta = a p-dimensional parameter vector called regression coefficients. In case of line, beta1 is slope and beta0 is y-intercept
  • DISTURBANCE TERM - epsilon - an unobserved random variable that adds noise to the linear relationship between the dependent variable and regressors.
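
For the single-regressor case, the least-squares estimates of beta0 and beta1 have a closed form. This Python sketch uses invented data. In R the same fit would come from `lm(y ~ x)`.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for y = beta0 + beta1 * x + epsilon."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta1 = sxy / sxx               # slope
    beta0 = y_bar - beta1 * x_bar   # y-intercept
    residuals = [yi - (beta0 + beta1 * xi) for xi, yi in zip(x, y)]
    return beta0, beta1, residuals

# Hypothetical data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
beta0, beta1, residuals = fit_line(x, y)

print(beta0, beta1, sum(residuals))
```

The residuals from a least-squares fit with an intercept sum to zero up to rounding error.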

Regression Diagnostics

  • multicollinearity -> VIF
  • heteroscedasticity -> Scale-Location or Residual vs fitted
  • Outliers -> Residuals vs Leverage or Leverage vs Cook's D
  • Non-linearity -> Residual vs fitted
  • Residual distribution -> Q-Q Plot
  • Understanding Regression Diagnostic Plots
  • R: Use ggfortify::autoplot

Eigen vector & Eigen Value

Eigen values and eigenvectors


Mixed and Multilevel Models

Mixed and Multilevel Models

Set theory symbols

  • Set theory symbols
  • ∅ : \varnothing, empty set
  • ∣ : \mid, satisfies the condition
  • ∪ : \cup, union
  • ∩ : \cap, intersection
  • ∖ : \setminus, set difference
  • △ : \triangle, symmetric difference
  • ∈ : \in - left side element is in right side set
  • ⋅ : \cdot, dot product, vector and matrix multiplication, scalar result
  • × : \times, cross product of vectors
  • ⊗ : \otimes, Kronecker (outer) product of tensor (matrix)

Bayesian Statistics

Maximum Likelihood Estimation

Factor Analysis

  • number of variables too large
  • deviations or variation that is of most interest
  • reduce number of variables
  • consider linear combinations of the variables
  • keep the combos with large variance
  • discard the ones with small variance
  • latent variables explain the correlation between outcome variables
  • interpretability of factors is sometimes suspect
  • Used for exploratory data analysis
  • >10 obs per variable
  • Group variables into factors such that the variables are highly correlated
  • Use PCA to examine latent common factors (1st method)

Principal Component Analysis

  • Replace the original observed random variables with uncorrelated linear combinations, resulting in minimum loss of information.
  • factor loadings which represent