Maximum Likelihood Estimation


General

  • Obtain an estimate for an unknown parameter theta using the data that we obtained from our sample.
  • Choose the value of theta that maximizes the likelihood of getting the data we observed.
  • Maximum likelihood principle
  • The likelihood function is viewed as a function of the parameter theta, with the data y considered fixed. Thus, when we take the derivative of the log-likelihood function, it is with respect to theta only.
  • Joint probability mass function: if the observations are independent, you can just multiply the densities (PDFs or PMFs) of the individual observations.
    • <math>L(\theta)=\prod_{i=1}^n f(x_i; \theta)</math> (general formulation)
  • The sample mean <math>\bar{x}</math> is involved in the MLE calculation for several models if the data are i.i.d. (see the numerical sketch after this list).
    • For Bernoulli(p), Poisson(lambda), or Normal(mu, sigma^2) data, <math>\bar{x}</math> is the MLE for p, lambda, and mu, respectively.
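A minimal numerical sketch of these two points (assuming Python with numpy and scipy available; the Poisson example and all variable names are illustrative, not from the original notes): for i.i.d. data the log-likelihood is the sum of the individual log-densities, and its numerical maximizer agrees with the sample mean.

<syntaxhighlight lang="python">
# Minimal sketch: for i.i.d. data, log L(theta) = sum_i log f(x_i; theta),
# and for a Poisson(lambda) model the maximizer coincides with the sample mean.
# Assumes numpy and scipy; data are simulated for illustration only.
import numpy as np
from scipy import stats
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.poisson(lam=3.2, size=500)            # simulated i.i.d. Poisson data

def neg_log_likelihood(lam):
    # L(lambda) = prod_i f(x_i; lambda)  =>  log L = sum_i log f(x_i; lambda)
    return -np.sum(stats.poisson.logpmf(x, lam))

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 20), method="bounded")
print(res.x, x.mean())                        # numerical MLE vs. sample mean: nearly identical
</syntaxhighlight>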

Bernoulli Distribution

  • E.g., what is the estimate of the mortality rate at a given hospital? Say each patient's outcome comes from a Bernoulli distribution.
  • <math>Y_i \mid \theta \sim \text{Bernoulli}(\theta)</math>, where theta is the unknown parameter, therefore using a Greek letter
  • <math>P(Y_i = y_i \mid \theta) = \theta^{y_i}(1-\theta)^{1-y_i}</math> for a single given person
  • <math>P(\mathbf{Y} = \mathbf{y} \mid \theta)</math> using vector form (using bold for vector notation)
  • <math>= \prod_{i=1}^n P(Y_i = y_i \mid \theta)</math> because they are independent
  • <math>= \prod_{i=1}^n \theta^{y_i}(1-\theta)^{1-y_i} = \theta^{\sum y_i}(1-\theta)^{n - \sum y_i}</math> using what we know from Bernoulli distributions
    • "The probability of observing the actual data we collected, conditioned on the value of the parameter theta."
    • The concept of likelihood means thinking about this density function as a function of theta: <math>L(\theta \mid \mathbf{y}) = \prod_{i=1}^n \theta^{y_i}(1-\theta)^{1-y_i}</math>
    • The two functions look the same, but above it is a function of y given theta, whereas here the likelihood is a function of theta given y. It is no longer a probability distribution, but it is still a function of theta.
    • To estimate theta, choose the value of theta that gives the largest value of the likelihood, i.e., the value that makes the data we observed most likely to have occurred.
    • Since the logarithm is a monotonic function, maximizing the logarithm of the likelihood also maximizes the likelihood itself.
    • Can drop the "conditioned on y" notation here: <math>\ell(\theta) = \sum_{i=1}^n y_i \log\theta + \left(n - \sum_{i=1}^n y_i\right)\log(1-\theta)</math>
    • Take the derivative and set it equal to zero: <math>\ell'(\theta) = \frac{\sum y_i}{\theta} - \frac{n - \sum y_i}{1-\theta} = 0</math>
    • Solving gives <math>\hat{\theta} = \frac{1}{n}\sum_{i=1}^n y_i = \bar{y}</math>; the hat denotes a parameter estimate.
  • Approximate 95% confidence interval (see the numerical sketch after this list)
    • Under certain regularity conditions, the MLE is approximately normally distributed with mean at the true value of theta and variance equal to one over the Fisher information of the full sample, evaluated at <math>\hat{\theta}</math>.
    • Fisher Information is a measure of how much information about theta is in each data point. It's a function of theta
  • For a Bernoulli random variable, the Fisher information per observation is <math>I(\theta) = \frac{1}{\theta(1-\theta)}</math>
    • Information is larger when theta is near zero or one, and it's smallest when theta is near one half.
    • "This makes sense, because if you're flipping a coin, and you're getting a mix of heads and tails, that tells you a little bit less than if you're getting nearly all heads or nearly all tails. That's a lot more informative about the value of theta. "

Exponential Distribution

  • Suppose we have samples from an exponential distribution with parameter lambda:
    • <math>X_1, \ldots, X_n \sim \text{Exp}(\lambda)</math>, assuming i.i.d.
    • Step 1: state the density function: <math>f(x \mid \lambda) = \lambda e^{-\lambda x}</math> for <math>x \ge 0</math>
    • Step 2: turn it into a (non-log) likelihood function: <math>L(\lambda \mid \mathbf{x}) = \prod_{i=1}^n \lambda e^{-\lambda x_i} = \lambda^n e^{-\lambda \sum x_i}</math>
    • Take the log-likelihood and drop the "conditioned on x" notation: <math>\ell(\lambda) = n\log\lambda - \lambda \sum_{i=1}^n x_i</math>
    • Take the derivative and set it equal to zero: <math>\ell'(\lambda) = \frac{n}{\lambda} - \sum_{i=1}^n x_i = 0</math>
    • Solving gives <math>\hat{\lambda} = \frac{n}{\sum x_i} = \frac{1}{\bar{x}}</math>: the MLE for lambda is one over the sample average, which makes sense because the mean of an exponential distribution is <math>1/\lambda</math> (see the numerical sketch below).
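A minimal numerical sketch (assuming Python with numpy and scipy; simulated data and illustrative names): maximizing the exponential log-likelihood numerically recovers <math>\hat{\lambda} = 1/\bar{x}</math>.

<syntaxhighlight lang="python">
# Minimal sketch: the numerical MLE for Exp(lambda) matches 1 / sample mean.
# Assumes numpy and scipy; data are simulated for illustration only.
import numpy as np
from scipy import stats
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
lam_true = 0.7
x = rng.exponential(scale=1 / lam_true, size=1000)   # i.i.d. Exp(lambda) draws

def neg_log_likelihood(lam):
    # log L(lambda) = n*log(lambda) - lambda * sum(x_i)
    return -np.sum(stats.expon.logpdf(x, scale=1 / lam))

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 10), method="bounded")
print(res.x, 1 / x.mean())                            # the two estimates should agree closely
</syntaxhighlight>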