# Maximum Likelihood Estimation


## General

• Obtain an estimate for an unknown parameter theta using the data that we obtained from our sample.
• Choose a value of theta that maximizes the likelihood function of getting the data we observed.
• Maximum likelihood principle
• Likelihood function is viewed as a function of the parameters theta, and the data parameter y are considered fixed. Thus when we take derivative of the log-likelihood function, it is with respect to theta only.
• Joint probability mass function: if the observations are independent, the joint PMF (or PDF) is just the product of the individual observations' PMFs (or PDFs).
• ${\displaystyle L(\theta )=\prod _{i=1}^{n}f(x_{i};\theta )}$ (General formulation)
• The sample mean ${\displaystyle {\bar {x}}}$ is the MLE in several common models when the data are i.i.d.
• For Bernoulli(p), Poisson(lambda), and Normal(mu, sigma^2), ${\displaystyle {\bar {x}}}$ is the MLE for p, lambda, and mu respectively
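A minimal sketch of the last point, using made-up data: for i.i.d. Bernoulli and Poisson samples, the sample mean is the MLE of the mean parameter (p and lambda respectively).

```python
# Hypothetical data for illustration only.
bernoulli_data = [1, 0, 1, 1, 0, 1, 0, 1]   # 0/1 outcomes
poisson_data = [2, 0, 3, 1, 4, 2]           # counts

def sample_mean(xs):
    """Sample mean x-bar: the MLE for p (Bernoulli), lambda (Poisson), mu (Normal)."""
    return sum(xs) / len(xs)

p_hat = sample_mean(bernoulli_data)    # MLE of p
lam_hat = sample_mean(poisson_data)    # MLE of lambda
```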

## Bernoulli Distribution

• E.g., what is the estimate of mortality rate at a given hospital? Say each patient comes from a Bernoulli distribution
• ${\displaystyle Y_{i}\sim B(\theta )}$, where theta is the unknown parameter (hence the Greek letter)
• ${\displaystyle P(Y_{i}=1)=\theta }$ for a single given person
• ${\displaystyle P(\mathbf {Y} =\mathbf {y} |\theta )=P(Y_{1}=y_{1},Y_{2}=y_{2},...,Y_{n}=y_{n}|\theta )}$ using vector form (using bold for vector notation)
• ${\displaystyle P(\mathbf {Y} =\mathbf {y} |\theta )=P(Y_{1}=y_{1}|\theta )\cdots P(Y_{n}=y_{n}|\theta )=\prod _{i=1}^{n}P(Y_{i}=y_{i}|\theta )}$ because they are independent
• ${\displaystyle P(\mathbf {Y} =\mathbf {y} |\theta )=\prod _{i=1}^{n}\theta ^{y_{i}}(1-\theta )^{1-y_{i}}}$ using what we know from Bernoulli distributions
• "The probability of observing the actual data we collected, conditioned on the value of the parameter theta."
• Concept of likelihood implies thinking about this density function as a function of theta
• ${\displaystyle L(\theta |\mathbf {y} )=\prod _{i=1}^{n}\theta ^{y_{i}}(1-\theta )^{1-y_{i}}}$
• The two expressions look identical, but the one above is a function of y given theta, whereas the likelihood is a function of theta given y. It is no longer a probability distribution, but it is still a function of theta.
• ${\displaystyle {\textrm {MLE}}:{\hat {\theta }}={\underset {\theta }{\textrm {argmax}}}\,L(\theta |\mathbf {y} )}$
• To estimate theta, choose the value of theta that gives the largest value of the likelihood, i.e., the value that makes the particular data we observed most likely to occur.
• ${\displaystyle l(\theta )={\textrm {log}}L(\theta |\mathbf {y} )}$
• Since logarithm is a monotonic function, if we maximize logarithm of the function, we also maximize the original function
• Can drop "condition on y" notation here
• ${\displaystyle l(\theta )={\textrm {log}}\left[\prod \theta ^{y_{i}}(1-\theta )^{1-y_{i}}\right]=\sum {\textrm {log}}\left[\theta ^{y_{i}}(1-\theta )^{1-y_{i}}\right]=\sum \left[y_{i}{\textrm {log}}\theta +(1-y_{i}){\textrm {log}}(1-\theta )\right]}$
• ${\displaystyle l(\theta )=\left(\sum y_{i}\right){\textrm {log}}\theta +\left(\sum (1-y_{i})\right){\textrm {log}}(1-\theta )}$
• ${\displaystyle l'(\theta )={\frac {1}{\theta }}\sum y_{i}-{\frac {1}{1-\theta }}\sum (1-y_{i})=0}$
• Here we take the derivative and set it equal to 0.
• ${\displaystyle 0={\frac {\sum y_{i}}{\hat {\theta }}}-{\frac {\sum (1-y_{i})}{1-{\hat {\theta }}}}}$
• The hat implies parameter estimate
• ${\displaystyle {\hat {\theta }}={\frac {1}{n}}\sum y_{i}}$
• Approximate 95% CI: ${\displaystyle {\hat {\theta }}\pm 1.96{\sqrt {\frac {{\hat {\theta }}(1-{\hat {\theta }})}{n}}}}$
• ${\displaystyle {\hat {\theta }}\sim N\left(\theta ,{\frac {1}{I({\hat {\theta }})}}\right)}$ approximately, for large samples
• Under certain regularity conditions, we can say that the MLE is approximately normally distributed with mean at the true value of theta, and variance of one over the Fisher Information evaluated at theta hat
• Fisher Information is a measure of how much information about theta is in each data point. It's a function of theta
• For Bernoulli random variable ${\displaystyle I_{\textrm {Bernoulli}}(\theta )={\frac {1}{\theta (1-\theta )}}}$
• Information is larger when theta is near zero or one, and it's smallest when theta is near one half.
• "This makes sense, because if you're flipping a coin, and you're getting a mix of heads and tails, that tells you a little bit less than if you're getting nearly all heads or nearly all tails. That's a lot more informative about the value of theta. "

## Exponential Distribution

• Suppose we have samples from an exponential distribution with parameter lambda:
• ${\displaystyle X_{i}\sim {\textrm {Exp}}(\lambda )}$, assuming i.i.d.
• ${\displaystyle f(\mathbf {x} |\lambda )=\prod _{i=1}^{n}\lambda e^{-\lambda x_{i}}=\lambda ^{n}e^{-\lambda \sum x_{i}}}$
• Step 1: state the density function
• ${\displaystyle L(\lambda |\mathbf {x} )=\lambda ^{n}e^{-\lambda \sum x_{i}}}$
• Step 2: turn it into a (non-log) likelihood function
• ${\displaystyle l(\lambda )=n\,{\textrm {log}}\,\lambda -\lambda \sum x_{i}}$
• Take log likelihood and drop "conditioned on x" notation
• ${\displaystyle l'(\lambda )={\frac {n}{\lambda }}-\sum x_{i}=0}$
• Take the derivative and set it equal to 0
• ${\displaystyle {\hat {\lambda }}={\frac {n}{\sum x_{i}}}={\frac {1}{\bar {x}}}}$
• MLE for lambda is 1 over sample average, which makes sense because the mean for an exponential distribution is 1 over lambda
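A similar numerical sketch for the exponential case, with hypothetical waiting-time data: the grid maximizer of l(lambda) agrees with the closed-form MLE 1/x-bar.

```python
import math

def exp_loglik(lam, xs):
    """l(lambda) = n log(lambda) - lambda * sum(x_i)."""
    return len(xs) * math.log(lam) - lam * sum(xs)

xs = [0.5, 1.2, 0.3, 2.0, 0.8]  # hypothetical waiting times

# Closed-form MLE: lambda-hat = n / sum(x_i) = 1 / x-bar.
lam_hat = len(xs) / sum(xs)

# Grid search confirms the closed-form answer (up to grid resolution).
grid = [i / 100 for i in range(1, 500)]
lam_grid = max(grid, key=lambda l: exp_loglik(l, xs))
```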