Maximum Likelihood Estimation

General

Obtain an estimate for an unknown parameter theta using the data that we obtained from our sample.
Maximum likelihood principle: choose the value of theta that maximizes the likelihood of the data we observed.
The likelihood function is viewed as a function of the parameter theta, and the data y are considered fixed. Thus when we take the derivative of the log-likelihood function, it is with respect to theta only.
Joint probability mass (or density) function: if the observations are independent, you can just multiply the PMFs (or PDFs) of the individual observations.
- $L(\theta )=\prod _{i=1}^{n}f(x_{i};\theta )$ (General formulation)
Sample mean ${\bar {x}}$ is involved in the MLE calculation for several models when the data are i.i.d.
- For Bernoulli(p), Poisson(lambda), or Normal(mu, sigma^2), x bar is the MLE for p, lambda, and mu, respectively
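
The claim that x bar is the MLE in all three models can be checked numerically. The following is a minimal sketch, not a definitive implementation: the data sets are made up for illustration, the Normal variance is fixed at 1 as an assumption, and `grid_argmax` is a hypothetical helper that simply evaluates the log-likelihood on a fine grid.

```python
import math

def bernoulli_loglik(p, xs):
    # log of prod p^x (1-p)^(1-x)
    return sum(x * math.log(p) + (1 - x) * math.log(1 - p) for x in xs)

def poisson_loglik(lam, xs):
    # log of prod lam^x e^(-lam) / x!
    return sum(x * math.log(lam) - lam - math.lgamma(x + 1) for x in xs)

def normal_loglik(mu, xs, sigma2=1.0):
    # variance held fixed (assumption); only mu is estimated here
    return sum(-0.5 * math.log(2 * math.pi * sigma2)
               - (x - mu) ** 2 / (2 * sigma2) for x in xs)

def grid_argmax(loglik, xs, lo, hi, steps=10000):
    # hypothetical helper: brute-force search over interior grid points
    grid = [lo + (hi - lo) * k / steps for k in range(1, steps)]
    return max(grid, key=lambda t: loglik(t, xs))

# Illustrative (made-up) samples
bern = [1, 0, 0, 1, 1, 0, 1, 1]
pois = [2, 0, 3, 1, 4, 2]
norm = [1.2, -0.4, 0.9, 2.3, 0.5]

p_hat = grid_argmax(bernoulli_loglik, bern, 0.0, 1.0)
lam_hat = grid_argmax(poisson_loglik, pois, 0.0, 10.0)
mu_hat = grid_argmax(normal_loglik, norm, -5.0, 5.0)

for name, est, xs in [("p", p_hat, bern), ("lambda", lam_hat, pois), ("mu", mu_hat, norm)]:
    print(name, "grid MLE:", round(est, 3), "sample mean:", round(sum(xs) / len(xs), 3))
```

In each case the grid maximizer should agree with the sample mean up to the grid resolution.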

E.g., what is the estimate of the mortality rate at a given hospital? Model each patient's outcome as a draw from a Bernoulli distribution
$Y_{i}\sim B(\theta )$ , where theta is the unknown parameter (written with a Greek letter because it is unknown)

$P(\mathbf {Y} =\mathbf {y} |\theta )=P(Y_{1}=y_{1},Y_{2}=y_{2},...,Y_{n}=y_{n}|\theta )$ using vector form (using bold for vector notation)

$P(\mathbf {Y} =\mathbf {y} |\theta )=P(Y_{1}=y_{1}|\theta )\cdots P(Y_{n}=y_{n}|\theta )=\prod _{i=1}^{n}P(Y_{i}=y_{i}|\theta )$ because the observations are independent

$P(\mathbf {Y} =\mathbf {y} |\theta )=\prod _{i=1}^{n}\theta ^{y_{i}}(1-\theta )^{1-y_{i}}$ using what we know from Bernoulli distributions
- "The probability of observing the actual data we collected, conditioned on the value of the parameter theta."
- Concept of likelihood implies thinking about this density function as a function of theta

$L(\theta |\mathbf {y} )=\prod _{i=1}^{n}\theta ^{y_{i}}(1-\theta )^{1-y_{i}}$
- The two functions look the same, but the one above is a function of y, given theta, whereas the likelihood is a function of theta, given y. It is no longer a probability distribution (it need not integrate to one over theta), but it is still a function of theta.
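
A small numerical check can make the distinction concrete. This is a sketch under assumed values (theta = 0.3, n = 3, and an observed vector (1, 0, 1) are all chosen only for illustration): summing the joint probability over every possible y gives one, while integrating the same expression over theta for fixed y does not.

```python
from itertools import product

def joint_prob(ys, theta):
    # prod theta^y (1 - theta)^(1 - y)
    p = 1.0
    for y in ys:
        p *= theta ** y * (1 - theta) ** (1 - y)
    return p

# As a function of y (theta fixed): a probability distribution, sums to 1
theta = 0.3
n = 3
total_over_y = sum(joint_prob(ys, theta) for ys in product([0, 1], repeat=n))

# As a function of theta (y fixed): integrate over [0, 1] with the midpoint rule
ys_obs = (1, 0, 1)
steps = 100000
area_over_theta = sum(joint_prob(ys_obs, (k + 0.5) / steps) for k in range(steps)) / steps

print("sum over y:", total_over_y)
print("integral over theta:", area_over_theta)
```

For this particular y the likelihood is theta^2 (1 - theta), whose integral over [0, 1] is 1/12 rather than 1.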

$MLE:{\hat {\theta }}={\textrm {argmax}}L(\theta |\mathbf {y} )$
- To estimate theta, choose the theta that gives the largest value of the likelihood. This is the value of theta under which the data we observed are most likely to occur.

$l(\theta )={\textrm {log}}L(\theta |\mathbf {y} )$
- Since the logarithm is a monotonically increasing function, if we maximize the logarithm of the function, we also maximize the original function
- Can drop "condition on y" notation here
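
The monotonicity argument can be illustrated on a grid: the theta that maximizes the likelihood is the same theta that maximizes the log-likelihood. A minimal sketch, using a made-up Bernoulli sample with 3 ones out of 10:

```python
import math

ys = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]  # illustrative data: 3 ones out of 10

def likelihood(theta, ys):
    return math.prod(theta ** y * (1 - theta) ** (1 - y) for y in ys)

def log_likelihood(theta, ys):
    return sum(y * math.log(theta) + (1 - y) * math.log(1 - theta) for y in ys)

# Interior grid on (0, 1) so that log(theta) and log(1 - theta) are defined
grid = [k / 1000 for k in range(1, 1000)]
argmax_lik = max(grid, key=lambda t: likelihood(t, ys))
argmax_loglik = max(grid, key=lambda t: log_likelihood(t, ys))

print(argmax_lik, argmax_loglik)
```

Both searches should land on the same grid point, which here is the sample proportion.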

$l(\theta )={\textrm {log}}\left[\prod \theta ^{y_{i}}(1-\theta )^{1-y_{i}}\right]=\sum {\textrm {log}}\left[\theta ^{y_{i}}(1-\theta )^{1-y_{i}}\right]=\sum \left[y_{i}{\textrm {log}}\theta +(1-y_{i}){\textrm {log}}(1-\theta )\right]$

$l(\theta )=\left(\sum y_{i}\right){\textrm {log}}\theta +\left(\sum (1-y_{i})\right){\textrm {log}}(1-\theta )$

$l'(\theta )={\frac {1}{\theta }}\sum y_{i}-{\frac {1}{1-\theta }}\sum (1-y_{i})=0$
- Here we take derivative and set = 0.

$0={\frac {\sum y_{i}}{\hat {\theta }}}-{\frac {\sum (1-y_{i})}{1-{\hat {\theta }}}}$
- The hat implies parameter estimate

${\hat {\theta }}={\frac {1}{n}}\sum y_{i}$
Approximate 95% CI: ${\hat {\theta }}\pm 1.96{\sqrt {\frac {{\hat {\theta }}(1-{\hat {\theta }})}{n}}}$
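
The estimate and its approximate interval are straightforward to compute. The sketch below uses invented numbers (72 deaths among 400 patients) purely for illustration; `bernoulli_mle_ci` is a hypothetical helper name.

```python
import math

def bernoulli_mle_ci(ys, z=1.96):
    # MLE is the sample proportion; the interval uses the normal approximation
    n = len(ys)
    theta_hat = sum(ys) / n
    se = math.sqrt(theta_hat * (1 - theta_hat) / n)
    return theta_hat, (theta_hat - z * se, theta_hat + z * se)

# Made-up data: 1 = death, 0 = survival
ys = [1] * 72 + [0] * 328
theta_hat, (ci_low, ci_high) = bernoulli_mle_ci(ys)

print("theta_hat:", theta_hat)
print("approx. 95% CI:", (round(ci_low, 4), round(ci_high, 4)))
```

The normal approximation is only reasonable when n is fairly large and theta hat is not too close to 0 or 1.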

${\hat {\theta }}\sim N\left(\theta ,{\frac {1}{nI({\hat {\theta }})}}\right)$ (approximately)
- Under certain regularity conditions, the MLE is approximately normally distributed with mean at the true value of theta and variance equal to one over the total Fisher information evaluated at theta hat. With n i.i.d. observations and I(theta) denoting the information in a single observation, the total information is n I(theta).
- Fisher information is a measure of how much information about theta is in each data point. It is a function of theta.

For a Bernoulli random variable $I_{\textrm {Bernoulli}}(\theta )={\frac {1}{\theta (1-\theta )}}$
- Information is larger when theta is near zero or one, and it's smallest when theta is near one half.
- "This makes sense, because if you're flipping a coin, and you're getting a mix of heads and tails, that tells you a little bit less than if you're getting nearly all heads or nearly all tails. That's a lot more informative about the value of theta. "
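
A short sketch of how the Bernoulli information behaves. It assumes that I(theta) is the information in a single observation and that n i.i.d. observations carry total information n I(theta); the function names and the values theta hat = 0.18, n = 400 are illustrative only.

```python
import math

def bernoulli_information(theta):
    # Per-observation Fisher information for Bernoulli(theta)
    return 1.0 / (theta * (1.0 - theta))

def approx_std_error(theta_hat, n):
    # Assumes total information n * I(theta_hat) for n i.i.d. observations
    return 1.0 / math.sqrt(n * bernoulli_information(theta_hat))

info_half = bernoulli_information(0.5)    # smallest value on (0, 1)
info_tenth = bernoulli_information(0.1)   # larger near the boundary
info_ninety = bernoulli_information(0.9)  # symmetric around one half

# The information-based standard error matches sqrt(theta_hat (1 - theta_hat) / n)
se_from_info = approx_std_error(0.18, 400)
se_direct = math.sqrt(0.18 * 0.82 / 400)

print(info_half, info_tenth, info_ninety)
print(se_from_info, se_direct)
```

Under these assumptions the standard error derived from the information agrees with the one used in the approximate 95% confidence interval for theta.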

Suppose we have samples from an exponential distribution with parameter lambda:
- $X_{i}\sim {\textrm {Exp}}(\lambda )$ , assuming i.i.d.

$f(\mathbf {x} |\lambda )=\prod _{i=1}^{n}\lambda e^{-\lambda x_{i}}=\lambda ^{n}e^{-\lambda \sum x_{i}}$
- Step 1: state the density function

$L(\lambda |\mathbf {x} )=\lambda ^{n}e^{-\lambda \sum x_{i}}$
- Step 2: turn it into a (non-log) likelihood function

$l(\lambda )=n\,{\textrm {log}}\lambda -\lambda \sum x_{i}$
- Take log likelihood and drop "conditioned on x" notation

$l'(\lambda )={\frac {n}{\lambda }}-\sum x_{i}=0$
- Take the derivative and set = 0

${\hat {\lambda }}={\frac {n}{\sum x_{i}}}={\frac {1}{\bar {x}}}$
- MLE for lambda is 1 over sample average, which makes sense because the mean for an exponential distribution is 1 over lambda
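
The exponential derivation can be sketched in code as follows. The sample is made up for illustration, and the function names are hypothetical; the check confirms that the closed-form estimate zeroes the derivative and beats nearby values of lambda.

```python
import math

def exp_loglik(lam, xs):
    # l(lambda) = n log(lambda) - lambda * sum(x)
    return len(xs) * math.log(lam) - lam * sum(xs)

def exp_score(lam, xs):
    # l'(lambda) = n / lambda - sum(x)
    return len(xs) / lam - sum(xs)

def exp_mle(xs):
    # lambda_hat = n / sum(x) = 1 / sample mean
    return len(xs) / sum(xs)

xs = [0.5, 1.5, 2.0, 1.0, 3.0]  # illustrative data, sample mean 1.6
lam_hat = exp_mle(xs)

print("lambda_hat:", lam_hat)
print("score at lambda_hat:", exp_score(lam_hat, xs))
print("loglik at 0.60, lambda_hat, 0.65:",
      exp_loglik(0.60, xs), exp_loglik(lam_hat, xs), exp_loglik(0.65, xs))
```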