Maximum Likelihood Estimation
==General==
+ | |||
* Obtain an estimate for an unknown parameter theta using the data obtained from our sample.
* Choose the value of theta that maximizes the likelihood of observing the data we actually collected.
* Joint probability mass function: if the observations are independent, you can just multiply the PDFs of the individual observations.
** <math>L(\theta)=\prod_{i=1}^n f(x_i;\theta)</math> (general formulation)
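As a minimal numerical sketch of this general formulation (the helper names and data below are made up for illustration, not from the article): for i.i.d. observations, the likelihood is just the product of the individual densities evaluated at the data.

```python
import numpy as np

# Sketch: L(theta) = prod_i f(x_i; theta) for i.i.d. observations.
# `likelihood`, `bern`, and the data are illustrative stand-ins.
def likelihood(f, xs, theta):
    return np.prod([f(x, theta) for x in xs])

# Example density: the Bernoulli PMF f(y; theta) = theta^y * (1 - theta)^(1 - y)
def bern(y, theta):
    return theta ** y * (1 - theta) ** (1 - y)

L = likelihood(bern, [1, 0, 1, 1, 0], 0.6)  # = 0.6^3 * 0.4^2
```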
==Bernoulli Distribution==
* E.g., what is the estimate of the mortality rate at a given hospital? Say each patient's outcome comes from a Bernoulli distribution.
* <math>Y_i \sim B( \theta )</math>, where theta is the unknown parameter (hence the Greek letter)
* <math>P( Y_i = 1 ) = \theta </math> for a single given person
* <math> P( \mathbf{Y} = \mathbf{y} | \theta ) = P( Y_1 = y_1, Y_2=y_2, \ldots , Y_n= y_n | \theta )</math> using vector form (bold denotes a vector)
* <math> P( \mathbf{Y} = \mathbf{y} | \theta ) = P( Y_1 = y_1 | \theta ) \cdots P(Y_n= y_n | \theta ) = \prod_{i=1}^n P( Y_i = y_i | \theta )</math> because the observations are independent
* <math> P( \mathbf{Y} = \mathbf{y} | \theta ) = \prod_{i=1}^n \theta^{y_i} (1-\theta)^{1-y_i}</math> using what we know from Bernoulli distributions
** "The probability of observing the actual data we collected, conditioned on the value of the parameter theta."
** The concept of likelihood means thinking of this density as a function of theta
* <math>L( \theta | \mathbf{ y } ) = \prod_{i=1}^n \theta^{y_i} (1-\theta)^{1-y_i}</math>
** The two functions look identical, but the one above is a function of y given theta, while the likelihood is a function of theta given y. It is no longer a probability distribution, but it is still a function of theta.
* MLE: <math>\hat{ \theta } = \underset{\theta}{\operatorname{argmax}} \, L( \theta | \mathbf{ y } )</math>
** To estimate theta, choose the value of theta that gives the largest value of the likelihood, i.e., the value that makes the data we observed most likely to occur.
* <math> l ( \theta ) = \log L ( \theta | \mathbf{y} )</math>
** Since the logarithm is a monotonic function, maximizing the logarithm of the function also maximizes the original function
** We can drop the "conditioned on y" notation here
* <math> l ( \theta ) = \log \left[ \prod \theta^{y_i} (1-\theta)^{1-y_i} \right] = \sum \log \left[ \theta^{y_i} (1-\theta)^{1-y_i} \right] = \sum \left[ y_i \log \theta + (1-y_i) \log (1-\theta) \right] </math>
* <math> l ( \theta ) = \left( \sum y_i \right) \log \theta + \left( \sum (1-y_i) \right) \log (1-\theta)</math>
* <math> l '( \theta ) = \frac{1}{ \theta } \sum y_i - \frac{ 1 } {1- \theta } \sum (1-y_i) = 0</math>
** Take the derivative and set it equal to 0.
* <math>0 = \frac{ \sum y_i }{ \hat{ \theta } } - \frac{ \sum (1-y_i) }{ 1- \hat{ \theta } }</math>
** The hat denotes a parameter estimate
* <math>\hat{ \theta } = \frac{ 1 }{ n } \sum y_i </math>
* Approximate 95% CI: <math>\hat{ \theta } \pm 1.96 \sqrt{ \frac{ \hat{ \theta } (1- \hat{ \theta } ) } {n} } </math>
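A small numerical sketch of the closed-form results above: the MLE is the sample proportion, with the approximate 95% confidence interval built from it. The 0/1 outcomes below are made-up illustrative data (say, 1 = death, 0 = survival), not from the article.

```python
import numpy as np

# Made-up Bernoulli outcomes for n = 10 patients
y = np.array([0, 1, 0, 0, 1, 0, 0, 0, 1, 0])
n = len(y)

theta_hat = y.mean()                           # MLE: (1/n) * sum(y_i)
se = np.sqrt(theta_hat * (1 - theta_hat) / n)  # standard error of theta_hat
ci_low = theta_hat - 1.96 * se                 # approximate 95% CI
ci_high = theta_hat + 1.96 * se
```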
==Exponential Distribution==
* Suppose we have samples from an exponential distribution with parameter lambda:
** <math>X_i \sim \textrm{Exp}( \lambda ) </math>, assuming i.i.d.
* Recall that for i.i.d. samples the joint density is the product of the individual densities: <math>f( \mathbf{x} | \lambda ) = \prod_{i=1}^n \lambda e^{- \lambda x_i } = \lambda^n e ^{-\lambda \sum x_i}</math>
* <math>L( \lambda | \mathbf{x} ) = \lambda^n e ^{-\lambda \sum x_i}</math>
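As a rough numerical check of this likelihood (the derivation is cut off in this section, so the closed form <math>\hat{\lambda} = n / \sum x_i = 1/\bar{x}</math> is stated here as an assumption, being the standard result): maximize the log-likelihood <math>n \log \lambda - \lambda \sum x_i</math> over a grid and compare. The data are made up for illustration.

```python
import numpy as np

# Made-up exponential samples
x = np.array([0.4, 1.2, 0.7, 2.5, 0.9])
n = len(x)

# Grid search over lambda for log L(lambda) = n*log(lambda) - lambda*sum(x_i)
grid = np.linspace(0.01, 5.0, 100_000)
loglik = n * np.log(grid) - grid * x.sum()
lam_grid = grid[np.argmax(loglik)]   # grid-search maximizer

lam_closed = 1 / x.mean()            # assumed closed-form MLE: 1 / sample mean
```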
Revision as of 17:59, 11 May 2020