Bayesian Data Analysis

General

Use Bayes' theorem to update our information. Start with prior beliefs, collect data, then condition on the data to arrive at our posterior beliefs.

"All models are wrong, some are useful. The question you're dealing with in a model is not is it gonna be right. A model by definition is a simplification of reality. When you're modelling something like coronavirus where you know almost nothing about it early on in an epidemic, you are gonna be making assumptions that are going to turn out to be wrong. Are we able to structure the conversation where things can be wrong, and people can update, and listen and evolve." Ezra Klein on the Weeds

IHME downward revision from 200k to 60k. They are trying to cope with the lack of information about the underlying virus. To build a model about this disease, you need to know a lot about its properties. IHME approach is unsound, didn't work. In 70% of cases in the states, the outcome is outside of the 95% CI. That's not just imprecision. Model was built on Wuhan and N. Italy, predicts NYC pretty well. New Orleans didn't work, San Francisco not nearly as bad. Why isn't Florida a disaster? The thing models have trouble with is human behavior. Human behavior changed in reaction to it. People are washing hands a lot, maybe you stopped critical mass, a hard shift into a new equilibrium that a model built on the first hits can't predict. Question about the weather and ambient temperature, UV light: we don't know all the relevant issues. People in Italy kiss each other customarily; kiss-greeting countries fared worse than handshake countries.

Model fitting can be thought of as data compression. Parameters summarize relationships among the data. These summaries compress the data into a simpler form, although with loss of information.

Typical Statistical Modelling Questions

What is the average difference between treatment groups?
How strong is association between treatment and outcome?
Does the effect of a treatment depend on a covariate?
How much variation is there between groups?

Comparison vs other statistical frameworks

Naive Bayes youtube vid
Pros:
- Easy and fast to predict a class of test dataset
- Naive Bayes classifier performs better compared to other models when the independence assumption holds
- Performs well in the case of categorical input variables compared to numerical variables
- Good if you're getting data one at a time and updating posterior.

Cons
- zero frequency: a category never seen in training gets probability zero (solved by smoothing techniques like Laplace estimation, i.e. adding 1 to every count)
- Bad estimator - probability estimates are understood to not be taken too seriously
- Assumption of independent predictors, which is almost never the case.
Applications
- Credit scoring
- Medical
- Real time prediction
- Multi-class predictions
- Text classification, spam filtering, sentiment analysis
- recommendation filtering
Gaussian naive Bayes: assume continuous data has Gaussian distribution
The multinomial naive Bayes classifier becomes a linear classifier when expressed in log-space
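To see why the log-space form is linear, one can write the multinomial likelihood with per-class event probabilities and take logs. The symbols $p_{ki}$, $w_{ki}$ and $b_{k}$ are notation introduced here for illustration, and the constant collects terms that do not depend on the class:

$\log p(C_{k}\mid \mathbf {x} )=\log p(C_{k})+\sum _{i=1}^{n}x_{i}\log p_{ki}+{\text{const}}=b_{k}+\mathbf {w} _{k}^{\top }\mathbf {x} +{\text{const}}$, with $b_{k}=\log p(C_{k})$ and $w_{ki}=\log p_{ki}$

Each class score is then a linear function of the feature counts $x_{i}$, which is what makes the classifier linear.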

Frameworks for defining probability

Classical/frequentist/sampling theory based - outcomes that are (perhaps by definition) equally likely have equal probabilities

Frequentist - have a hypothetical infinite series of events and we look at relative frequencies, which is good when you can take data, but not so good to define P(rain), in which we would have to look at an infinite sequence of tomorrow and see what fraction of tomorrow has rain. Tries to be objective in how it defines probability. Either the die is fair or not, a range of probabilities doesn't make sense in frequentist framework, rolling the die a bunch of times doesn't change whether or not the die is fair.
- (maximum likelihood) estimators for parameters (betas) given by line of best fit. Estimated params have a hat. Each beta-hat is treated as normally distributed, centered at the true beta, with a standard deviation (the standard error) derived from the residual standard deviation
- Key to frequentist: Calculate the sampling distribution of the estimator, calculate whether our value of the estimator is compatible with the hypothetical values

Bayesian - personal perspective. It's your measure of uncertainty, takes into account what you know of a particular problem. Inherently subjective approach to probability, but mathematically rigorous and leads to more intuitive results than frequentist.

Motivation

Reasoning under uncertainty
Bayesian model makes the best use of the information in the data, assuming the small world is an accurate description of the real world.
Model is always an incomplete representation of the real world.
The small world of the model itself versus the large world in which we want the model to operate.
Small world - self contained and logical. No pure surprises.
Performance of model in large world has to be demonstrated rather than logically deduced.
- simulating new data from the model is a useful part of model criticism.
In contrast, animals use heuristics that take adaptive shortcuts and may outperform rigorous Bayesian analysis once costs of information gathering and processing are taken into account. Once you already know what information is useful, being fully Bayesian is a waste.

Description

Bayesian data analysis - producing a story for how the data (observations) came to be.
Bayesian inference = counting and comparing the ways things can happen/possibilities.
In order to make good inference on what actually happened, it helps to consider everything that could have happened.
A quantitative ranking of hypotheses. Counting paths is a measure of relative plausibility
Prior information: instead of building up a possibility tree from scratch given a new observation, it is mathematically equivalent to multiply the prior counts by the new count for each conjecture IF the new observation is logically independent of the previous observations.
- Multiplication is just a shortcut to enumerating and counting up all the paths through the garden of possibilities
- A.k.A., joint probability distribution
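The counting-and-multiplying idea above can be sketched in a few lines. This is a minimal illustration under an assumed setup that is not in the notes: a hypothetical bag of 4 marbles, each blue or white, where each conjecture is a possible number of blue marbles and draws are made with replacement.

```python
from fractions import Fraction

# Hypothetical setup: a bag holds 4 marbles, each blue ("B") or white ("W").
# Each conjecture is a possible number of blue marbles in the bag.
BAG_SIZE = 4
conjectures = [0, 1, 2, 3, 4]


def ways_to_produce(observation, n_blue):
    """Number of marbles in the bag that could have produced one draw."""
    return n_blue if observation == "B" else BAG_SIZE - n_blue


def count_paths(observations, prior_counts=None):
    """Multiply the prior counts by the ways each new observation can occur."""
    counts = list(prior_counts) if prior_counts else [1] * len(conjectures)
    for obs in observations:
        counts = [c * ways_to_produce(obs, n) for c, n in zip(counts, conjectures)]
    return counts


def plausibility(counts):
    """Standardize the counts so they sum to one."""
    total = sum(counts)
    return [Fraction(c, total) for c in counts]


# Observe blue, white, blue; then update with one more blue draw.
counts_after_three = count_paths(["B", "W", "B"])
counts_after_four = count_paths(["B"], prior_counts=counts_after_three)
print(counts_after_three)
print(plausibility(counts_after_four))
```

Updating the old counts with one new draw gives the same result as recounting all four draws from scratch, which is the shortcut the multiplication provides.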
Principle of indifference - when there's no reason to say that one conjecture is more reasonable than the other
The probability of rain and cold both happening on a given day is equal to (probability of rain when it's cold) times (probability that it's cold)
Game where you construct a series of bets where you're guaranteed to lose money ("Dutch book")

Algorithm

Only in edge cases can an analytical solution for the posterior be derived, i.e., a distribution obtained using algebra. The number of models that fall into this case is too limited to be interesting, e.g. just linear models.
- Rather, you use numerical methods: draw samples from the posterior distribution via Monte Carlo and calculate quantities of interest, like mean, standard deviation, etc., from those samples.
Hamiltonian Monte Carlo method is more efficient than Metropolis & Gibbs sampling
Probabilistic programming languages
- BUGS - Bayesian inference Using Gibbs Sampling
- JAGS - "Just Another Gibbs Sampler", reimplementation of BUGS in C++
- Stan - named after Stanislaw Ulam - uses Hamiltonian Monte Carlo, not Gibbs
- brms - provides the function brm() - write one line of code instead of 10ish lines of Jags/Stan code
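To make the sampling idea concrete, here is a minimal sketch of the plain Metropolis algorithm (not Hamiltonian Monte Carlo, and not how any of the tools above are implemented). It assumes a hypothetical coin-flip data set of 6 heads in 10 flips with a flat prior; the function names, step size and seed are illustrative choices.

```python
import math
import random


def log_posterior(theta, heads, flips):
    """Unnormalized log posterior: flat prior times binomial likelihood."""
    if not 0.0 < theta < 1.0:
        return -math.inf
    return heads * math.log(theta) + (flips - heads) * math.log(1.0 - theta)


def metropolis(heads, flips, n_samples=20000, step=0.2, seed=1):
    """Random-walk Metropolis sampler for the coin's probability of heads."""
    rng = random.Random(seed)
    theta = 0.5
    current = log_posterior(theta, heads, flips)
    samples = []
    for _ in range(n_samples):
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, heads, flips)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        samples.append(theta)
    return samples


draws = metropolis(heads=6, flips=10)[2000:]  # drop early draws as burn-in
posterior_mean = sum(draws) / len(draws)
print(posterior_mean)
```

For this conjugate case the exact posterior is Beta(7, 5) with mean 7/12, so the sample mean can be checked against a known answer. In realistic models no such closed form exists and the samples are all you have.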

Definitions

Parameter - Represents different conjecture. A way of indexing the possible explanations of the data. A Bayesian machine's job is to describe what the data tells us about an unknown parameter.
Likelihood - the relative number of ways that parameter of a given value can produce the observed data.
Prior probability - prior plausibility. Engineering assumptions chosen to help the machine learn.
- regularizing prior, weakly informative prior: Flat prior is common but hardly the best prior. Priors that gently nudge the machine usually improve inference. Tell the model not to get too excited by the data.
- Penalized likelihood - constrain parameters to reasonable ranges. Values of p=0 and p=1 are highly implausible
- Subjective bayesian - used in philosophy and economics, rarely used in natural and social sciences.
- Alter the prior to see how sensitive inference is to that assumption of the prior.
posterior probability - updated plausibility
- p( unknowns | knowns ) - conditional/posterior joint probability over all variables
Posterior distribution - relative plausibility of different parameter estimates conditional on the data.
Randomization - processing something so we know almost nothing about its arrangement. A truly randomized deck of cards will have an ordering that has high information entropy.
A story for how your observed data came to be may be descriptive or causal. Sufficient for specifying an algorithm for simulating new data.

Math

$Pr(C_{k}\mid \mathbf {x} )={\frac {Pr(C_{k})\ p(\mathbf {x} \mid C_{k})}{Pr(\mathbf {x} )}}$, i.e. ${\text{posterior}}={\frac {{\text{prior}}\times {\text{likelihood}}}{\text{average likelihood}}}$
Average likelihood of the data - Averaged over the prior. Its job is to standardize the posterior so that it sums (integrates) to 1. The average likelihood just standardizes the counts so they sum to one.
$Pr(\mathbf {x} )=E(p(\mathbf {x} \mid C_{k}))=\sum _{k}Pr(C_{k})\ p(\mathbf {x} \mid C_{k})$
In practice there is only interest in the numerator of that fraction, because the denominator does not depend on C and the feature values $x_{i}$ are given, so the denominator is effectively constant.
The numerator is equivalent to the joint probability model. The posterior is proportional to the product of the prior and the likelihood. You can think of the prior and the likelihood as two signals multiplied together. We condition the prior on the data.
If we assume each feature is conditionally independent of every other, then the joint model can be expressed as

${\begin{aligned}p(C_{k}\mid x_{1},\dots ,x_{n})&\varpropto p(C_{k},x_{1},\dots ,x_{n})\\&=p(C_{k})\ p(x_{1}\mid C_{k})\ p(x_{2}\mid C_{k})\ p(x_{3}\mid C_{k})\ \cdots \\&=p(C_{k})\prod _{i=1}^{n}p(x_{i}\mid C_{k})\,,\end{aligned}}$

Classifier combines probability model with a decision rule, i.e. maximum a posteriori
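The factorized joint model and the maximum a posteriori rule can be sketched as a tiny categorical naive Bayes classifier. This is an illustrative sketch with hypothetical toy data and function names, using add-one (Laplace) smoothing so that unseen values do not get zero probability.

```python
import math
from collections import Counter, defaultdict


def train(rows, labels, alpha=1.0):
    """Collect class counts and per-feature value counts."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, feature index) -> value counts
    feature_values = defaultdict(set)    # feature index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(label, i)][value] += 1
            feature_values[i].add(value)
    return class_counts, value_counts, feature_values, alpha


def log_joint(model, row, label):
    """log p(C_k) + sum_i log p(x_i | C_k), with add-alpha smoothing."""
    class_counts, value_counts, feature_values, alpha = model
    total = sum(class_counts.values())
    score = math.log(class_counts[label] / total)
    for i, value in enumerate(row):
        numerator = value_counts[(label, i)][value] + alpha
        denominator = class_counts[label] + alpha * len(feature_values[i])
        score += math.log(numerator / denominator)
    return score


def predict(model, row):
    """Maximum a posteriori rule: pick the class with the largest joint."""
    return max(model[0], key=lambda label: log_joint(model, row, label))


# Hypothetical toy data: (subject keyword, contains a link) -> label.
rows = [("free", "link"), ("free", "nolink"), ("meeting", "nolink"),
        ("meeting", "link"), ("free", "link")]
labels = ["spam", "spam", "ham", "ham", "spam"]
model = train(rows, labels)
print(predict(model, ("free", "link")), predict(model, ("meeting", "nolink")))
```

The denominator $Pr(\mathbf {x} )$ never appears: comparing the joint across classes is enough to pick the most probable class.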

Conditional and Joint probability

Conditional probability looks at a subsegment of population
Conditional uses pipe, joint uses upside down U (intersection)
What is the probability that a given observation D belongs to a given class C, $p(C\mid D)$
"The probability of A under the condition B" $p(A\mid B)$
There need not be a causal relationship
Compare with UNconditional probability $p(A)$
If $p(A\mid B)=p(A)$, then the events are independent: knowledge about either event does not give information on the other. Equivalently, $P(A\cap B)=P(A)P(B).$
Don't falsely equate $p(A\mid B)$ and $p(B\mid A)$
Defined as the quotient of the joint probability of events A and B and the probability of B: $P(A\mid B)={\frac {P(A\cap B)}{P(B)}}$, where the numerator is the probability that both events A and B occur.
Joint probability $P(A\cap B)=P(A\mid B)P(B)$
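A small numerical check of these definitions, using a hypothetical joint distribution over weather and temperature (the numbers are made up for illustration):

```python
from fractions import Fraction

# Hypothetical joint distribution over (weather, temperature).
joint = {
    ("rain", "cold"): Fraction(3, 10),
    ("rain", "warm"): Fraction(1, 10),
    ("dry", "cold"): Fraction(2, 10),
    ("dry", "warm"): Fraction(4, 10),
}


def marginal(index, value):
    """Sum the joint over every outcome whose component `index` equals `value`."""
    return sum(p for outcome, p in joint.items() if outcome[index] == value)


def conditional(weather, temperature):
    """P(weather | temperature) = P(weather and temperature) / P(temperature)."""
    return joint[(weather, temperature)] / marginal(1, temperature)


p_rain = marginal(0, "rain")
p_rain_given_cold = conditional("rain", "cold")
p_cold_given_rain = joint[("rain", "cold")] / marginal(0, "rain")
print(p_rain, p_rain_given_cold, p_cold_given_rain)
```

Here $p(\text{rain}\mid \text{cold})$ differs from both $p(\text{rain})$ and $p(\text{cold}\mid \text{rain})$, so the two events are dependent and the two conditionals must not be equated.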

Bayesian Predictive Distributions

In the long run, data should drown out the prior

Prior Predictive Distribution

$P(\theta \leq c)$ for all $c\in \mathbb {R}$ (on the real line)
- Define a cumulative distribution function for the parameter
Compute/calibrate a predictive interval such that 95% of new data points will fall in this interval
- This is an interval for the data (y or x) rather than for theta
- $f(y)=\int f(y|\theta )f(\theta )d\theta =\int f(y,\theta )d\theta$
- Integrating the joint density of y and theta, integrating out theta to get the marginal for y.
- This is our prior predictive before we observe any data

Binomial Example

We are going to flip a coin 10 times and count the number of heads
Question: How many heads do we predict we're going to see?
$X=\sum _{i=1}^{10}Y_{i}$
If we think that all possible probabilities are equally likely, we can put a prior for theta that's flat over the interval 0 to 1
- $f(\theta )=I_{\{0\leq \theta \leq 1\}}$
- X can take possible values 0, 1, 2, 3, ..., 10
$f(x)=\int f(x|\theta )f(\theta )d\theta$
- Predictive distribution is the integral of the likelihood times the prior
- We have a binomial likelihood
$f(x)=\int _{0}^{1}{\frac {10!}{x!(10-x)!}}\theta ^{x}(1-\theta )^{10-x}(1)d\theta$
- (1) is our prior
- What's the difference between Binomial density and Bernoulli density? Binomial is just the count of successes, whereas Bernoulli's would deal with the ordering
- For most of the analyses we're doing, where we're interested in theta rather than x, the binomial and the Bernoulli are interchangeable because the part that depends on theta (the part outside of the n choose x) is the same
$n!=\Gamma (n+1)$
- gamma function is a generalization of factorial function that can be used for non-integers
If $z\sim {\textrm {Beta}}(\alpha ,\beta )$, then $f(z)={\frac {\Gamma (\alpha +\beta )}{\Gamma (\alpha )\Gamma (\beta )}}z^{\alpha -1}(1-z)^{\beta -1}$
- Now the goal is to simplify the interior of the integral to look like Beta distribution
${\begin{aligned}f(x)&=\int _{0}^{1}{\frac {\Gamma (11)}{\Gamma (x+1)\Gamma (11-x)}}\theta ^{(x+1)-1}(1-\theta )^{(11-x)-1}(1)d\theta \\&={\frac {\Gamma (11)}{\Gamma (12)}}\int _{0}^{1}{\frac {\Gamma (12)}{\Gamma (x+1)\Gamma (11-x)}}\theta ^{(x+1)-1}(1-\theta )^{(11-x)-1}d\theta \end{aligned}}$
- Now everything in the integral is a Beta density function, with $\alpha =(x+1)$ and $\beta =(11-x)$ and all densities integrate up to 1
$f(x)={\frac {\Gamma (11)}{\Gamma (12)}}(1)={\frac {10!}{11!}}={\frac {1}{11}}$ for $x\in \{0,1,2,...,10\}$
- Thus we see that if we start with a uniform prior, we end up with a discrete uniform predictive density for X. If all possible coins (all possible probabilities) are equally likely, then all possible X outcomes are equally likely.
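The 1/11 result can be checked with exact arithmetic. The sketch below evaluates the integral through the Beta function for integer arguments; the function names are illustrative.

```python
from fractions import Fraction
from math import comb, factorial


def beta_function(a, b):
    """B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b) for positive integers a, b."""
    return Fraction(factorial(a - 1) * factorial(b - 1), factorial(a + b - 1))


def prior_predictive(x, n=10):
    """f(x) = C(n, x) * integral of theta^x (1 - theta)^(n - x) under a flat prior."""
    return comb(n, x) * beta_function(x + 1, n - x + 1)


predictive = [prior_predictive(x) for x in range(11)]
print(predictive)
```

Every entry comes out the same, and the eleven values sum to one, matching the discrete uniform predictive distribution derived above.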

Posterior Predictive Distribution

E.g., what's our predicted distribution for the second coin flip, after we've seen a head on the first flip?
$f(y_{2}|y_{1})=\int f(y_{2}|\theta ,y_{1})f(\theta |y_{1})d\theta$
$Y_{2}\perp Y_{1}\mid \theta \implies f(y_{2}|y_{1})=\int f(y_{2}|\theta )f(\theta |y_{1})d\theta$
- Since y2 is independent of y1 given theta, the above expression simplifies
- Looks like the prior predictive, except we're using the posterior distribution for theta instead of the prior distribution

Binomial Example, continued

If we assumed uniform distribution prior, and observe one flip as head, what do we predict for the second flip?
- Head coming up on first flip gives us some information about the coin. We now think it's more likely we're going to get a second head, because it's more likely that theta is at least 0.5, and possibly larger than 0.5.
$f(y_{2}|Y_{1}=1)=\int _{0}^{1}\theta ^{y_{2}}(1-\theta )^{1-y_{2}}2\theta d\theta =\int _{0}^{1}2\theta ^{(y_{2}+1)}(1-\theta )^{1-y_{2}}d\theta$
- Since there are only two possible values for y2, it's easy to split the expression into two separate expressions for each of the values
- $P(Y_{2}=1|Y_{1}=1)=\int _{0}^{1}2\theta ^{2}d\theta ={\frac {2}{3}}$
- $P(Y_{2}=0|Y_{1}=1)={\frac {1}{3}}$ - which is the complement
The posterior is a combination of the information in the prior and the information in the data. In this case our prior is like having two data points, one head and one tail. Saying we have a uniform prior for theta is equivalent in an information sense to having observed one head and one tail. Thus when we do observe one head, it's like we now have seen two heads and one tail.
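As a numerical cross-check of the posterior predictive for the second flip, the integral can be approximated on a grid. This sketch uses a simple midpoint rule and takes the posterior after one head to be the density $2\theta$ on the interval 0 to 1, as in the example; the function name and grid size are arbitrary.

```python
def posterior_predictive(y2, grid_size=100_000):
    """Midpoint-rule approximation of the integral over theta in [0, 1]."""
    width = 1.0 / grid_size
    total = 0.0
    for i in range(grid_size):
        theta = (i + 0.5) * width
        likelihood = theta ** y2 * (1.0 - theta) ** (1 - y2)  # Bernoulli for flip 2
        posterior = 2.0 * theta  # density of theta after observing one head
        total += likelihood * posterior * width
    return total


p_head = posterior_predictive(1)
p_tail = posterior_predictive(0)
print(p_head, p_tail)
```

The two values should land very close to 2/3 and 1/3, in line with the "two heads and one tail" reading of the updated information.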

Conjugate Prior

conjugacy - when the posterior is in the same distribution family as the prior.
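For the coin example, the Beta family is conjugate to the binomial likelihood, so updating reduces to adding counts. A minimal sketch, with illustrative function names, treating the flat prior as Beta(1, 1):

```python
from fractions import Fraction


def beta_binomial_update(alpha, beta, heads, tails):
    """Beta(alpha, beta) prior + coin-flip data -> Beta(alpha + heads, beta + tails)."""
    return alpha + heads, beta + tails


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return Fraction(alpha, alpha + beta)


flat_prior = (1, 1)  # the uniform prior on [0, 1] is Beta(1, 1)
posterior = beta_binomial_update(*flat_prior, heads=1, tails=0)
print(posterior, beta_mean(*posterior))

# Updating one flip at a time gives the same posterior as updating in one batch.
step_one = beta_binomial_update(*flat_prior, heads=1, tails=0)
step_two = beta_binomial_update(*step_one, heads=0, tails=1)
batch = beta_binomial_update(*flat_prior, heads=1, tails=1)
```

Because the posterior stays in the Beta family, it can serve directly as the prior for the next batch of data.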

Bayesian Network

A Bayesian network is a way to reduce the size of the representation, a "succinct way" of representing a distribution
Alternative: store the probability distribution explicitly in a table
x1 .. x10 are booleans
What is the size of the table for the set of vars P[x1, ..., x10]? 2^n = 2^10 = 1024 entries
How can we rewrite the joint distribution? P[x1, x2, ..., x10] = P[x1 | x2, ..., x10] * P[x2, ..., x10]
= P[x1 | x2, ..., x10] * P[x2 | x3, ..., x10] * ... * P[x9 | x10] * P[x10]
P[xi | xi+1, ..., xn] = P[xi] if xi is totally independent of the others
Sometimes xi can also be conditionally independent, i.e. dependent on only a subset of the other variables
the variable on which P[Xi] depends "subsumes" the other variables
belief network - order of variables matters when setting up dependencies in belief network.
Count parents of each node to figure out size of conditional probability tables
If an improper ordering is used, the result is still a valid representation of the joint probability function, but it would require producing conditional probability tables which aren't natural or are difficult to obtain experimentally. It could also inflate the conditional tables, so the table representation is large compared to other orderings.
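Counting parents to size the tables can be sketched as follows. The network structure here is a hypothetical five-variable example chosen for illustration, with every variable boolean.

```python
# Hypothetical structure over boolean variables: each key lists its parents.
parents = {
    "burglary": [],
    "earthquake": [],
    "alarm": ["burglary", "earthquake"],
    "john_calls": ["alarm"],
    "mary_calls": ["alarm"],
}


def cpt_rows(network):
    """One row per combination of parent values, for each boolean node."""
    return {node: 2 ** len(node_parents) for node, node_parents in network.items()}


def full_joint_size(network):
    """Entries in an explicit table over all boolean variables."""
    return 2 ** len(network)


rows = cpt_rows(parents)
network_size = sum(rows.values())
print(rows, network_size, full_joint_size(parents))
```

With at most two parents per node the network needs 10 rows in total, against 32 entries for the explicit joint table. The gap widens quickly as variables are added, provided the ordering keeps parent sets small.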

Incremental Network Construction

Choose the set of relevant variables X that describe the domain
Choose an ordering for the variables (very important step)
While there are variables left:
1. dequeue variable X off the queue and add a node for it
2. Set Parents(X) to some minimal set of existing nodes such that the conditional independence property is satisfied
3. Define the conditional probability table

inferences using belief networks

diagnostic inferences (from effects to causes: given symptoms, what is the probability of disease)
causal inferences (from causes to effects: given disease, what is the probability of a symptom)
intercausal inferences
mixed inferences

Information entropy - the measure of uncertainty

Information - the reduction in uncertainty derived from learning an outcome.
The measure of uncertainty should be
- continuous
- larger when there are more kinds of events to predict
- additive: the uncertainty of a combination of events is the sum of the separate uncertainties
How hard is it to hit the target?
The uncertainty contained in a probability distribution is the negative average log-probability of an event
Information entropy $H(p)=-E\log(p_{i})=-\sum _{i=1}^{n}p_{i}\log(p_{i})$
H = log(# of outcomes/states) when all outcomes are equally likely. In the general formula:
- n different possible events
- each event i
- probability of each event p_i
For two events with p1 = 0.3 and p2 = 0.7, then $H(p)=-(p_{1}log(p_{1})+p_{2}log(p_{2}))\approx 0.61$
The measure of uncertainty decreases from 0.61 to 0.06 when the probabilities are p1=0.01 and p2=0.99. There's much less uncertainty on any given day.
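The two entropy values quoted above can be reproduced directly from the formula. This sketch uses the natural logarithm, which is what gives 0.61 and 0.06; the function name is illustrative.

```python
import math


def entropy(probabilities):
    """H(p) = -sum p_i log(p_i), natural log; terms with p_i = 0 contribute 0."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


h_uncertain = entropy([0.3, 0.7])
h_confident = entropy([0.01, 0.99])
h_uniform = entropy([0.25] * 4)  # four equally likely outcomes
print(round(h_uncertain, 2), round(h_confident, 2), h_uniform)
```

For the uniform case the result equals log(4), the special case where entropy is just the log of the number of outcomes.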
Maximum entropy - given what we know, what is the least surprising distribution
Conditional entropy
- $H(Y|X)=\sum \limits _{x\in X}Pr(x)H(Y|X=x)$
Chain rule of entropy
- $H(X,Y)=H(X)+H(Y|X)$
- Entropy of a pair of RVs = entropy of one + conditional entropy of the other

Divergence

Relative entropy
- Measure of distance between two distributions (not a true metric, since it is not symmetric)
- A measure of the inefficiency of assuming that the distribution is q when the true distribution is p
- If we use distribution q to construct a code, we need $H(p)+D(p\|q)$ bits on average to describe the RV
Divergence - the additional uncertainty induced by using probabilities from one distribution to describe another distribution
How we use information entropy to say how far a model is from the target
Divergence is the average difference in log probability between the target and the model.
Divergence helps us contrast different approximations to p
Use divergence to compare accuracy of models
Divergence is measuring how far q is from the target p in units of entropy
Divergence is not symmetric: H(p,q) is not equal to H(q,p). E.g., there is more uncertainty induced by using Mars to predict Earth than vice versa. Going from Mars to Earth, Mars has so little water on its surface that we will be very surprised when we most likely land on water on Earth.
If we use a distribution with high entropy to approximate an unknown distribution of true events, we will reduce the distance to the truth and therefore the error.
Cross-entropy = entropy + KL Divergence
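The asymmetry and the cross-entropy identity can both be checked numerically. The water/land proportions below are assumed values for illustration (0.7/0.3 for Earth and 0.01/0.99 for Mars), and the function names are illustrative.

```python
import math


def entropy(p):
    """H(p) = -sum p_i log(p_i)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """H(p, q) = -sum p_i log(q_i): events follow p, predictions use q."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    """D(p || q) = sum p_i (log p_i - log q_i)."""
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)


earth = [0.7, 0.3]    # assumed proportions of water and land
mars = [0.01, 0.99]   # assumed proportions of water and land

mars_predicts_earth = kl_divergence(earth, mars)  # target Earth, model Mars
earth_predicts_mars = kl_divergence(mars, earth)  # target Mars, model Earth
print(mars_predicts_earth, earth_predicts_mars)
print(cross_entropy(earth, mars), entropy(earth) + mars_predicts_earth)
```

Using Mars to predict Earth gives the larger divergence, because the model assigns very little probability to the event (water) that happens most often on the target.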

Mutual Information

In the Venn diagram of overlapping entropies, MI is the slice in the middle. It's the intersection of the information in X and the information in Y.
MI concerns the outcome of two random variables
MI measures reduction in uncertainty for predicting parts of outcome of a system after we observe the outcome of the other parts of the system.
If we know the value of one of the random variables in a system, there is a corresponding reduction in uncertainty for predicting the other one
MI measures that reduction in uncertainty
Entropy = ideal measure of uncertainty in our system
Entropy = a measure of information content of some random process
Entropy = how much information do we gain by knowing the outcome of some process
For two discrete processes:
- $I(X,Y)=H(X)-H(X|Y)=\sum \limits _{x\in X}\sum \limits _{y\in Y}Pr(x,y)\log {\dfrac {Pr(x,y)}{Pr(x)Pr(y)}}$
- The joint distribution divided through by the product of the marginal distributions
- If we have two continuous processes, both of the sums become integrals
- If X and Y are independent, then Pr(X,Y) simplifies to Pr(X)*Pr(Y). The term inside the log becomes 1, and the log of 1 is zero, so the mutual information is zero for independent random variables
  - The outcome of one variable tells us nothing about the outcome of another variable.
  - There's no reduction in uncertainty in the system for var X for knowing the outcome of var Y
For the completely dependent case
- The reduction of uncertainty of one of the variables is equal to its marginal uncertainty
- Equal to one bit when each variable is a fair binary outcome
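Both limiting cases can be verified with a short function that computes mutual information in bits from a joint table. The two joint distributions are hypothetical examples over a pair of fair binary variables.

```python
import math


def mutual_information(joint):
    """Mutual information in bits from a dict mapping (x, y) to joint probability."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p  # marginal of X
        py[y] = py.get(y, 0.0) + p  # marginal of Y
    return sum(
        p * math.log2(p / (px[x] * py[y]))
        for (x, y), p in joint.items()
        if p > 0
    )


# Two fair coins flipped independently.
independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
# Y is always an exact copy of a fair coin X.
dependent = {(0, 0): 0.5, (1, 1): 0.5}

mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
print(mi_independent, mi_dependent)
```

The independent pair carries zero mutual information, while the fully dependent pair carries one bit, which is the whole marginal entropy of a fair binary variable.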
