Bayesian Data Analysis

General

  • Model fitting can be thought of as data compression. Parameters summarize relationships among the data. These summaries compress the data into a simpler form, although with loss of information.

Typical Statistical Modelling Questions

  • What is the average difference between treatment groups?
  • How strong is the association between treatment and outcome?
  • Does the effect of a treatment depend on a covariate?
  • How much variation is there between groups?

Compare vs. Frequentist

  • Naive Bayes youtube vid
  • Pros:
    • Easy and fast to predict a class of test dataset
    • The naive Bayes classifier performs well compared to other models when the independence assumption holds
    • Performs well in the case of categorical input variables compared to numerical variables
  • Cons
    • Zero frequency (addressed by smoothing techniques like Laplace estimation, i.e., adding 1 to each count to avoid dividing by zero)
    • Bad estimator - its probability estimates are understood to not be taken too seriously
    • Assumption of independent predictors, which is almost never the case.
  • Applications
    • Credit scoring
    • Medical
    • Real time prediction
    • Multi-class predictions
    • Text classification, spam filtering, sentiment analysis
    • Recommendation filtering
  • Gaussian naive Bayes: assume continuous features follow a Gaussian distribution (see the sketch after this list)
  • The multinomial naive Bayes classifier becomes a linear classifier when expressed in log-space
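
A minimal from-scratch sketch of Gaussian naive Bayes (the helper names and toy data are invented for illustration): estimate a per-class prior plus per-feature means and variances, then classify with the MAP decision rule in log-space, which avoids underflow from multiplying many small probabilities.

  import math

  def fit(X, y):
      """Estimate per-class priors, means, and variances for each feature."""
      model = {}
      for c in set(y):
          rows = [x for x, label in zip(X, y) if label == c]
          n, d = len(rows), len(rows[0])
          means = [sum(r[j] for r in rows) / n for j in range(d)]
          varis = [sum((r[j] - means[j]) ** 2 for r in rows) / n + 1e-9
                   for j in range(d)]  # small floor avoids zero variance
          model[c] = (n / len(y), means, varis)
      return model

  def log_gauss(x, mu, var):
      return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)

  def predict(model, x):
      # MAP rule: argmax over classes of log prior + sum of log likelihoods
      scores = {c: math.log(prior) + sum(log_gauss(xj, m, v)
                                         for xj, m, v in zip(x, means, varis))
                for c, (prior, means, varis) in model.items()}
      return max(scores, key=scores.get)

  X = [[1.0, 2.1], [1.2, 1.9], [3.0, 3.9], [3.2, 4.1]]  # toy data, two classes
  y = [0, 0, 1, 1]
  model = fit(X, y)
  print(predict(model, [1.1, 2.0]), predict(model, [3.1, 4.0]))  # 0 1

Summing log probabilities rather than multiplying raw ones is also why the multinomial variant's score is linear in log-space.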

Motivation

  • Reasoning under uncertainty
  • Bayesian model makes the best use of the information in the data, assuming the small world is an accurate description of the real world.
  • Model is always an incomplete representation of the real world.
  • The small world of the model itself versus the large world in which we want the model to operate.
  • Small world - self-contained and logical. No pure surprises.
  • Performance of model in large world has to be demonstrated rather than logically deduced.
    • simulating new data from the model is a useful part of model criticism.
  • In contrast, animals use heuristics that take adaptive shortcuts and may outperform rigorous Bayesian analysis once the costs of information gathering and processing are taken into account. Once you already know what information is useful, being fully Bayesian is a waste.

Description

  • Bayesian data analysis - producing a story for how the data (observations) came to be.
  • Bayesian inference = counting and comparing the ways things can happen/possibilities.
  • In order to make good inference on what actually happened, it helps to consider everything that could have happened.
  • A quantitative ranking of hypotheses. Counting paths is a measure of relative plausibility.
  • Prior information: instead of building up a possibility tree from scratch given a new observation, it is mathematically equivalent to multiply the prior counts by the new count for each conjecture, IF the new observation is logically independent of the previous observations.
    • Multiplication is just a shortcut to enumerating and counting up all the paths through the garden of possibilities (see the sketch after this list)
    • A.k.a., the joint probability distribution
  • Principle of indifference - when there's no reason to say that one conjecture is more reasonable than the other
  • The probability of rain and cold both happening on a given day is equal to (probability of rain when it's cold) times (probability that it's cold)
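
A brute-force sketch of the path counting described above, with an assumed bag of four marbles (illustrative values): enumerating every possible draw sequence gives the same count as multiplying the per-draw counts, which is the multiplication shortcut.

  from itertools import product

  data = ["blue", "white", "blue"]  # observed draws, with replacement
  # conjectures: how many of the 4 marbles in the bag are blue?
  conjectures = {k: ["blue"] * k + ["white"] * (4 - k) for k in range(5)}

  for k, bag in conjectures.items():
      # brute force: walk every path through the garden of possibilities
      paths = sum(1 for seq in product(bag, repeat=len(data)) if list(seq) == data)
      # shortcut: multiply the number of ways to produce each observation
      shortcut = 1
      for obs in data:
          shortcut *= bag.count(obs)
      assert paths == shortcut
      print(k, "blue marbles:", paths, "paths")  # 0, 3, 8, 9, 0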

Definitions

  • Parameter - represents a different conjecture. A way of indexing the possible explanations of the data. A Bayesian machine's job is to describe what the data tells us about an unknown parameter.
  • Likelihood - the relative number of ways that a parameter of a given value can produce the observed data.
  • Prior probability - prior plausibility. Engineering assumptions chosen to help the machine learn.
    • regularizing prior, weakly informative prior: Flat prior is common but hardly the best prior. Priors that gently nudge the machine usually improve inference. Tell the model not to get too excited by the data.
    • Penalized likelihood - constrain parameters to reasonable ranges. Values of p=0 and p=1 are highly implausible
    • Subjective bayesian - used in philosophy and economics, rarely used in natural and social sciences.
    • Alter the prior to see how sensitive inference is to that assumption of the prior.
  • Posterior probability - updated plausibility
  • Posterior distribution - the relative plausibility of different parameter estimates, conditional on the data.
  • Randomization - processing something so we know almost nothing about its arrangement. A truly randomized deck of cards will have an ordering that has high information entropy.
  • A story for how your observed data came to be may be descriptive or causal. Sufficient for specifying an algorithm for simulating new data.

Math

  • Average likelihood of the data - averaged over the prior. Its job is to standardize the posterior so that it sums (integrates) to 1. The average likelihood just standardizes the counts so they sum to one.
  • In practice there is only interest in the numerator of that fraction, because the denominator does not depend on C and the feature values are given, so the denominator is effectively constant.
  • The numerator is equivalent to the joint probability model. The posterior is proportional to the product of the prior and the likelihood. You can think of prior and likelihood as two signals multiplied together. We condition the prior on the data.
  • If we assume each feature is conditionally independent of every other, then the joint model can be expressed as Pr(C, x1, ..., xn) = Pr(C) * Pr(x1|C) * ... * Pr(xn|C)

  • A classifier combines the probability model with a decision rule, e.g., maximum a posteriori (MAP): choose the class that maximizes the posterior (a sketch of the full machinery follows)
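
A grid-approximation sketch of the machinery above (the data and the flat prior are assumed for illustration): evaluate prior times likelihood over a grid of parameter values, then divide by the sum; that normalization is the average likelihood's only job.

  import math

  grid = [i / 100 for i in range(101)]          # candidate values of parameter p
  prior = [1.0] * len(grid)                     # flat prior (illustrative choice)
  k, n = 6, 9                                   # assumed data: 6 successes in 9 trials
  likelihood = [math.comb(n, k) * p**k * (1 - p)**(n - k) for p in grid]

  unstd = [pr * lk for pr, lk in zip(prior, likelihood)]
  posterior = [u / sum(unstd) for u in unstd]   # standardize so it sums to 1

  print(sum(posterior))                         # ~1.0
  print(grid[posterior.index(max(posterior))])  # ~0.67, i.e., 6/9

Swapping the flat prior for one that down-weights p near 0 or 1 is the regularizing prior idea from the Definitions section.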

Conditional probability

  • What is the probability that a given observation D belongs to a given class C?
  • "The probability of A under the condition B"
  • There need not be a causal relationship
  • Compare with UNconditional probability
  • If Pr(A|B) = Pr(A), then the events are independent; knowledge about either event does not give information on the other. Otherwise, Pr(A|B) != Pr(A) and the events are dependent.
  • Don't falsely equate Pr(A|B) and Pr(B|A)
  • Defined as the quotient of the joint probability of events A and B and the probability of B: Pr(A|B) = Pr(A and B) / Pr(B), where the numerator is the probability that both events A and B occur.
  • Joint probability - Pr(A and B), the probability that both events occur; equivalently Pr(A|B) * Pr(B) (see the sketch below)
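
A tiny sketch with made-up numbers for the rain/cold example: read the conditional off a joint table, and check both the product rule and the asymmetry warned about above.

  # Joint distribution over (weather, temperature); values are illustrative.
  joint = {
      ("rain", "cold"): 0.20, ("rain", "warm"): 0.05,
      ("dry",  "cold"): 0.30, ("dry",  "warm"): 0.45,
  }
  pr_cold = sum(p for (w, t), p in joint.items() if t == "cold")  # marginal: 0.5
  pr_rain = sum(p for (w, t), p in joint.items() if w == "rain")  # marginal: 0.25

  pr_rain_given_cold = joint[("rain", "cold")] / pr_cold  # 0.4
  pr_cold_given_rain = joint[("rain", "cold")] / pr_rain  # 0.8, not the same thing

  # product rule: Pr(rain and cold) = Pr(rain | cold) * Pr(cold)
  assert abs(pr_rain_given_cold * pr_cold - joint[("rain", "cold")]) < 1e-12
  print(pr_rain_given_cold, pr_cold_given_rain)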

Bayesian Network

  • A Bayesian network is a way to reduce the size of the representation, a "succinct way" of representing a distribution
  • The naive alternative is to store the probability distribution explicitly in a table
  • E.g., x1 .. x10 are booleans
  • The size of the table for the full set of vars P[x1 ... x10] is 2^n = 1024 entries
  • The joint pdf can be rewritten as P[x1, x2, ..., x10] = P[x1 | x2, ..., x10] * P[x2, ..., x10]
  • = P[x1 | x2, ..., x10] * P[x2 | x3, ..., x10] * ... * P[X_{n-1} | X_n] * P[X_n]
  • P[Xi|Xi+1, ..., Xn] = P[Xi] if Xi is totally independent of the others
  • Sometimes a variable can also be conditionally independent, dependent on only a subset of the other variables
  • the variable on which P[Xi] depends "subsumes" the other variables
  • Belief network - the order of variables matters when setting up dependencies in a belief network.
  • Count parents of each node to figure out size of conditional probability tables
  • Using an improper ordering still results in a valid representation of the joint probability function, but it would require producing conditional probability tables which aren't natural and are difficult to obtain experimentally. It could also inflate the conditional tables, making the table representation large compared to other orderings (see the sketch below).
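
A sketch of the table-size savings (the parent structure below is invented for illustration): the full joint over ten booleans needs 2^10 entries, while the belief network needs only one row per parent configuration in each node's conditional probability table.

  # Count CPT rows from the parents of each node, per the bullet above.
  parents = {
      "x1": [], "x2": [], "x3": ["x1"], "x4": ["x1", "x2"],
      "x5": ["x2"], "x6": ["x3"], "x7": ["x3", "x4"],
      "x8": ["x5"], "x9": ["x6", "x7"], "x10": ["x8"],
  }
  full_joint = 2 ** len(parents)                           # 1024 entries
  cpt_rows = sum(2 ** len(ps) for ps in parents.values())  # 24 rows
  print(full_joint, "vs", cpt_rows)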

Incremental Network Construction

  1. Choose the relevant set of variables X that describe the domain
  2. Choose an ordering for the variables (very important step)
  3. While there are variables left:
    1. Dequeue a variable X off the queue and add a node for it
    2. Set Parents(X) to some minimal set of existing nodes such that the conditional independence is satisfied
    3. Define the conditional probability table (a sketch of this loop follows)
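
A sketch of this construction loop on the classic burglary/alarm network; the depends_on map is a hypothetical stand-in for the domain knowledge or independence test that step 3.2 requires.

  ordering = ["burglary", "earthquake", "alarm", "john_calls", "mary_calls"]
  depends_on = {                     # hypothetical conditional-dependence knowledge
      "burglary": set(), "earthquake": set(),
      "alarm": {"burglary", "earthquake"},
      "john_calls": {"alarm"}, "mary_calls": {"alarm"},
  }
  network = {}
  for var in ordering:               # dequeue variables in the chosen order
      existing = set(network)
      # minimal parent set: existing nodes var is not conditionally independent of
      network[var] = existing & depends_on[var]
      # (a real system would now fill in the conditional probability table)
  print(network)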

Inferences using belief networks

  • Diagnostic inferences (from effects to causes, e.g., given symptoms, what is the probability of the disease?)
  • Causal inferences (from causes to effects, e.g., given the disease, what is the probability of a symptom?)
  • Intercausal inferences (between causes of a common effect, e.g., explaining away)
  • Mixed inferences (combining two or more of the above)


Information entropy - the measure of uncertainty

  • Information - the reduction in uncertainty derived from learning an outcome.
  • The measure of uncertainty should be
    • continuous
    • larger when there are more kinds of events to predict
    • the sum of all the separate uncertainties
  • How hard is it to hit the target?
  • The uncertainty contained in a probability distribution is the average log-probability of an event
  • Information entropy: H(p) = -E[log(p_i)] = -sum_{i=1..n} p_i log(p_i)
  • For a uniform distribution, H = log(# of outcomes/states)
    • n different possible events
    • each event i
    • probability of each event p_i
  • For two events with p1 = 0.3 and p2 = 0.7, H = -(0.3 log(0.3) + 0.7 log(0.7)) ≈ 0.61 using natural logs (see the sketch after this list)
  • The measure of uncertainty decreases from 0.61 to 0.06 when the probabilities are p1=0.01 and p2=0.99. There's much less uncertainty on any given day.
  • Maximum entropy - given what we know, what is the least surprising distribution?
  • Conditional entropy
    • H(Y|X) = sum_{x in X} Pr(x) H(Y|X=x)
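
A short sketch reproducing the entropy values quoted above, plus the uniform case where H equals the log of the number of states.

  import math

  def entropy(p):
      # H(p) = -sum_i p_i * log(p_i), skipping zero-probability events
      return -sum(pi * math.log(pi) for pi in p if pi > 0)

  print(round(entropy([0.3, 0.7]), 2))    # 0.61
  print(round(entropy([0.01, 0.99]), 2))  # 0.06
  print(round(entropy([0.25] * 4), 2))    # log(4) ~= 1.39, the uniform case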

Divergence

  • Divergence - the additional uncertainty induced by using probabilities from one distribution to describe another distribution
  • How we use information entropy to say how far a model is from the target
  • Divergence is the average difference in log probability between the target p and the model q: D_KL(p, q) = sum_i p_i (log(p_i) - log(q_i))
  • Divergence helps us contrast different approximations to p
  • Use divergence to compare accuracy of models
  • Divergence is measuring how far q is from the target p in units of entropy
  • H(p,q) is not equal to H(q,p). E.g., there is more uncertainty induced by using Mars to predict Earth than vice versa. Going from Mars to Earth, Mars has so little water on its surface that we will be very surprised when we land on water on Earth, the most likely outcome.
  • If we use a distribution with high entropy to approximate an unknown distribution of true events, we will reduce the distance to the truth and therefore the error.
  • Cross-entropy = entropy + KL divergence: H(p, q) = H(p) + D_KL(p, q) (see the sketch below)
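
A sketch of the divergence bullets with illustrative surface proportions (Earth roughly 70% water, Mars roughly 1%): the divergence is asymmetric, and cross-entropy decomposes into entropy plus divergence.

  import math

  def entropy(p):
      return -sum(pi * math.log(pi) for pi in p)

  def kl(p, q):
      # average difference in log probability between target p and model q
      return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))

  earth = [0.7, 0.3]   # (water, land), illustrative
  mars = [0.01, 0.99]

  print(round(kl(earth, mars), 2))  # 2.62: using Mars to predict Earth
  print(round(kl(mars, earth), 2))  # 1.14: using Earth to predict Mars
  cross = -sum(pi * math.log(qi) for pi, qi in zip(earth, mars))
  assert abs(cross - (entropy(earth) + kl(earth, mars))) < 1e-12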

Mutual Information

  • Youtube vid: https://www.youtube.com/watch?v=U9h1xkNELvY
  • MI concerns the outcome of two random variables
  • MI measures the reduction in uncertainty for predicting part of a system's outcome after we observe the outcome of the other parts of the system.
  • If we know the value of one of the random variables in a system, there is a corresponding reduction in uncertainty for predicting the other one
  • MI measures that reduction in uncertainty
  • Entropy = ideal measure of uncertainty in our system
  • Entropy = a measure of information content of some random process
  • Entropy = how much information do we gain by knowing the outcome of some process
  • For two discrete processes:
    • I(X;Y) = sum over x, y of Pr(x,y) log( Pr(x,y) / (Pr(x) Pr(y)) ) - the joint distribution divided through by the product of the marginal distributions inside the log
    • If we have two continuous processes, both of the sums become integrals
    • If X and Y are independent, then Pr(X,Y) simplifies to Pr(X)*Pr(Y). The term inside the log becomes 1, and the log of 1 is zero, so the mutual information is zero for independent random variables
      • The outcome of one variable tells us nothing about the outcome of another variable.
      • There's no reduction in uncertainty in the system for var X for knowing the outcome of var Y
  • For the completely dependent case
    • The reduction of uncertainty of one of the variables is equal to its marginal uncertainty
    • E.g., equal to one bit for fair-coin variables where one determines the other (see the sketch below)
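
A sketch of the two cases just described, using fair-coin toy distributions: mutual information is zero when the variables are independent, and equals the one-bit marginal uncertainty when one variable fully determines the other.

  import math

  def mutual_information(joint):
      # I(X;Y) = sum_{x,y} Pr(x,y) * log2( Pr(x,y) / (Pr(x) * Pr(y)) )
      px, py = {}, {}
      for (x, y), p in joint.items():
          px[x] = px.get(x, 0) + p
          py[y] = py.get(y, 0) + p
      return sum(p * math.log2(p / (px[x] * py[y]))
                 for (x, y), p in joint.items() if p > 0)

  independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
  dependent = {(0, 0): 0.5, (1, 1): 0.5}   # Y always copies X

  print(mutual_information(independent))   # 0.0
  print(mutual_information(dependent))     # 1.0 bit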