Why (and why not) Bayesian Updating?

wei-dai

Why (and why not) Bayesian Updating?

post by Wei Dai (Wei_Dai) · 2009-11-16T21:27:43.855Z · LW · GW · Legacy · 26 comments

26 comments

the use of Bayesian belief updating with expected utility maximization may be just an approximation that is only relevant in special situations which meet certain independence assumptions around the agent's actions.

— Steve Rayhawk

For those who aren't sure of the need for an updateless decision theory, the paper Revisiting Savage in a conditional world by Paolo Ghirardato might help convince you. (Although that's probably not the intention of the author!) The paper gives a set of 7 axioms, based on Savage's axioms, which is necessary and sufficient for an agent's preferences in a dynamic decision problem to be represented as expected utility maximization with Bayesian belief updating. This helps us see in exactly which situations Bayesian updating works and why. (In many other axiomatizations of decision theory, the updating part is left out, and only expected utility maximization is derived in a static setting.)

A key assumption is Axiom 7, which the author calls "Consequentialism". I won't try to reproduce the mathematical notation here (see the page numbered 88 in this ungated PDF), but here's the informal explanation given in the paper:

This axiom says that the preference conditional on non-null A should not depend
on how the strategy f behaves in the counterfactual states of A^c (in other words,
it should only depend on the truncation f|_A).

This axiom is clearly violated in Vladmir Nesov's Counterfactual Mugging counter-example to Bayesian updating.

Another example that I used to motivate UDT involves indexical uncertainty. In Ghirardato's framework it's relatively easy to see what goes wrong when we try to apply it to indexical uncertainty. In that case, "states" in the formalism would have to be centered possible worlds, in other words an ordinary world-state plus a location. But if A above is a set of centered possible worlds, then after learning A, your preferences can still depend on how strategies behave in A^C since elements of A and A^C may belong to the same possible world.

If there is demand, I can try to give an informal/intuitive explanation of why Bayesian updating works (in the situations where it does). I was about to attempt that when I decided to do a Google search and found this paper.

P.S., I noticed a curiosity about Bayesian updating while thinking about it in the context of decision theory, and this seems like a good opportunity to point it out. In Ghirardato's decision theory, after learning A, you should use P_A to compute expected utilities, where P_A(x) is the conditional probability of x given A, or P(A ∩ x)/P(A). This apparently shows the relevance of Bayesian updating, but we get an equivalent theory if we instead define P_A(x) as the joint probability of A and x, or just P(A ∩ x). (Because when you compute the expected utilities of two choices f and g, upon learning A, the factor 1/P(A) enters into both computations the same way and can be removed without changing relative rankings.) The division by P(A) in the original definition seems to serve no purpose except to make P_A sum to 1.

So, theoretically, we don't need Bayesian updating even if our preferences do satisfy the Ghirardato axioms. We could use a decision procedure where our beliefs about something can only get weaker, and never any stronger, no matter what evidence we see, and that would be equivalent. Since that seems to be computationally cheaper (by avoiding the division operation), why do our beliefs not actually work like that?

26 comments

Comments sorted by top scores.

comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2009-11-16T21:43:59.478Z · LW(p) · GW(p)

We could use a decision procedure where our beliefs about something can only get weaker, and never any stronger, no matter what evidence we see, and that would be equivalent. Since that seems to be computationally cheaper (by avoiding the division operation), why do our beliefs not actually work like that?

1) Because over a lifetime of this, we would rapidly fall off the precision that can be directly represented by analogue neurons. This takes floating-point math with large exponents.

2) Because it is convenient to cache some mental quantities around the standard probability - e.g., having emotional strength go as the absolute probability avoids the need to do lots of comparisons each time.

3) Because evolution isn't that clever, of course!

Replies from: Wei_Dai

↑ comment by Wei Dai (Wei_Dai) · 2009-11-17T01:11:29.275Z · LW(p) · GW(p)

4) With Bayesian updating, you only have to update the beliefs that are correlated with the evidence you observe. Beliefs that are independent of the evidence can stay constant. Trying to "save" on the division means you have to update just about every belief upon every new observation, which ends up being much more costly.

Oh well, it was fun trying to imagine being a mind whose beliefs only get weaker with time. :)

Replies from: Tyrrell_McAllister

↑ comment by Tyrrell_McAllister · 2009-11-17T01:50:33.717Z · LW(p) · GW(p)

With Bayesian updating, you only have to update the beliefs that are correlated with the evidence you observe.

What does "correlated" mean when talking about alternative kinds of updating?

ETA: And how would you know whether the belief and the evidence are correlated without performing the updating to check?

Replies from: Wei_Dai

↑ comment by Wei Dai (Wei_Dai) · 2009-11-17T10:37:47.123Z · LW(p) · GW(p)

A and B are correlated if P(A ∩ B) != P(A) * P(B).

The idea is that you'd represent the prior using a data structure which allows you to easily determine which beliefs are correlated with a given evidence. I'm not an expert here, but I think this is what Bayesian networks are all about.

Replies from: SilasBarta, MichaelBishop

↑ comment by SilasBarta · 2009-11-17T18:18:23.519Z · LW(p) · GW(p)

A and B are correlated if P(A ∩ B) != P(A) * P(B).

Careful, though. That's the definition of when there's mutual information, and the term "correlated" can also be used to mean a "linear statistical correlation", which is not the same thing.

And before you roll your eyes, note that this entire LW article is based on equivocating between the two meanings! (See my comment. )

Replies from: Richard_Kennaway

↑ comment by Richard_Kennaway · 2009-11-19T23:34:48.473Z · LW(p) · GW(p)

No. The data in the scatter-plot in that article contains no mutual information between the variables A and B, not merely zero product-moment correlation. I linked there to the data that are plotted; anyone is welcome to have a go at finding mutual information in them.

I challenge anyone to analyse these data and demonstrate substantial mutual information between A and B. If the data are insufficient for your favorite method of analysis, I can generate arbitrarily large quantities of it, and if I were using a quantum RNG instead of a PRNG, there would be absolutely no way to determine any connection between the two variables.

Despite that, there is one. It only shows up when the process from which these data are taken is sampled on a sufficiently short timescale, as in the other data file I linked to in that post.

Replies from: RobinZ, SilasBarta

↑ comment by RobinZ · 2009-11-19T23:46:38.950Z · LW(p) · GW(p)

Correct me if I'm wrong, but would the actual measure of the connection between A and B be more accurately summarized as K(A + B) < K(A) + K(B), then?

Replies from: SilasBarta

↑ comment by SilasBarta · 2009-11-20T16:06:02.022Z · LW(p) · GW(p)

I believe that's an equivalent way to express "H(X) - H(X|Y) > 0" and "P(A ∩ B) != P(A) * P(B)". Or at least, any one of the three can be derived from any of the others.

Note that the Kullback-Leibler divergence (a generalization of entropy) between X and Y is equivalent to the number of extra bits required to code data sampled from X when your compression algorithm is optimized for Y, which shows how these all relate.

↑ comment by SilasBarta · 2009-11-19T23:40:44.718Z · LW(p) · GW(p)

If you separate out the variables into simultaneous pairs, then yes, you've destroyed the mutual information.

But if someone is allowed to look at the timewise development of each variable, they would see the mutual information, which necessarily results from one causing the other! If A causes B, then, by knowing A, you require less data to describe B (than if you did not know or could not reference A). That's the very definition of mutual information.

You can't just say that because the simultaneous pairs are uncorrelated, there is no mutual information between the variables. You showed as much when you later demonstrated that the simultaneous pairs between a function and its derivative are uncorrelated. But who denies that learning a function tells you something about its derivative? (which would mean there's mutual information between the two...)

Replies from: Richard_Kennaway, Cyan

↑ comment by Richard_Kennaway · 2009-11-20T00:02:38.464Z · LW(p) · GW(p)

You're heading towards redefining correlation to mean causal connection.

When people actually do causal analysis (example, example, example) they perform specific calculations to detect various relationships among the variables. There are many different calculations they may do, which is the point of the first of those references, but we are not talking -- at least, I'm not -- about uncomputable Kolmogorov-based concepts (and even by that standard, the first data file, were it to use a QRNG, contains no mutual information). The moral that I was drawing was a practical one: certain very simple relationships between physical variables can generate time series containing no mutual information detectable by any of these methods. This suggests a substantial limitation of their practical applicability.

But who denies that learning a function tells you something about its derivative? (which would mean there's mutual information between the two...)

Specifics, please. Given the actual dynamical process generating those data (which is that B is the derivative of A, and A is a smoothly varying random variable), show me a mathematical definition of the mutual information between A and B, and a method of calculating it.

Replies from: SilasBarta, Cyan

↑ comment by SilasBarta · 2009-11-20T15:56:49.415Z · LW(p) · GW(p)

You're heading towards redefining correlation to mean causal connection.

Nope. I'm pointing out that "correlated" can mean "there exists a linear statistical correlation" or "there exists mutual information" -- but whichever you use, you need to be consistent. And at no point did I say it meant causal connection -- I just noted that that's one way mutual information can develop.

The moral that I was drawing was a practical one: certain very simple relationships between physical variables can generate time series containing no mutual information detectable by any of these methods. This suggests a substantial limitation of their practical applicability.

What you showed is that there is more than one way for two variables to be mutually informative, and if you limit yourself to a linear statistical regression on the simultaneous pairs, you might not find the mutual information. So what? If you know more than just the unordered simultaneous pairs, use that knowledge!

But who denies that learning a function tells you something about its derivative? (which would mean there's mutual information between the two...)

Specifics, please.

Sure. Let's use your point about derivatives. I tell you sin(x) = 4/5. Have I told you something about cos(x)? (And no it doen't matter that the cosine can have two values; you've still learned something.)

I tell you f(x) = sin(x) + cos(x). Have I told you something about f ' (x)?

Replies from: Richard_Kennaway

↑ comment by Richard_Kennaway · 2009-11-30T13:20:16.525Z · LW(p) · GW(p)

Sure. Let's use your point about derivatives. I tell you sin(x) = 4/5. Have I told you something about cos(x)?

Yes.

I tell you f(x) = sin(x) + cos(x). Have I told you something about f ' (x)?

Yes.

But in real experiments, you're not given the underlying function, only observations of some of its values.

So, I tell you a time series for an unknown function f.

What have I told you about f'? What further information would you need to make a numerical calculation of the amount of information you now have about f'?

In the data file I originally linked to, there is not merely no linear relationship, but virtually no relationship whatsoever, discoverable by any means whatever, between the two columns, which tabulate f and f' for a certain stochastic function f. Mutual information, even in Kolmogorov heaven, is not present.

Replies from: SilasBarta

↑ comment by SilasBarta · 2009-11-30T15:38:14.056Z · LW(p) · GW(p)

But in real experiments, you're not given the underlying function, only observations of some of its values.

Yet you are given their time index values, meaning you have more than the unordered simultaneous pairs you presented in the example.

So, I tell you a time series for an unknown function f. What have I told you about f'? What further information would you need to make a numerical calculation of the amount of information you now have about f'?

You'd need to know the prior you have on the data. The KL divergence between your prior and the function tells you how much information you received by seeing the data. Typically, your probability distribution on the data shifts toward the function, while retaining a small probability mass on the function being very high frequency (higher than the Nyquist frequency on your sampling rate, for the pedants).

In the data file I originally linked to, there is not merely no linear relationship, but virtually no relationship whatsoever, discoverable by any means whatever, between the two columns, which tabulate f and f' for a certain stochastic function f. Mutual information, even in Kolmogorov heaven, is not present.

No means whatsoever? What about iterating through the possible causal maps and seeing which is consistent? (Probably not a good idea in the general case, but you only have a few variables here.)

Replies from: Richard_Kennaway

↑ comment by Richard_Kennaway · 2009-11-30T19:41:59.856Z · LW(p) · GW(p)

Yet you are given their time index values, meaning you have more than the unordered simultaneous pairs you presented in the example.

The example data (here they are again) is a time series, not a set of unordered pairs. (Time is proportional to line number.)

No means whatsoever?

None whatsoever (assuming the random noise that drives the process is truly random, or at least unknowable -- guessing the pseudo-RNG algorithm and its seed doesn't count).

Consider this challenge open to anyone who thinks that there is mutual information between the two columns: calculate it. Prove the validity of the computation by calculating information about the second column given only the first, for a new file generated by the same method.

↑ comment by Cyan · 2009-11-20T00:29:14.431Z · LW(p) · GW(p)

Mutual information is the difference of marginal and conditional entropy (eq 4 of this): I(X,Y) = H(X) - H(X|Y)

Suppose X is a deterministic function of Y (e.g., Y is a function sampled from a stochastic process and X is its derivative). Then P(X|Y) is a degenerate distribution and the conditional entropy H(X|Y) is 0. Hence Y is maximally informative about X.

Replies from: Steve_Rayhawk

↑ comment by Steve_Rayhawk · 2009-11-20T02:20:57.488Z · LW(p) · GW(p)

I think the words Richard used in his question denoted the mutual information between the functions A and B, but I think he meant to ask about the mutual information between two time series datasets sampled from A and B over the same interval.

Replies from: SilasBarta, Richard_Kennaway, Cyan

↑ comment by SilasBarta · 2009-11-20T15:50:53.147Z · LW(p) · GW(p)

And my point was that this is an irrelevant comparison. When you look at the data sets, you want to know if they are mutually informative (if learning one can tell you about the other). A linear statistical correlation -- which Kennaway showed is absent -- is one way that the datasets can be mutually informative, but it is not the only way.

If you know the ordered, timewise development of each variable, you have extra information to use. If you discard this knowledge of the time ordering, and are left with just simultaneous pairs (pairs of the form [A(t0),B(t0)] ) then yes, as Kennaway points out, you're hosed. So?

↑ comment by Richard_Kennaway · 2009-11-20T07:35:07.066Z · LW(p) · GW(p)

One could ask both questions, but as Cyan points out, if you know the function A of this example exactly, then you also know B exactly. What do you know about B, though, when you know A only approximately, for example, by sampling a time series? As the sample time increases beyond the autocorrelation time of A then the amount of information you get about B converges to zero, in the sense that given all of both series up to A(t) and B(t-1), the distribution of B(t) is almost identical to its unconditional distribution.

I'm sure there is a general technical definition, BTW, even though I haven't seen it. This is not a rhetorical question.

↑ comment by Cyan · 2009-11-20T03:28:29.075Z · LW(p) · GW(p)

My whole argument rests on a weaker reed than I first appreciated, because the definition of mutual information I linked is for univariate random variables. When I searched for a definition of mutual information for stochastic processes, all I could really find was various people writing that it was a generalization of mutual information for random variables in "the natural way". But the point you bring up is actually a step in the direction of a stronger argument, not a weaker one. Sampling the function to get a time series makes a vector-valued random variable out of a stochastic process, and numerical differentiation on that random vector is still deterministic. My argument then follows from the definition of multivariate mutual information.

Replies from: Richard_Kennaway

↑ comment by Richard_Kennaway · 2009-11-20T07:38:30.457Z · LW(p) · GW(p)

Sampling the function to get a time series makes a vector-valued random variable out of a stochastic process, and numerical differentiation on that random vector is still deterministic.

This is not correct. Given the vector of all values of A sampled at intervals dt, the derivative of that vector -- that is, the time series for B -- is not determined by the vector itself, only by the complete trajectory of A. The longer dt is, the less the vector tells you about B.

Replies from: Cyan

↑ comment by Cyan · 2009-11-20T12:24:51.295Z · LW(p) · GW(p)

True. I was also assuming that

the process from which these data are taken is sampled on a sufficiently short timescale, as in the other data file [you] linked to in that post.

↑ comment by Cyan · 2009-11-20T03:36:35.731Z · LW(p) · GW(p)

If A causes B...

No need to bring up causality. It's enough that knowledge of A specifies B too.

Replies from: SilasBarta

↑ comment by SilasBarta · 2009-11-20T16:57:26.286Z · LW(p) · GW(p)

Yes, that's correct. I only mentioned causality to make my comment relevant to the context Kennaway brought up.

↑ comment by Mike Bishop (MichaelBishop) · 2009-11-17T19:41:00.561Z · LW(p) · GW(p)

Considering all the different combinations of things you might condition on, the task does not sound trivial.

comment by ctwardy · 2009-11-17T12:04:30.347Z · LW(p) · GW(p)

Regarding superfluous division by P(A): it is omitted in some applications of Bayesian reasoning.

Search theory begins with a probability map and updates the map based on searching and not finding. The calculations are done unnormalized both for speed and to preserve information. If the map started with 10,000 points distributed over all the regions in the search area, it is useful to know that searching has reduced the total to 1,000. It suggests you have exhausted the area and should reconsider your hypothesis that the subject/object/target is in this area.

There are equivalent approaches using normalization, such as including "Rest of World" as an area. But unnormalized is faster.

comment by Douglas_Knight · 2009-11-17T01:12:46.020Z · LW(p) · GW(p)

We could use a decision procedure where our beliefs about something can only get weaker, and never any stronger, no matter what evidence we see

That is reminiscent of collapse vs many worlds. When you collapse, you rescale back to unity, but when worlds split, they're less real than their parent.

Why (and why not) Bayesian Updating?

Contents

26 comments