Comments on "When Bayesian Inference Shatters"?
post by Crystalist · 2015-01-07T22:56:55.393Z · LW · GW · Legacy · 31 comments
I recently ran across this post, which gives a lighter discussion of a recent paper on Bayesian inference ("On the Brittleness of Bayesian Inference"). I don't understand it, but I'd like to, and it seems like the sort of paper other people here might enjoy discussing.
I am not a statistician, and this summary is based on the blog post (I haven't had time to read the paper yet), so please discount it accordingly. It looks like the paper focuses on the effects of priors and underlying models on the posterior distribution. Given a continuous distribution (or a discrete approximation of one) to be estimated from finite observations (of sufficiently high precision), and finite priors, the range of possible posterior estimates spans the whole range of the distribution to be estimated. Given models that are arbitrarily close (I'm not familiar with the total variation metric, but my impression is that, at finite accuracy, they produce the same observations with arbitrarily similar probability), you can get posterior estimates that are arbitrarily distant (within the range of the distribution to be estimated) given the same information. My impression is that implicitly relying on arbitrary precision of a prior can give updates that are diametrically opposed to the ones you'd get with different, but arbitrarily similar priors.
First, of course, I want to know whether my summary is accurate, misses the point, or is simply wrong.
Second, I'd be interested in hearing discussions of the paper in general and whether it might have any immediate impact on practical applications.
Some other areas of discussion that would interest me: I'm not entirely sure what 'sufficiently high precision' means here, and I have only a vague idea of the circumstances in which you'd be implicitly relying on the arbitrary precision of a prior. I'm also just generally interested in hearing what people more experienced/intelligent than I am might have to say.
31 comments
Comments sorted by top scores.
comment by IlyaShpitser · 2015-01-08T01:44:55.125Z · LW(p) · GW(p)
This is an interesting paper, thanks for linking it (posting here to remind myself to read it carefully later).
Replies from: shminux
↑ comment by Shmi (shminux) · 2015-01-08T04:14:50.567Z · LW(p) · GW(p)
Please post your expert opinion once you do.
comment by Richard_Kennaway · 2015-01-08T22:12:19.891Z · LW(p) · GW(p)
For those wanting to see the proofs of the authors' theorems, they are in this other paper of theirs.
My immediate reaction to the notched distributions that they use to exemplify their results is that it's cheating -- as indeed everyone says, including the authors. The priors giving pathological posteriors are chosen in response to the data. Any measure of closeness that puts these distributions close to the un-notched distribution is a silly measure of their suitability as priors. However, I don't have a mathematical expression of what the right measure would be, and no-one I've seen commenting has explicitly set out a reason for dismissing these "posterior priors", although Entsophy of course does dismiss them on the blog page Cyan linked. (Whatever happened to Entsophy, BTW?) In fact, the authors defend these priors against the charge of disreputability, arguing that varying the priors in response to the data is exactly what is done in Bayesian sensitivity analysis.
If instead of illustrating their theorems, I imagine a real-world scenario of someone using one of these notched distributions, I get something like this:
I pray to God to show himself by a miracle, then use a quantum mechanical device to generate a string of one million random digits R. I look at these digits and construct a prior such that P(God|R) is high, while P(God|R') is low for all other R'. This prior is very close by various measures to the one that assigns uniformly low probability to the existence of God whatever string of digits I got.
Something is going wrong here, but I don't think it's Bayesian inference. Russell's teapot seems relevant.
The authors write here:
We do not think that this is the end of Bayesian Inference
but go on to argue that it should always be subservient to non-Bayesian reasoning.
comment by XFrequentist · 2015-01-08T05:18:25.802Z · LW(p) · GW(p)
I call forth the mighty Cyan!
Replies from: Cyan
↑ comment by Cyan · 2015-01-08T18:05:09.999Z · LW(p) · GW(p)
I like it when I can just point folks to something I've already written.
The upshot is that there are two things going on here that interact to produce the shattering phenomenon. First, the notion of closeness permits some very pathological models to be considered close to sensible models. Second, the optimization to find the worst-case model close to the assumed model is done in a post-data way, not in prior expectation. So what you get is this: for any possible observed data and any model, there is a model "close" to the assumed one that predicts absolute disaster (or any result) just for that specific data set, and is otherwise well-behaved.
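Here's a minimal numerical sketch of that mechanism (my own toy construction, not the paper's; the rogue hypothesis, the 0.999 figure, and all numbers are invented for illustration):

    import numpy as np

    # Toy version of the post-data "worst-case model" trick: a rogue
    # hypothesis chosen after seeing the data, which predicts exactly the
    # observed sequence and asserts an extreme parameter value.
    rng = np.random.default_rng(0)
    flips = rng.random(100) < 0.6         # observed data: 100 coin flips
    heads = int(flips.sum())

    thetas = np.linspace(0.01, 0.99, 99)  # honest hypotheses: coin bias
    lik = thetas**heads * (1 - thetas)**(100 - heads)  # P(this exact sequence | theta)
    prior = np.full(99, 1.0 / 99)

    # Model A: the honest prior alone.
    post_a = prior * lik / (prior * lik).sum()
    mean_a = (thetas * post_a).sum()

    # Model B: shave eps of prior mass onto the rogue hypothesis, which
    # assigns this exact sequence probability 1 and claims the bias is 0.999.
    eps = 1e-6
    honest = (1 - eps) * prior * lik
    z = honest.sum() + eps * 1.0
    mean_b = ((thetas * honest).sum() + 0.999 * eps) / z

    print(mean_a)  # ~0.6: a sensible estimate of the bias
    print(mean_b)  # ~0.999: the rogue hypothesis swamps everything

The two priors differ by only eps = 1e-6 in total variation, yet conditioning on the same 100 flips lands them at opposite ends of the parameter range, because the honest likelihood of any specific 100-flip sequence is astronomically smaller than the rogue model's likelihood of 1.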
As the authors themselves put it:
The mechanism causing this “brittleness” has its origin in the fact that, in classical Bayesian Sensitivity Analysis, optimal bounds on posterior values are computed after the observation of the specific value of the data, and that the probability of observing the data under some feasible prior may be arbitrarily small... This data dependence of worst priors is inherent to this classical framework and the resulting brittleness under finite-information can be seen as an extreme occurrence of the dilation phenomenon (the fact that optimal bounds on prior values may become less precise after conditioning) observed in classical robust Bayesian inference.
Replies from: IlyaShpitser
↑ comment by IlyaShpitser · 2015-01-08T18:55:30.315Z · LW(p) · GW(p)
Thanks for your link!
comment by Shmi (shminux) · 2015-01-07T23:45:31.868Z · LW(p) · GW(p)
I don't know much about Bayesian inference, but I am familiar with the well-posedness problem the paper seems to allude to. The authors seem to claim that in the continuous limit the inference problem is ill-posed; specifically, that the solution's behavior does not change continuously with the initial conditions. "Chaotic" is the corresponding popular meme. If true, it means that the continuous version of Aumann's agreement theorem is unstable: a tiny difference in priors may result in a complete disagreement. Which is very interesting and has direct applications to FAI research.
EDIT: the relevant quote (emphasis mine):
This brittleness persists beyond the discretization of continuous systems and suggests that Bayesian inference is generically ill-posed in the sense of Hadamard when applied to such systems: if closeness is defined in terms of the total variation metric or the matching of a finite system of moments, then (1) two practitioners who use arbitrarily close models and observe the same (possibly arbitrarily large amount of) data may reach diametrically opposite conclusions; and (2) any given prior and model can be slightly perturbed to achieve any desired posterior conclusions.
Replies from: JoshuaZ, DanielLC
↑ comment by JoshuaZ · 2015-01-07T23:48:07.038Z · LW(p) · GW(p)
Not just to FAI research. It also brings into question the entire idea that Bayesianism is the correct basis for epistemology.
Replies from: shminux
↑ comment by Shmi (shminux) · 2015-01-07T23:53:42.806Z · LW(p) · GW(p)
It would be interesting to see if Bayesian inference can be regularized, by imposing something like "meta-priors".
↑ comment by DanielLC · 2015-01-08T04:05:02.226Z · LW(p) · GW(p)
That can't be right. Regardless of priors, they'd agree on evidence ratios. If their priors are similar, then their evidence ratios are near unity, so their posteriors must also be similar.
Replies from: shminux
↑ comment by Shmi (shminux) · 2015-01-08T04:08:16.341Z · LW(p) · GW(p)
Consider reading the paper.
Replies from: DanielLC
↑ comment by DanielLC · 2015-01-08T04:17:56.853Z · LW(p) · GW(p)
From skimming the paper, it looks like the issue is how they're defining closeness of models. They consider a fair coin and a coin that lands on heads 51% of the time to be close, even though the prior probability of a billion consecutive heads is very different under each of those models. I would consider those two models to be distant, perhaps infinitely so. One assumes that the coin is fair, and the other assumes that it is not. Close models would give similar probabilities to fair coins.
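To put rough numbers on just how differently those two models treat that event (my arithmetic, not from the paper):

    import math

    # Likelihood ratio for a billion consecutive heads under a 51% coin
    # versus a fair coin.
    n = 10**9
    log10_ratio = n * (math.log10(0.51) - math.log10(0.50))
    print(f"ratio ~ 10^{log10_ratio:.0f}")  # ~10^8600000

So the "close" models disagree about this one event by a factor of roughly ten to the 8.6 million, which is the intuition behind calling them distant.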
Replies from: roystgnr, shminux
↑ comment by roystgnr · 2015-01-08T16:00:14.713Z · LW(p) · GW(p)
Not sure why you're being downvoted; the metric used to define "similar" or "closeness" is absolutely what's at issue here. Their choice of metric doesn't care very much about falsely assigning a probability of zero to a hypothesis, and Bayesian inference does care very much about whether you falsely assign a probability of zero to a hypothesis.
That being said, I won't consider this a complete rebuttal until I see someone listing metrics under which Bayesian inference is well-posed and we can see if any of them are useful. Energy distance is a nice one for practical reasons, for example; does it also play well with Bayes?
Replies from: alienist, DanielLC, Richard_Kennaway
↑ comment by alienist · 2015-01-09T02:38:10.613Z · LW(p) · GW(p)
Not sure why you're being downvoted; the metric used to define "similar" or "closeness" is absolutely what's at issue here.
Any metric whereby a 51% coin isn't close to a fair coin is useless in practice.
Replies from: roystgnr
↑ comment by roystgnr · 2015-01-12T17:15:01.459Z · LW(p) · GW(p)
I don't understand you. Neither "a 51% coin" nor "a fair coin" is a probability distribution, and the choice of metric in question is a metric on spaces of probability distributions. Could you clarify?
Although, I could take your statement at face value, too. Want to make a few million $1 bets with me? We'll either be using "rand < .5" or "rand < .51" to decide when I win; since trying to distinguish between the two is useless, you don't need to bother.
Replies from: Lumifer
↑ comment by Lumifer · 2015-01-12T17:21:16.357Z · LW(p) · GW(p)
Neither "a 51% percent coin" nor "a fair coin" are probability distributions
Of course they are, they represent Bernoulli distributions.
Replies from: roystgnr
↑ comment by roystgnr · 2015-01-13T14:04:06.716Z · LW(p) · GW(p)
You could call them Bernoulli distributions representing aleatory uncertainty on a single coin flip, I suppose. Bayesian updates of purely aleatory uncertainty aren't very interesting, though, are they? Your evidence is "I looked at it, it's heads", and your posterior is "It was heads that time".
I suppose you could add some uncertainty to the evidence; maybe we're looking at a coin flip through a blurry telescope? But in any case, Bernoulli distributions live in a finite-dimensional space of probability distributions, so Bayesian updates on them are still well-posed. The concern here is that infinite-dimensional spaces of probability distributions don't always lead to well-posed Bayesian updates, depending on what metric you use to define well-posedness. If there's also a concern that this can happen with Bernoulli distributions then I'd like to see an example; if not, then that's a red herring.
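As a quick contrast (my own toy example, not from the thread or the paper), here is the conjugate Beta-Bernoulli update, where the well-posedness of the finite-dimensional case is easy to see:

    # Prior Beta(a, b) plus h heads and t tails gives posterior
    # Beta(a + h, b + t), with mean (a + h) / (a + b + h + t).
    def posterior_mean(a, b, h, t):
        return (a + h) / (a + b + h + t)

    h, t = 60, 40
    print(posterior_mean(1.0, 1.0, h, t))  # flat prior:               ~0.5980
    print(posterior_mean(1.1, 1.0, h, t))  # slightly perturbed prior: ~0.5984

A small perturbation of the prior parameters moves the posterior mean only slightly, and less and less as data accumulates.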
Replies from: roystgnr, Lumifer
↑ comment by DanielLC · 2015-01-08T23:25:57.685Z · LW(p) · GW(p)
The maximum of the absolute value of the log of the ratio between the probabilities that each prior assigns to a given hypothesis. That is, the log of the largest possible odds ratio of a piece of evidence that would take you from one prior to the other.
Replies from: Richard_Kennaway
↑ comment by Richard_Kennaway · 2015-01-09T13:31:06.410Z · LW(p) · GW(p)
I'm unclear on your terminology. I take a prior to be a distribution over distributions; in practice, usually a distribution over the parameters of a parameterised family. Let P1 and P2 be two priors of this sort, distributions over some parameter space Q. Write P1(q) for the probability density at q, and P1(x|q) for the probability density at x for parameter q. x varies over the data space X.
Is the distance measure you are proposing max_{q in Q} abs log( P1(q) / P2(q) )?
Or is it max_{q in Q,x in X} abs log( P1(x|q) / P2(x|q) )?
Or max_{q in Q,x in X} abs log( (P1(q)P1(x|q)) / (P2(q)P2(x|q)) )?
Or something else?
Replies from: DanielLC
↑ comment by DanielLC · 2015-01-09T20:48:04.053Z · LW(p) · GW(p)
A distribution over distributions just becomes a distribution: P(x) = integral_q P(x|q) P(q) dq. The distance I'm proposing is max_x abs log(P1(x) / P2(x)) = max_x abs( log(integral_q P1(x|q) P1(q) dq) - log(integral_q P2(x|q) P2(q) dq) ).
I think it might be possible to make this better. If Alice and Bob both agree that x is unlikely, then disagreeing about its probability seems like less of a problem. For example, if Alice thinks it's one-in-a-million and Bob thinks it's one-in-a-billion, then Alice would need thousand-to-one evidence to believe what Bob believes, which means that piece of evidence has a one-in-a-thousand chance of occurring; but since it only has a one-in-a-million chance of being needed, that doesn't matter much. It seems like it would only make a one-in-a-thousand difference. Done this way, the metric would need to be additive, but the distance is still at most the metric I just gave.
The metric for this would be:
integral_x log( max(P1(x), P2(x)) * max(P1(x)/P2(x), P2(x)/P1(x)) ) dx
= integral_x log( max(P1(x)^2 / P2(x), P2(x)^2 / P1(x)) ) dx
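To make the first metric concrete, here's a sketch with toy numbers (my own, for illustration):

    import math

    def max_log_ratio(p1, p2):
        # max_x |log(P1(x) / P2(x))|.  Infinite if one distribution assigns
        # zero probability where the other doesn't, which is how metrics of
        # this family end up calling a fair coin and a 51% coin distant.
        return max(abs(math.log(a / b)) for a, b in zip(p1, p2))

    # For a single flip the two coins look close:
    print(max_log_ratio([0.5, 0.5], [0.51, 0.49]))  # ~0.02

    # But over length-n sequences the per-flip gaps accumulate; for a
    # billion flips, the all-tails sequence alone gives
    n = 10**9
    print(n * abs(math.log(0.5 / 0.49)))  # ~2e7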
↑ comment by Richard_Kennaway · 2015-01-08T21:28:38.657Z · LW(p) · GW(p)
I think energy distance doesn't work: the "notched" distributions that their work uses lie close to the original distribution in that distance, as they do in total variation and Prokhorov distance. I am guessing that Kullback-Leibler doesn't work either, provided the notches don't go all the way to zero: you just make the notch low enough to get a high probability for the desired posterior, then make it narrow enough to bring the KL divergence as low as you want.
If it is assumed that the observations are only made to finite precision (e.g. each observation takes the form of a probability distribution of entropy bounded from below), it's not clear to me what happens to their results. In terms of their examples, these depend on being able to narrow the notch arbitrarily and still contain the observed data with certainty. That can't be done if the data are only known to bounded precision.
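A quick numeric check of the KL guess (my own toy construction: a uniform density on [0,1], notched down by a factor of a million over a window of width w, with the notch kept strictly above zero):

    import math

    def kl_uniform_vs_notched(w, depth, n=100_000):
        # P: uniform on an n-point grid.  Q: the same, but scaled by `depth`
        # on the first fraction w of the grid, then renormalized.
        p = 1.0 / n
        q_raw = [depth if i < int(w * n) else 1.0 for i in range(n)]
        z = sum(q_raw)
        return sum(p * math.log(p / (v / z)) for v in q_raw)

    for w in (0.1, 0.01, 0.001):
        print(w, kl_uniform_vs_notched(w, depth=1e-6))
    # KL shrinks roughly in proportion to w: ~1.3, ~0.13, ~0.013

So, at least in this toy case, narrowing the notch does drive the KL divergence to zero while the notch depth stays fixed, consistent with the guess above.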
↑ comment by Shmi (shminux) · 2015-01-08T04:21:27.001Z · LW(p) · GW(p)
No coin is 100% fair.
comment by Richard_Kennaway · 2015-01-09T19:15:57.290Z · LW(p) · GW(p)
My impression is that implicitly relying on arbitrary precision of a prior can give updates that are diametrically opposed to the ones you'd get with different, but arbitrarily similar priors.
I'm not sure what the "precision of a prior" means. A prior is an expression of the knowledge you have before obtaining the data. It is not something that is a measurement of something else, which it could be a more or less precise measurement of.
Has anyone produced a scenario in which the brittleness phenomenon arises in realistic practice?
Replies from: Anders_H
↑ comment by Anders_H · 2015-01-12T22:10:25.966Z · LW(p) · GW(p)
Precision is the reciprocal of the variance. In other words, you can use it as a measure of spread. If you are relatively certain that the true value of a parameter is in a narrow range, your prior will have low variance / high precision. If you think the true value may lie in a broader range, you have high variance / low precision.
comment by SanguineEmpiricist · 2015-01-08T20:19:48.728Z · LW(p) · GW(p)
I refuse to entertain the idea that this field should be called "Bayesian" and not what it is: inverse probability. Bayesians' egos are overinflated on their own hot air. Kolmogorov comes first.
Protocols of Anti-Bayes:
http://vserver1.cscs.lsa.umich.edu/~crshalizi/weblog/601.html
http://projecteuclid.org/euclid.aos/1176349830
http://vserver1.cscs.lsa.umich.edu/~crshalizi/notebooks/bayesian-consistency.html
http://link.springer.com/article/10.1007%2FBF00535479
and how could I forget this masterpiece?
http://repository.cmu.edu/cgi/viewcontent.cgi?article=1382&context=philosophy
Replies from: Richard_Kennaway
↑ comment by Richard_Kennaway · 2015-01-09T13:15:19.850Z · LW(p) · GW(p)
What do you think of this and this?
Replies from: SanguineEmpiricist
↑ comment by SanguineEmpiricist · 2015-01-09T23:03:59.246Z · LW(p) · GW(p)
Yes, I have read those. The Judea Pearl one is part of my standard showpiece. I love Gelman's blog; see Mayo's older book, or perhaps her more recent one, for further criticisms. Gelman has collaborated with Mayo and Cosma.