Bayes' rule =/= Bayesian inference

neq1

post by neq1 · 2010-09-16T06:34:08.815Z · LW · GW · Legacy · 70 comments

  Bayes' Theorem
  Diagnostic testing study
  Frequentist approach
  Bayesian approach
  Example that illustrates benefit of Bayesian data analysis
  Frequentist approach
  Bayesian approach
70 comments

(Bayes' theorem is something Bayesians need to use more often than Frequentists do, but Bayes' theorem itself isn't Bayesian. This post is meant to be a light introduction to the difference between Bayes' theorem and Bayesian data analysis.)

Bayes' Theorem

Bayes' theorem is just a way to get (e.g.) p(B|A) from p(A|B) and p(B). The classic example of Bayes' theorem is diagnostic testing. Suppose someone either has the disease (D⁺) or does not have the disease (D⁻) and either tests positive (T⁺) or tests negative (T⁻). If we knew the sensitivity P(T⁺|D⁺), specificity P(T⁻|D⁻) and disease prevalence P(D⁺), then we could get the positive predictive value P(D⁺|T⁺) using Bayes' theorem:

P(D⁺|T⁺) = P(T⁺|D⁺) P(D⁺) / [ P(T⁺|D⁺) P(D⁺) + (1 − P(T⁻|D⁻)) (1 − P(D⁺)) ]

For example, suppose we know that the sensitivity is 0.9, the specificity is 0.8, and the disease prevalence is 0.01. Then,

P(D⁺|T⁺) = (0.9 × 0.01) / (0.9 × 0.01 + 0.2 × 0.99) = 0.009 / 0.207 ≈ 0.043
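As a quick check on the arithmetic, here is a minimal Python sketch of the same calculation. The function name `positive_predictive_value` is just an illustrative choice, not anything from the post.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Return P(D+|T+) via Bayes' theorem."""
    true_pos = sensitivity * prevalence                    # P(T+|D+) * P(D+)
    false_pos = (1.0 - specificity) * (1.0 - prevalence)   # P(T+|D-) * P(D-)
    return true_pos / (true_pos + false_pos)


ppv = positive_predictive_value(sensitivity=0.9, specificity=0.8, prevalence=0.01)
print(round(ppv, 3))  # 0.043
```

Even with a fairly accurate test, the low prevalence means most positive results come from the large disease-free group.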

This answer is not Bayesian or frequentist; it's just correct.

Diagnostic testing study

Typically we will not know P(T⁺|D⁺) or P(T⁻|D⁻). We would consider these unknown parameters. Let's denote them by Θ_sens and Θ_spec. For simplicity, let's assume we know the disease prevalence P(D⁺) (we often have a lot of data on this).

Suppose 1000 subjects with the disease were tested, and 900 of them tested positive. Suppose 1000 disease-free subjects were tested and 200 of them tested positive. Finally, suppose 1% of the population has the disease.

Frequentist approach

Estimate the two parameters (sensitivity and specificity) using their sample values (sample proportions) and plug them into Bayes' formula above. This results in a point estimate for P(D⁺|T⁺) of 0.043. A standard error or confidence interval could be obtained using the delta method or bootstrapping.
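One way the delta-method interval might be computed is sketched below. This is an illustration under stated assumptions, not the author's own calculation: the two samples are treated as independent binomials, the prevalence is treated as known, and the function name `ppv_delta_ci` is hypothetical.

```python
import math


def ppv_delta_ci(pos_diseased, n_diseased, pos_healthy, n_healthy, prev, z=1.96):
    """Plug-in estimate of P(D+|T+) with a delta-method confidence interval."""
    sens = pos_diseased / n_diseased         # sample sensitivity
    spec = 1.0 - pos_healthy / n_healthy     # sample specificity
    tp = sens * prev                         # P(T+|D+) P(D+)
    fp = (1.0 - spec) * (1.0 - prev)         # P(T+|D-) P(D-)
    estimate = tp / (tp + fp)

    # Partial derivatives of tp / (tp + fp) with respect to sens and spec.
    d_sens = prev * fp / (tp + fp) ** 2
    d_spec = tp * (1.0 - prev) / (tp + fp) ** 2

    # Binomial sampling variances of the two sample proportions.
    var_sens = sens * (1.0 - sens) / n_diseased
    var_spec = spec * (1.0 - spec) / n_healthy

    se = math.sqrt(d_sens ** 2 * var_sens + d_spec ** 2 * var_spec)
    return estimate, estimate - z * se, estimate + z * se


est, lo, hi = ppv_delta_ci(900, 1000, 200, 1000, prev=0.01)
print(round(est, 3), round(lo, 3), round(hi, 3))
```

With the study data above this gives an interval of roughly 0.038 to 0.049 around the point estimate.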

Even though Bayes' theorem was used, this is not a Bayesian approach.

Bayesian approach

The Bayesian approach is to specify prior distributions for all unknowns. For example, we might specify independent uniform(0,1) priors for Θ_sens and Θ_spec. However, we should expect the test to do at least as good as guessing (guessing would mean randomly selecting 1% of people and calling them T⁺). In addition, we expect Θ_sens>1-Θ_spec. So, I might go with a Beta(4,2.5) distribution for Θ_sens and a Beta(2.5,4) distribution for Θ_spec.

Using these priors + the data yields a posterior distribution for P(D⁺|T⁺) with posterior median 0.043 and 95% credible interval (0.038, 0.049). In this case, the Bayesian and frequentist approaches have the same results (not surprising since the priors are relatively flat and there are a lot of data). However, the methodology is quite different.
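The post does not say how the posterior was computed. One simple route, sketched here as an assumption, uses the fact that a Beta prior with binomial data gives a Beta posterior: draw Θ_sens and Θ_spec from their posteriors and push each draw through Bayes' formula. The prior parameters are the ones stated above. The helper names, the seed, and the number of draws are illustrative.

```python
import random


def posterior_ppv_draws(n_draws=20000, prev=0.01, seed=0):
    """Monte Carlo draws from the posterior of P(D+|T+)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        # Beta(4, 2.5) prior + 900 positives, 100 negatives among the diseased.
        sens = rng.betavariate(4 + 900, 2.5 + 100)
        # Beta(2.5, 4) prior + 800 negatives, 200 positives among the disease-free.
        spec = rng.betavariate(2.5 + 800, 4 + 200)
        tp = sens * prev
        fp = (1.0 - spec) * (1.0 - prev)
        draws.append(tp / (tp + fp))
    draws.sort()
    return draws


def quantile(sorted_values, q):
    """Crude empirical quantile of an already sorted list."""
    return sorted_values[int(q * (len(sorted_values) - 1))]


draws = posterior_ppv_draws()
median = quantile(draws, 0.5)
lower, upper = quantile(draws, 0.025), quantile(draws, 0.975)
print(round(median, 3), round(lower, 3), round(upper, 3))
```

The unknowns are treated as random quantities with distributions, and the interval is read directly off the posterior draws rather than derived from a sampling distribution.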

Example that illustrates benefit of Bayesian data analysis

(example edited to focus on credible/confidence intervals)

Suppose someone shows you what looks like a fair coin (you confirm heads on one side, tails on the other) and makes the claim: "This coin will land with heads up 90% of the time."

Suppose the coin is flipped 5 times and lands with heads up 4 times.

Frequentist approach

"A 95% confidence interval for the Binomial parameter is (.38, .99) using the Agresti-Coull method." Because 0.9 is within the confidence limits, the usual conclusion would be that we do not have enough evidence to rule it out.
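For reference, here is a minimal sketch of the textbook Agresti-Coull interval (add z²/2 pseudo-successes and z²/2 pseudo-failures, then apply the usual normal-approximation interval). The function name is illustrative. For 4 heads in 5 flips this formula comes out near (0.36, 0.98) rather than exactly the quoted (.38, .99); the quoted figures may come from a different variant or software default, which is a guess on my part. The value 0.9 lies inside either interval.

```python
import math


def agresti_coull_interval(successes, n, z=1.959964):
    """Approximate 95% confidence interval for a binomial proportion."""
    n_adj = n + z * z                            # adjusted number of trials
    p_adj = (successes + z * z / 2.0) / n_adj    # adjusted proportion
    half_width = z * math.sqrt(p_adj * (1.0 - p_adj) / n_adj)
    # Clip to [0, 1], since a proportion cannot fall outside that range.
    return max(0.0, p_adj - half_width), min(1.0, p_adj + half_width)


ci_low, ci_high = agresti_coull_interval(successes=4, n=5)
print(round(ci_low, 2), round(ci_high, 2))
print(ci_low < 0.9 < ci_high)  # True: the 90% claim is not ruled out
```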

Bayesian approach

"I don't believe you. Based on experience and what I know about the laws of physics, I think it's very unlikely that your claim is accurate. I feel very confident that the probability is close to 0.5. However, I don't want to rule out something a little bit unusual (like a probability of 0.4). Thus, my prior for the probability of heads is a Beta(30,30) distribution."

After seeing the data, we update our belief about the binomial parameter. The 95% credible interval for it is (0.40, 0.64). Thus, a value of 0.9 is still considered extremely unlikely.
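The update here is the standard Beta-binomial one: a Beta(30,30) prior combined with 4 heads and 1 tail gives a Beta(34,31) posterior. The sketch below approximates its 95% credible interval by simulation using only the standard library. The seed and number of draws are arbitrary choices, and an exact quantile function would serve equally well.

```python
import random

rng = random.Random(1)

prior_a, prior_b = 30, 30   # Beta(30, 30) prior on the probability of heads
heads, tails = 4, 1         # observed flips

# Conjugate update: posterior is Beta(prior_a + heads, prior_b + tails).
draws = sorted(
    rng.betavariate(prior_a + heads, prior_b + tails) for _ in range(100000)
)

cred_low = draws[int(0.025 * (len(draws) - 1))]
cred_high = draws[int(0.975 * (len(draws) - 1))]
share_at_least_090 = sum(d >= 0.9 for d in draws) / len(draws)

print(round(cred_low, 2), round(cred_high, 2))
print(share_at_least_090)
```

Five flips barely move a prior worth about sixty pseudo-observations, which is why the interval stays centered near 0.5.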

This illustrates the idea that, from a Bayesian perspective, implausible claims require more evidence than plausible claims. Frequentists have no formal way of including that type of prior information.

70 comments

Comments sorted by top scores.

comment by datadataeverywhere · 2010-09-17T05:45:38.522Z · LW(p) · GW(p)

I see so much on the site about Bayesian probability. Much of my current work uses Dempster-Shafer theory, which I haven't seen mentioned here.

DST is a generalization of Bayesian probability, and both fuzzy logic and Bayesian inference can be perfectly derived from DST. The most obvious difference is that DST parameterizes confidence, so that a 0.5 prior with no support is treated differently than a 0.5 prior with good support. For my work, the more important aspect is that DST is more forgiving when my sensors lie to me; it handles conflicting evidence more gracefully, as long as its results are correctly interpreted (in my opinion they are less intuitive than strict probabilities).

comment by Perplexed · 2010-09-16T17:17:58.525Z · LW(p) · GW(p)

Joseph Felsenstein is a pioneer in the use of maximum likelihood methods in evolutionary biology. In his book, "Inferring Phylogenies", he has a chapter on Bayesian methods, and a section in that chapter on controversies over Bayesian inference. He discusses a toy example of a space probe to Mars which looks for little green men and doesn't find them. He wonders whether a scientist whose prior for little green men involved odds of 1/4, and who, based on the evidence of the space probe, now assigns odds of 1/12, should publish those revised odds. He writes:

It might be argued that the correct thing to do in such a case is to publish the likelihood ratio 1/3 and let the reader provide their own prior. This is the likelihoodist position. A Bayesian is defined, not by using a prior, but by being willing to use a controversial prior.

Felsenstein apparently defines himself as a "likelihoodist" rather than a "frequentist" or "Bayesian".
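The odds arithmetic in Felsenstein's example is just posterior odds = prior odds × likelihood ratio. A small sketch, with variable names of my own choosing:

```python
from fractions import Fraction


def odds_to_probability(odds):
    """Convert odds in favour into a probability."""
    return odds / (1 + odds)


prior_odds = Fraction(1, 4)        # the scientist's prior odds for little green men
likelihood_ratio = Fraction(1, 3)  # what the probe data contribute
posterior_odds = prior_odds * likelihood_ratio

print(posterior_odds)                       # 1/12
print(odds_to_probability(prior_odds))      # 1/5
print(odds_to_probability(posterior_odds))  # 1/13
```

A reader with different prior odds can reuse the published likelihood ratio of 1/3 and multiply it into their own prior.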

Replies from: Cyan, TobyBartels, Eliezer_Yudkowsky, JGWeissman

↑ comment by Cyan · 2010-09-16T21:07:24.493Z · LW(p) · GW(p)

"Likelihoodist" is so clunky and dull. I prefer "likelihoodlum" -- it's just as clunky, but at least it's somewhat inflammatory.

↑ comment by TobyBartels · 2010-09-16T18:31:54.178Z · LW(p) · GW(p)

There are two slightly different meanings of what it is to be a "Bayesian": philosophically, there is a Bayesian interpretation of probability theory, and practically, there are Bayesian methods in statistics. I see Felsenstein as saying that, even if one is a Bayesian philosophically, one ought to practise as a likelihoodist.

In original research, I agree; there is not much point in reporting posteriors. Certainly there's no point in reporting them without also reporting the original priors, but better just to report the likelihoods and let readers supply their own priors.

On the other hand, in summaries for a broad readership, the posteriors are the most important result to report. Now most readers don't have the expertise to bring their own priors, so you have to give them yours. And then do the calculation for them.

Replies from: orthonormal, Perplexed

↑ comment by orthonormal · 2010-09-16T23:57:08.039Z · LW(p) · GW(p)

On the other hand, in summaries for a broad readership, the posteriors are the most important result to report. Now most readers don't have the expertise to bring their own priors, so you have to give them yours. And then do the calculation for them.

Good point. It would be irresponsible to publish a news item that "the Prime Minister's support for this bill is three times more likely if he is, in fact, a lizard alien than if he is a human" without noting that the prior probability for him being a lizard alien is pretty low.

Replies from: pozorvlak

↑ comment by pozorvlak · 2011-06-10T10:39:51.664Z · LW(p) · GW(p)

And yet they do this all the frigging time in medical stories, as documented extensively on, for instance, Bad Science.

↑ comment by Perplexed · 2010-09-16T18:48:33.078Z · LW(p) · GW(p)

I'm curious, though, as to what all you are giving up by not talking about priors. In Felsenstein's field - roughly, constructing best estimates of the "tree of life" - you very frequently have prior information which you want to bring to the problem, but of course you don't want to bring in any prior information which is not neutral on the controversial issue that your study is supposed to shed light on.

One of the advantages of a Bayesian methodology is supposed to be the ability to combine information from sources with different qualities and coverages. To what extent are you prevented from doing that if you insist on doing all of your likelihood ratio work behind a "veil of ignorance"?

Replies from: TobyBartels

↑ comment by TobyBartels · 2010-09-16T19:43:14.729Z · LW(p) · GW(p)

you very frequently have prior information which you want to bring to the problem, but of course you don't want to bring in any prior information which is not neutral on the controversial issue that your study is supposed to shed light on

Well, let's be very explicit about that then. A good report will:

state the relevant priors used for everything that the study is not directly about but which are still relevant,
remain quiet about priors for what the study is directly about, giving likelihood ratios instead.

More mathematically, suppose that you make certain assumptions A which, in full completeness, are not just things like "I assume that a certain sample has been dated correctly." but "I put the following probability distribution on the dates of this sample." This is very lengthy, which is the inconvenient part; although if you make simplified assumptions for purposes of your calculations, then you would put simplified assumptions in your text too. So it shouldn't really be any more inconvenient than whatever goes into your analysis.

But what you are testing is not A, but some hypothesis H (that the ancestors of Homo and Pan split after they split from Gorilla, for example; notice that this only makes sense if A includes that these three genera are really clades and that evolution of these animals really is a branching tree, although these are pretty common assumptions). And you have some evidence E.

Then in addition to A (which goes into your introduction, or maybe your appendix; anyway, it's logically prior to examining E), you also report the likelihood ratio P(E|A&H)/P(E|A&!H), which goes into your conclusion. Then maybe you also state P(H|A) and calculate P(H|A&E), just in case people want to read about that, but that is not really your result.

↑ comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2010-09-16T21:11:37.289Z · LW(p) · GW(p)

"Maximum likelihood" totally != "report likelihood ratios".

Replies from: Perplexed

↑ comment by Perplexed · 2010-09-16T21:32:12.016Z · LW(p) · GW(p)

Yes, I know, as I'm sure does Felsenstein. The book covered much more than maximum likelihood. The recommendation to report likelihood ratios came in the first of two chapters on Bayesian methods. The second involved hidden Markov models.

The book begins (as does the field) with a tree-building method called 'maximum parsimony'. Maximum likelihood is a step up in sophistication from that, and Felsenstein is largely responsible for that step forward. I'm not really sure why he is not an enthusiastic Bayesian. My guess would be that it is because he is a professional statistician and the whole discipline of statistics traditionally consists of ways of drawing totally objective conclusions from data.

↑ comment by JGWeissman · 2010-09-16T17:33:00.573Z · LW(p) · GW(p)

This position has also been expressed here.

(Should this article be migrated from OB to LW?)

comment by Daniel_Burfoot · 2010-09-16T14:07:14.096Z · LW(p) · GW(p)

Good post. It might be worth mentioning, after noting the benefit of the Bayesian approach, the deep problem with it: the subjectivity of the prior. This can be illustrated by looking at one of the above paragraphs:

Based on experience and what I know about the laws of physics, I think it's very unlikely that your claim is accurate. I feel very confident that the probability is close to 0.5. However, I don't want to rule out something a little bit unusual (like a probability of 0.4).

It's fine for Bayesian A to choose a prior based on his experience, feelings, and intuitive reasoning. But it's equally reasonable for Bayesian B to assert a completely different prior for her own reasons. Then they will come to different conclusions based on identical evidence.

Replies from: jimrandomh

↑ comment by jimrandomh · 2010-09-16T15:28:25.829Z · LW(p) · GW(p)

It's fine for Bayesian A to choose a prior based on his experience, feelings, and intuitive reasoning. But it's equally reasonable for Bayesian B to assert a completely different prior for her own reasons. Then they will come to different conclusions based on identical evidence.

Except that they're actually using different evidence, because the intuitive feelings that go into a prior are, in fact, evidence. More specifically, the prior is the sum of the outputs of all the heuristics that aren't quite strong or reliable enough to reach conscious awareness, but which nevertheless do, and ought to, influence the probability judgment.

comment by Oscar_Cunningham · 2010-09-16T08:37:31.622Z · LW(p) · GW(p)

I like this post; there's still a lot of confusion around Bayesian methods.

Two things that would have helped me while I was learning Bayesianism were that:

The frequency with which a coin comes up heads isn't a probability, no matter how much it looks like one.*

and

Bayes theorem doesn't do parameter estimation.

I might write these into a post sometime.

*This is what's going wrong in the heads of people who say things like "The probability is either 1 or 0, but I don't know which."

Replies from: TobyBartels, cousin_it

↑ comment by TobyBartels · 2010-09-16T18:04:33.790Z · LW(p) · GW(p)

The frequency with which a coin comes up heads isn't a probability, no matter how much it looks like one.

Pedantry alert: This is not technically true, although it's still a very important point.

Every frequency is the probability of something; in this case, the frequency with which the coin comes up heads is the probability, given that you pick one of the times that the coin is flipped, that the coin comes up heads that time.

But this is not the same thing as the probability that the coin comes up heads the next time that you flip it, which is what you are more likely to be interested in (and which people are liable to uselessly claim is "either 1 or 0, but I don't know which").

Replies from: wedrifid, Oscar_Cunningham

↑ comment by wedrifid · 2010-09-16T18:37:44.736Z · LW(p) · GW(p)

Even more pedantic: It still isn't a probability.

There is a probability p(heads | you picked one of those coins) and it can be found by simply taking the frequency. But the frequency still doesn't mean the probability. In much the same way 5 / 100 balls in the jar being red isn't a probability. It is a curious fact about the colors of balls in the jar. p(ball is red | I take a ball from that jar) is a probability.

Replies from: TobyBartels

↑ comment by TobyBartels · 2010-09-16T18:46:08.517Z · LW(p) · GW(p)

I think that our pedantries are clashing on the word "is".

I'm thinking of both frequencies and probabilities as numbers, and using "is" between them if they are equal numbers. You are (I guess) thinking of frequencies and probabilities as things of different types, which are not numbers even though they may be measured by numbers.

Come to think of it, your interpretation is more pedantic than mine, so I concede.

Replies from: Oscar_Cunningham

↑ comment by Oscar_Cunningham · 2010-09-17T16:53:42.543Z · LW(p) · GW(p)

Thinking about it further, there is no probability which is even numerically equal to the frequency. Probabilities are subjective, you know them or can work them out in your head. But you don't know the frequency, so it can't be equal to any of the probabilities in your head (except by coincidence).

Replies from: TobyBartels

↑ comment by TobyBartels · 2010-09-17T23:55:43.039Z · LW(p) · GW(p)

I think that it's a mistake to reserve the term ‘probability’ for beliefs held by actual people (or other beings with beliefs). In fact, since actual people are subject to such pervasive epistemic biases (such as we try to overcome here), I doubt that anybody (even readers of Less Wrong) holds actual beliefs that obey the mathematical laws of probability.

I prefer to think of probability as the belief of an ideal rational being with given information / evidence / observations. (This makes me what they call an ‘objective Bayesian’, although really it just pushes the subjectivity back to the level of information.) So even if nobody knows the frequency with which a given coin comes up heads (which is certainly true if the coin is still around and may be flipped in the future), I can imagine a rational being who knows that frequency.

But in a post that was supposed to be pedantic, I was remiss in not specifying exactly what information the probability depends on!

↑ comment by Oscar_Cunningham · 2010-09-17T11:57:20.146Z · LW(p) · GW(p)

Thanks, this clears some things up for me.

Replies from: TobyBartels

↑ comment by TobyBartels · 2010-09-17T23:41:55.814Z · LW(p) · GW(p)

You're welcome!

↑ comment by cousin_it · 2010-09-16T09:44:20.326Z · LW(p) · GW(p)

I don't understand how you can hold a position like that and still enjoy the post. How do you parse the phrase "my prior for the probability of heads" in the second example?

Replies from: neq1, Oscar_Cunningham

↑ comment by neq1 · 2010-09-16T13:31:36.167Z · LW(p) · GW(p)

In the second example the person was speaking informally, but there is nothing wrong with specifying a probability distribution for an unknown parameter (and that parameter could be a probability for heads)

↑ comment by Oscar_Cunningham · 2010-09-16T11:28:11.901Z · LW(p) · GW(p)

I hadn't seen that, but you're right that that sentence is wrong. "Probability" should have been replaced with "frequency" or something. A prior on a probability would be a set of probabilities of probabilities, and would soon lead to infinite regress.

Replies from: neq1

↑ comment by neq1 · 2010-09-16T13:33:14.334Z · LW(p) · GW(p)

only if you keep specifying hyper-priors, which there is no reason to do

Replies from: Oscar_Cunningham

↑ comment by Oscar_Cunningham · 2010-09-16T14:50:26.870Z · LW(p) · GW(p)

Exactly. There's no point in the first meta-prior either.

comment by MC_Escherichia · 2010-09-16T10:55:34.542Z · LW(p) · GW(p)

If the null hypothesis was true, the probability that we would get 3 heads or less is 0.08

Is the idea that the coin will land heads 90% of the time really something that can be called the "null hypothesis"?

Replies from: Matt_Simpson, neq1

↑ comment by Matt_Simpson · 2010-09-16T15:33:09.971Z · LW(p) · GW(p)

Statistically, there's nothing wrong with the null hypothesis being p=0.9. It's probably not a test you would see very often in practice because usually there is nothing interesting about p=0.9. But if you wanted to test whether or not p=0.9 for some reason - any reason, setting the null hypothesis as p=0.9 is a perfectly valid (frequentist) way of doing it.

↑ comment by neq1 · 2010-09-16T12:07:51.799Z · LW(p) · GW(p)

Hm, good point. Since the usual thing is .5, the claim should be the alternative. I was thinking in terms of trying to reject their claim (which it wouldn't take much data to do), but I do think my setup was non-standard. I'll fix it later today

comment by [deleted] · 2010-09-16T07:48:58.602Z · LW(p) · GW(p)

I don't quite get how you got your distributions in the first example. Since getting the prior distributions seems to be where the magic happens, further explanation would be helpful.

comment by PhilGoetz · 2010-09-16T20:09:23.067Z · LW(p) · GW(p)

THANK YOU! That's the best explanation I've ever seen of the difference. I don't know if it's right; but at least it's making a coherent claim.

Can you spell out how the computation is done with the priors in the Bayesian case?

Quibble:

However, we should expect the test to do at least as good as guessing (guessing would mean randomly selecting 1% of people and calling them T+).

Guessing that everyone is T- would have a lower error rate.

Replies from: datadataeverywhere

↑ comment by datadataeverywhere · 2010-09-17T05:32:35.802Z · LW(p) · GW(p)

Guessing that everyone is T- results in a 100% false negative rate, which although not much better than a 99% false negative rate, might more than make up for a 1% decrease in the false positive rate.

If this is a real cancer test, and the researcher is optimizing a balance between false positives and false negatives, where would you prefer that he or she place that balance? A lot of medical tests have intentionally very low false negative rates even if that means they have proportionally much higher false positive rates (than they would if they were optimizing for a different balance).

comment by cousin_it · 2010-09-16T18:25:58.827Z · LW(p) · GW(p)

May be relevant here: my post and Wei Dai's post.

comment by CronoDAS · 2010-09-16T07:26:22.752Z · LW(p) · GW(p)

This illustrates the idea that, from a Bayesian perspective, implausible claims require more evidence than plausible claims. Frequentists have no formal way of including that type of prior information.

Yeah, the best they can do is require a stricter significance level. (IIRC, the justification for the standard "statistical significance" test involves something like taking the position of maximum ignorance by assuming that the null hypothesis is 50% likely to be true and then applying Bayes's Theorem to get a posterior probability...)

comment by JGWeissman · 2010-09-16T16:22:29.831Z · LW(p) · GW(p)

The key feature of Bayesian probability is the concept that probabilities are degrees of belief. The reason that Bayes' Theorem is associated with the Bayesian approach is that once you are willing, unlike frequentists, to have a probability that a coin comes up heads in 90% of flips, you can use Bayes' Theorem to update that probability on observing the evidence of 4 flips resulting in heads and 1 flip resulting in tails.

comment by arch1 · 2010-09-20T21:40:33.491Z · LW(p) · GW(p)

neq1,

Thanks for the great learning. One apparent inconsistency puzzled me a bit:

You go out of your way to say that you expect Θ_sens>1-Θ_spec (which looks sensible to me), then follow up by choosing a pair of priors which, assuming independence, appear to violate this expectation a significant fraction of the time - perhaps 20-25%, just eyeballing the distributions.

Can you comment on the apparent inconsistency, or straighten out my misconceptions?

comment by arch1 · 2010-09-18T06:53:24.741Z · LW(p) · GW(p)

Great post, thanks for the learning. I am a bit puzzled however as to why you go out of your way to say "In addition, we expect Θsens>1-Θspec," (which expectation BTW makes sense to me), only to follow this up by assuming priors for Θsens and Θspec which (assuming independence) violate this expectation with significant probability (perhaps 25% of the time, based on a crude eyeballing of the curves).

Can you help me better understand your thinking here (or clear up any misconception on my part)?

comment by PhilGoetz · 2010-09-16T20:15:04.674Z · LW(p) · GW(p)

Wait a minute - when the Bayesian says "I think the coin probably has a chance near 50% of being heads", she's using data from prior observations of coin flips to say that. Which means that the frequentist might get the same answer if he added those prior observations to his dataset.

Replies from: neq1, orthonormal

↑ comment by neq1 · 2010-09-17T10:07:04.533Z · LW(p) · GW(p)

Yes, that's a good point. That would be considered using a data augmentation prior (Sander Greenland has advocated such an approach).

↑ comment by orthonormal · 2010-09-17T00:01:41.883Z · LW(p) · GW(p)

You can dismiss this objection by replacing the coin with a novel experimental test with an easily computed expected probability of success, say, the very first test of spin-up vs. spin-down for silver atoms.

Frequentists can't claim relevant data sets for every experiment that has an obvious prior, without engaging in their own form of reference class tennis.

Replies from: PhilGoetz

↑ comment by PhilGoetz · 2010-09-20T15:59:42.953Z · LW(p) · GW(p)

How can they have an obvious prior without an obvious relevant data set?

comment by TobyBartels · 2010-09-16T18:36:24.319Z · LW(p) · GW(p)

There's really no contradiction between the classical and Bayesian results:

Classical: we do not have enough evidence to rule out the claim about the coin's bias.
Bayesian: the claim about the coin's bias is still considered extremely unlikely.

They are saying different things.

Replies from: PhilGoetz

↑ comment by PhilGoetz · 2010-09-16T20:10:29.463Z · LW(p) · GW(p)

If the 95% credible intervals are different, I think they are in contradiction.

Replies from: Douglas_Knight

↑ comment by Douglas_Knight · 2010-09-16T23:14:06.220Z · LW(p) · GW(p)

The frequentist does not supply a credible interval, but a confidence interval. The credible interval has to do with the posterior P(H|D) (and thus the prior P(H)), which the frequentist refuses to talk about. The confidence interval has to do with P(D|H) (with p-values and maybe likelihood ratios). Scientists often treat confidence intervals as credible intervals, and thus are wrong, but frequentist statisticians are more sophisticated. I'm not clear on what they say, though.

One advantage of the confidence interval is that it makes it clear that the test wasn't that powerful and that the bayesian is relying on the prior. The bayesian is not going to change the mind of the owner of the coin, who clearly has a different prior.

Incidentally, if someone claimed a coin produced 90% heads, I'd update away from such a sharply peaked distribution.

comment by JohnDavidBustard · 2010-09-16T15:56:35.111Z · LW(p) · GW(p)

Wow, this really brings home the arbitrary nature of the Bayesian approach. If we're trying to get an AI to determine what to do, it can't guess meaningful priors (and neither can we, come to that). I presume that when it is applied there are a load of theoretical approaches to estimating the prior, or is a uniform prior just used as a default? In which case, are there other occasions when a frequentist's and a Bayesian's probability estimates differ?

Replies from: DSimon, TobyBartels

↑ comment by DSimon · 2010-09-16T17:19:47.281Z · LW(p) · GW(p)

Sure, if the priors are arbitrary, the Bayesian approach's output is arbitrary. But if the priors make sense, the Bayesian approach works. Or in other words: just like any other algorithm good or bad, GIGO.

Replies from: wedrifid, JohnDavidBustard

↑ comment by wedrifid · 2010-09-16T18:41:33.514Z · LW(p) · GW(p)

Sure, if the priors are arbitrary, the Bayesian approach's output is arbitrary.

Some would suggest that taking arbitrary priors isn't the Bayesian approach.

↑ comment by JohnDavidBustard · 2010-09-16T18:43:05.901Z · LW(p) · GW(p)

I suppose the question is, how to calculate the priors so they do make sense. In particular, how can an AI estimate priors. I'm sure there is a lot of existing work on this. The problem with making statements about priors that don't have a formal process for their calculation is that there is no basis for comparing two predictions. In the worst case, by adjusting the prior the resulting probabilities can be adjusted to any value. That makes the approach a formal technique which is potentially just hiding the unknowns in the priors, and in effect no more reasonable, because the priors are a guess.

Replies from: satt, jsalvatier

↑ comment by satt · 2010-09-16T19:04:00.991Z · LW(p) · GW(p)

In particular, how can an AI estimate priors. I'm sure there is a lot of existing work on this.

There is. For example, one can use the Jeffreys prior, which has the desirable property of being invariant under different parametrization choices, or one can pick a prior according to the maximum entropy principle, which says to pick the prior with the greatest entropy that satisfies the model constraints. I don't know if anyone's come up with a meta-rationale that justifies one of these approaches over all others (or explains when to use different approaches), though.

Replies from: JohnDavidBustard

↑ comment by JohnDavidBustard · 2010-09-16T19:23:37.270Z · LW(p) · GW(p)

Thank you, this is very interesting. I'm not sure of the etiquette, but I'm reposting a question from an old article, that I would really appreciate your thoughts on.

Is it correct, to say that the entropy prior is a consequence of creating an internally consistent formalisation of the aesthetic heuristic of preferring simpler structures to complex ones?

If so I was wondering if it could be extended to reflect other aesthetics. For example, if an experiment produces a single result that is inconsistent with an existing simple physics theory, it may be that the simplest theory that explains this data is to treat this result as an isolated exception, however, aesthetically we find it more plausible that this exception is evidence of a larger theory that the sample is one part of.

In contrast when attempting to understand the rules of a human system (e.g. a bureaucracy) constructing a theory that lacked exceptions seems unlikely ("that's a little too neat"). Indeed when stated informally the phrase might go "in my experience, that's a little too neat" implying that we formulate priors based on learned patterns from experience. In the case of the bureaucracy, this may stem from a probabilistic understanding of the types of system that result from a particular 'maker' (i.e. politics).

However, this moves the problem to one of classifying contexts and determining which contexts are relevant, if this process is considered part of the theory, then it may considerably increase its complexity always preferring theories which ignore context. Unless of course the theory is complete (incorporating all contexts) in which case the simplest theory may share these contextual models and thus become the universal simplest model. It would therefore not be rational to apply Kolmogorov complexity to a problem in isolation. I.e. probability and reductionism are not compatible.

Replies from: satt

↑ comment by satt · 2010-09-16T20:41:53.880Z · LW(p) · GW(p)

With the disclaimer that I'm no expert and quite possibly wrong about some of this, here goes.

Is it correct, to say that the entropy prior is a consequence of creating an internally consistent formalisation of the aesthetic heuristic of preferring simpler structures to complex ones?

No. Or, at least, that's not the conscious motivation for the maximum entropy principle (MAXENT). As I see it, the justification for MAXENT is that entropy measures the "uncertainty" the prior represents, and we should choose the prior that represents greatest uncertainty, because that means assuming the least possible additional information about the problem.

Now, it does sometimes happen that MAXENT tells you to pick a prior with what I'd guess you think of as "simpler structure". Suppose you're hiding in your fist a 6-sided die I know nothing about, and you ask me to give you my probability distribution for which side'll come up when you roll it. As I know nothing about the die, I have no basis for imposing additional constraints on the problem, so the only operative constraint is that P(1) + P(2) + P(3) + P(4) + P(5) + P(6) = 1; given just that constraint, MAXENT says I should assign probability 1/6 to each side.

In that particular case, MAXENT gives a nice, smooth, intuitively pleasing result. But if we impose a new constraint, e.g. that the expected value of the die roll is 4.5 (instead of the 3.5 implied by the uniform distribution), MAXENT says the appropriate probability distribution is {0.054, 0.079, 0.114, 0.165, 0.240, 0.348} for sides 1 to 6 respectively (from here), which doesn't look especially simple to me. So for all but the most basic problems, I expect MAXENT doesn't conform to the "simpler structures" heuristic.
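For anyone who wants to check those numbers, here is a minimal sketch. It assumes the standard result that, under a mean constraint, the MAXENT distribution has the exponential form p_i ∝ exp(λ·i), and it finds λ by bisection. The function name `maxent_die` and the bracketing interval are illustrative choices, not taken from the linked source.

```python
import math

def maxent_die(target_mean, faces=(1, 2, 3, 4, 5, 6), tol=1e-12):
    """Maximum-entropy distribution over `faces` with a fixed expected value.

    Assumes the solution has the form p_i proportional to exp(lam * face_i),
    and solves for lam by bisection (the mean is increasing in lam).
    """
    def mean_for(lam):
        weights = [math.exp(lam * f) for f in faces]
        total = sum(weights)
        return sum(f * w for f, w in zip(faces, weights)) / total

    lo, hi = -50.0, 50.0  # assumed bracket; wide enough for any mean strictly between 1 and 6
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mean_for(mid) < target_mean:
            lo = mid
        else:
            hi = mid

    lam = (lo + hi) / 2
    weights = [math.exp(lam * f) for f in faces]
    total = sum(weights)
    return [w / total for w in weights]

uniform_case = maxent_die(3.5)   # no extra information: lam is ~0, so all sides equal
skewed_case = maxent_die(4.5)    # the constrained case discussed above
print([round(p, 3) for p in uniform_case])
print([round(p, 3) for p in skewed_case])
```

With a target mean of 3.5 the sketch recovers the uniform 1/6 assignment, and with 4.5 it should agree with the quoted figures up to rounding in the last digit.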

There is probably some definition of "simple" or "complex" that would make your heuristic equivalent to MAXENT, but I doubt it'd correspond to how we normally think of simplicity/complexity.

Replies from: JohnDavidBustard

↑ comment by JohnDavidBustard · 2010-09-16T21:08:56.556Z · LW(p) · GW(p)

Thank you, that's very interesting, and comforting.

↑ comment by jsalvatier · 2010-09-21T15:13:29.204Z · LW(p) · GW(p)

In statistics, I think 'weakly informative priors' are becoming more popular. Weakly informative priors are distributions like a t distribution (or normal) with a really wide standard deviation and low degrees of freedom. This allows us to avoid spending all our data on merely narrowing down the correct order of magnitude, which can be an issue in many problems that use non-informative priors. It's almost never the case that we literally know nothing prior to the data.

Replies from: Matt_Simpson

↑ comment by Matt_Simpson · 2010-09-21T15:15:14.768Z · LW(p) · GW(p)

Using a normal with a massive variance is also a standard hack for getting a proper "uninformative" prior on the real line.

↑ comment by TobyBartels · 2010-09-16T18:33:34.018Z · LW(p) · GW(p)

Hopefully an AI will be able to get its hands on large amounts of data. Once it has that, it doesn't matter very much what its priors were.
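A rough illustration of this point, as a sketch under stated assumptions: a coin-like process with a conjugate Beta prior, where the two priors (Beta(1, 1) and a deliberately skewed Beta(20, 2)) and the observed rate of 0.3 are made up for the example. The helper names are hypothetical.

```python
def posterior_mean(prior_a, prior_b, successes, failures):
    """Posterior mean of a success probability under a Beta(prior_a, prior_b) prior."""
    return (prior_a + successes) / (prior_a + prior_b + successes + failures)

def prior_gap(n, rate=0.3):
    """Gap between the posterior means of two different priors after n observations.

    Assumes exactly round(rate * n) successes were observed (illustrative data only).
    """
    successes = round(rate * n)
    failures = n - successes
    flat = posterior_mean(1, 1, successes, failures)     # uniform prior
    skewed = posterior_mean(20, 2, successes, failures)  # prior strongly favouring high rates
    return abs(flat - skewed)

for n in (10, 1000, 100000):
    print(n, prior_gap(n))
```

The gap between the two posteriors shrinks as n grows, which is the sense in which the prior stops mattering. This only holds for priors that give nonzero probability to the truth.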

Replies from: Jonathan_Graehl, JohnDavidBustard

↑ comment by Jonathan_Graehl · 2010-09-16T22:30:53.753Z · LW(p) · GW(p)

Agreed, but the priors can in principle be strong enough that hypothesis A will always be favored over B no matter how much data you have, even though B gives an orders-of-magnitude higher likelihood, i.e. P(data|B) is far greater than P(data|A).
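To make the arithmetic concrete, here is a small sketch using the odds form of Bayes' theorem (posterior odds = prior odds × likelihood ratio). It assumes A and B are the only two hypotheses and that the prior for B is strictly less than 1; the specific numbers are invented for illustration.

```python
def posterior_b(prior_b, likelihood_ratio):
    """P(B | data) when A and B are the only hypotheses.

    likelihood_ratio is P(data | B) / P(data | A).
    Assumes 0 <= prior_b < 1.
    """
    odds = prior_b / (1.0 - prior_b) * likelihood_ratio
    return odds / (1.0 + odds)

# A prior of one in a trillion for B survives a 1000x likelihood advantage...
print(posterior_b(1e-12, 1e3))
# ...but not a 10^20 advantage.
print(posterior_b(1e-12, 1e20))
# A prior of exactly zero never moves, whatever the data say.
print(posterior_b(0.0, 1e20))
```

So a very small but nonzero prior can hold out against any fixed amount of evidence, while only a prior of exactly zero holds out against all of it.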

↑ comment by JohnDavidBustard · 2010-09-16T18:53:39.710Z · LW(p) · GW(p)

Is there a bound on the amount of data that is necessary to adjust a prior of a given error magnitude? Likewise, if the probability is the result of a changing system, I presume it could well be the case that the pdf estimates will be consistently inaccurate, as they are constantly adjusting to events whose local probability is changing. Does the Bayesian approach help over, say, model fitting to arbitrary samples? Is it, in effect, an example of a model fitting strategy no more reasonable than any other?

comment by Clippy · 2010-09-16T20:45:10.365Z · LW(p) · GW(p)

I asked this before (though perhaps with the wrong symbols), so I'll try it a different way.

Someone please evaluate this statement:

?equals(correct reasoning, Bayesian inference)

Replies from: Perplexed

↑ comment by Perplexed · 2010-09-17T01:00:34.583Z · LW(p) · GW(p)

There is a (pretty common) class of reasoning problems for which equals(correct reasoning, Bayesian inference) is widely believed here on LW. There are other problems and other forms of correct reasoning for which Bayesian inference is simply inapplicable.

For example, the following syllogism cannot be completed by Bayesian inference.

A. Nothing is better than total happiness.

B. A paper clip is better than nothing.

C. ???

Replies from: Clippy, Seth_Goldin

↑ comment by Clippy · 2010-09-17T15:26:45.023Z · LW(p) · GW(p)

That's not a very good example, because a paperclip is, in fact, better than happiness. At least, human::happiness.

Replies from: jsalvatier

↑ comment by jsalvatier · 2010-09-21T15:17:35.707Z · LW(p) · GW(p)

This is a great LW name.

↑ comment by Seth_Goldin · 2010-09-19T14:04:48.976Z · LW(p) · GW(p)

This is a fallacious amphiboly, so it's deductively wrong. There's no need to even bring up induction here, and Bayesian inference is for induction. It's a category error to criticize Bayesian inference for not applying here. It would be like asking Bayesian inference to cook me dinner.

comment by DanielLC · 2010-09-16T16:17:36.404Z · LW(p) · GW(p)

The way I think of it is that the Bayesian approach only uses Bayes' theorem. Frequentists also use P(A) = (number of times it was A)/(total number of times). The other branch that I don't remember the name of assumes that all probabilities are equal i.e. P(A) = P(B) = P(C) = ...

Replies from: orthonormal, ata

↑ comment by orthonormal · 2010-09-16T23:51:21.278Z · LW(p) · GW(p)

I can understand how you'd get that confusion from the names of the approaches, but you've got it rather wrong. Bayesians incorporate all evidence that frequentists use, including observed frequencies in large data sets; this results in their posterior distributions being centered very narrowly on the frequentists' point estimate.

Replies from: DanielLC

↑ comment by DanielLC · 2010-09-17T00:04:27.937Z · LW(p) · GW(p)

In large data sets the Bayesian method gets a similar answer, but it's not the same method. If you flip a coin once, and get heads, the frequentist method would say that the coin always lands on heads. The Bayesian method would never result in saying the coin always lands on heads unless it was assumed from the beginning.
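The single-flip case can be written out as a sketch. It takes "the frequentist method" to mean the sample proportion (the maximum-likelihood estimate), and assumes a uniform Beta(1, 1) prior for the Bayesian side; both are simplifying assumptions, and the function names are made up for the example.

```python
from fractions import Fraction

def sample_proportion(heads, flips):
    """Maximum-likelihood estimate of P(heads): observed heads over total flips."""
    return Fraction(heads, flips)

def posterior_mean_uniform_prior(heads, flips):
    """Posterior mean of P(heads) under an assumed uniform Beta(1, 1) prior."""
    return Fraction(heads + 1, flips + 2)

# One flip, one head.
print(sample_proportion(1, 1))             # 1
print(posterior_mean_uniform_prior(1, 1))  # 2/3

# With many flips the two estimates nearly coincide.
print(float(sample_proportion(700, 1000)), float(posterior_mean_uniform_prior(700, 1000)))
```

The sample proportion jumps straight to 1 after a single head, while the posterior mean only moves to 2/3; with 1000 flips the two differ by less than 0.001.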

Replies from: orthonormal

↑ comment by orthonormal · 2010-09-17T00:15:54.705Z · LW(p) · GW(p)

I didn't expect I'd end up saying this, but frequentists aren't that naive either.

Replies from: Oscar_Cunningham, DanielLC

↑ comment by Oscar_Cunningham · 2010-09-17T15:03:45.059Z · LW(p) · GW(p)

What does a frequentist do in this situation?

↑ comment by DanielLC · 2010-09-17T04:33:06.520Z · LW(p) · GW(p)

They won't use that method when it gives results that absurd, but that's still what the method says they should do.

↑ comment by ata · 2010-09-17T04:39:38.622Z · LW(p) · GW(p)

The other branch that I don't remember the name of assumes that all probabilities are equal i.e. P(A) = P(B) = P(C) = ...

Uh... is that a real thing?

Replies from: CronoDAS

↑ comment by CronoDAS · 2010-09-17T18:53:45.163Z · LW(p) · GW(p)

I think that was supposed to be a reference to the Principle of indifference...
