An Intuitive Explanation of Bayes's Theorem

post by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2003-01-01T20:00:30.058Z · LW · GW · 10 comments

(Note: The author now considers this explanation obsoleted by the Bayes' Rule Guide.)

[Editor’s Note: This is an abridgement of the original version of this essay, which contained many interactive elements.]

Your friends and colleagues are talking about something called “Bayes’s Theorem” or “Bayes’s Rule,” or something called Bayesian reasoning. They sound really enthusiastic about it, too, so you google and find a web page about Bayes’s Theorem and . . .

It’s this equation. That’s all. Just one equation. The page you found gives a definition of it, but it doesn’t say what it is, or why it’s useful, or why your friends would be interested in it. It looks like this random statistics thing.

Why does a mathematical concept generate this strange enthusiasm in its students? What is the so-called Bayesian Revolution now sweeping through the sciences, which claims to subsume even the experimental method itself as a special case? What is the secret that the adherents of Bayes know? What is the light that they have seen?

Soon you will know. Soon you will be one of us.

While there are a few existing online explanations of Bayes’s Theorem, my experience with trying to introduce people to Bayesian reasoning is that the existing online explanations are too abstract. Bayesian reasoning is very counterintuitive. People do not employ Bayesian reasoning intuitively, find it very difficult to learn Bayesian reasoning when tutored, and rapidly forget Bayesian methods once the tutoring is over. This holds equally true for novice students and highly trained professionals in a field. Bayesian reasoning is apparently one of those things which, like quantum mechanics or the Wason Selection Test, is inherently difficult for humans to grasp with our built-in mental faculties.

Or so they claim. Here you will find an attempt to offer an intuitive explanation of Bayesian reasoning—an excruciatingly gentle introduction that invokes all the human ways of grasping numbers, from natural frequencies to spatial visualization. The intent is to convey, not abstract rules for manipulating numbers, but what the numbers mean, and why the rules are what they are (and cannot possibly be anything else). When you are finished reading this, you will see Bayesian problems in your dreams.

And let’s begin.


Here’s a story problem about a situation that doctors often encounter:

1% of women at age forty who participate in routine screening have breast cancer. 80% of women with breast cancer will get positive mammograms. 9.6% of women without breast cancer will also get positive mammograms. A woman in this age group had a positive mammogram in a routine screening. What is the probability that she actually has breast cancer?

What do you think the answer is? If you haven’t encountered this kind of problem before, please take a moment to come up with your own answer before continuing.


Next, suppose I told you that most doctors get the same wrong answer on this problem—usually, only around 15% of doctors get it right. (“Really? 15%? Is that a real number, or an urban legend based on an Internet poll?” It’s a real number. See Casscells, Schoenberger, and Graboys 1978;[1] Eddy 1982;[2] Gigerenzer and Hoffrage 1995;[3] and many other studies. It’s a surprising result which is easy to replicate, so it’s been extensively replicated.)

On the story problem above, most doctors estimate the probability to be between 70% and 80%, which is wildly incorrect.

Here’s an alternate version of the problem on which doctors fare somewhat better:

10 out of 1,000 women at age forty who participate in routine screening have breast cancer. 800 out of 1,000 women with breast cancer will get positive mammograms. 96 out of 1,000 women without breast cancer will also get positive mammograms. If 1,000 women in this age group undergo a routine screening, about what fraction of women with positive mammograms will actually have breast cancer?

And finally, here’s the problem on which doctors fare best of all, with 46% (nearly half) arriving at the correct answer:

100 out of 10,000 women at age forty who participate in routine screening have breast cancer. 80 of every 100 women with breast cancer will get a positive mammogram. 950 out of 9,900 women without breast cancer will also get a positive mammogram. If 10,000 women in this age group undergo a routine screening, about what fraction of women with positive mammograms will actually have breast cancer?


The correct answer is 7.8%, obtained as follows: Out of 10,000 women, 100 have breast cancer; 80 of those 100 have positive mammograms. From the same 10,000 women, 9,900 will not have breast cancer and of those 9,900 women, 950 will also get positive mammograms. This makes the total number of women with positive mammograms 950 + 80 or 1,030. Of those 1,030 women with positive mammograms, 80 will have cancer. Expressed as a proportion, this is 80/1,030 or 0.07767 or 7.8%.

To put it another way, before the mammography, the 10,000 women can be divided into two groups:

  • Group 1: 100 women with breast cancer.

  • Group 2: 9,900 women without breast cancer.

Summing these two groups gives a total of 10,000 patients, confirming that none have been lost in the math. After the mammography, the women can be divided into four groups:

  • Group A: 80 women with breast cancer and a positive mammogram.

  • Group B: 20 women with breast cancer and a negative mammogram.

  • Group C: 950 women without breast cancer and a positive mammogram.

  • Group D: 8,950 women without breast cancer and a negative mammogram.

The sum of groups A and B, the groups with breast cancer, corresponds to group 1; and the sum of groups C and D, the groups without breast cancer, corresponds to group 2. If you administer a mammography to 10,000 patients, then out of the 1,030 with positive mammograms, eighty of those positive-mammogram patients will have cancer. This is the correct answer, the answer a doctor should give a positive-mammogram patient if she asks about the chance she has breast cancer; if thirteen patients ask this question, roughly one out of those thirteen will have cancer.
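The bookkeeping above is easy to check mechanically. Here is a minimal Python sketch of the 10,000-women version; the variable names are my own labels for the groups in the text, not anything from the cited studies.

```python
# Natural-frequency bookkeeping for the 10,000-women version of the problem.
total = 10_000
with_cancer = 100                       # Group 1
without_cancer = total - with_cancer    # Group 2: 9,900

group_a = 80                            # cancer, positive mammogram
group_b = with_cancer - group_a         # cancer, negative mammogram
group_c = 950                           # no cancer, positive mammogram
group_d = without_cancer - group_c      # no cancer, negative mammogram

# Nobody is lost in the math: the four groups still add up to everyone.
assert group_a + group_b + group_c + group_d == total

positives = group_a + group_c           # everyone with a positive mammogram
p_cancer_given_positive = group_a / positives

print(f"{positives} positives, {p_cancer_given_positive:.1%} of them with cancer")
```

Run as written, this should report 1,030 positive mammograms and a 7.8% chance of cancer among them, matching the figures above.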


The most common mistake is to ignore the original fraction of women with breast cancer, and the fraction of women without breast cancer who receive false positives, and focus only on the fraction of women with breast cancer who get positive results. For example, the vast majority of doctors in these studies seem to have thought that if around 80% of women with breast cancer have positive mammograms, then the probability of a woman with a positive mammogram having breast cancer must be around 80%.

Figuring out the final answer always requires all three pieces of information—the percentage of women with breast cancer, the percentage of women without breast cancer who receive false positives, and the percentage of women with breast cancer who receive (correct) positives.

The original proportion of patients with breast cancer is known as the prior probability. The chance that a patient with breast cancer gets a positive mammogram, and the chance that a patient without breast cancer gets a positive mammogram, are known as the two conditional probabilities. Collectively, this initial information is known as the priors. The final answer—the estimated probability that a patient has breast cancer, given that we know she has a positive result on her mammogram—is known as the revised probability or the posterior probability. What we’ve just seen is that the posterior probability depends in part on the prior probability.

To see that the final answer always depends on the original fraction of women with breast cancer, consider an alternate universe in which only one woman out of a million has breast cancer. Even if mammography in this world detects breast cancer in 8 out of 10 cases, while returning a false positive on a woman without breast cancer in only 1 out of 10 cases, there will still be more than a hundred thousand false positives for every real case of cancer detected. The original probability that a woman has cancer is so extremely low that, although a positive result on the mammogram does increase the estimated probability, the probability isn’t increased to certainty or even “a noticeable chance”; the probability goes from 1:1,000,000 to about 8:1,000,000, or roughly 1:125,000.

What this demonstrates is that the mammogram result doesn’t replace your old information about the patient’s chance of having cancer; the mammogram slides the estimated probability in the direction of the result. A positive result slides the original probability upward; a negative result slides the probability downward. For example, in the original problem where 1% of the women have cancer, 80% of women with cancer get positive mammograms, and 9.6% of women without cancer get positive mammograms, a positive result on the mammogram slides the 1% chance upward to 7.8%.

Most people encountering problems of this type for the first time carry out the mental operation of replacing the original 1% probability with the 80% probability that a woman with cancer gets a positive mammogram. It may seem like a good idea, but it just doesn’t work. “The probability that a woman with a positive mammogram has breast cancer” is not at all the same thing as “the probability that a woman with breast cancer has a positive mammogram”; they are as unlike as apples and cheese.


Q. Why did the Bayesian reasoner cross the road?

A. You need more information to answer this question.


Suppose that a barrel contains many small plastic eggs. Some eggs are painted red and some are painted blue. 40% of the eggs in the barrel contain pearls, and 60% contain nothing. 30% of eggs containing pearls are painted blue, and 10% of eggs containing nothing are painted blue. What is the probability that a blue egg contains a pearl? For this example the arithmetic is simple enough that you may be able to do it in your head, and I would suggest trying to do so.

A more compact way of specifying the problem:

P (pearl) = 40%

P (blue|pearl) = 30%

P (blue|¬pearl) = 10%

P (pearl|blue) = ?

The symbol “¬” is shorthand for “not,” so ¬pearl reads “not pearl.”

The notation P (blue|pearl) is shorthand for “the probability of blue given pearl” or “the probability that an egg is painted blue, given that the egg contains a pearl.” The item on the right side is what you already know or the premise, and the item on the left side is the implication or conclusion. If we have P (blue|pearl) = 30%, and we already know that some egg contains a pearl, then we can conclude there is a 30% chance that the egg is painted blue. Thus, the final fact we’re looking for—“the chance that a blue egg contains a pearl” or “the probability that an egg contains a pearl, if we know the egg is painted blue”—reads P (pearl|blue).

40% of the eggs contain pearls, and 60% of the eggs contain nothing. 30% of the eggs containing pearls are painted blue, so 12% of the eggs altogether contain pearls and are painted blue. 10% of the eggs containing nothing are painted blue, so altogether 6% of the eggs contain nothing and are painted blue. A total of 18% of the eggs are painted blue, and a total of 12% of the eggs are painted blue and contain pearls, so the chance a blue egg contains a pearl is 12/18 or 2/3 or around 67%.

As before, we can see the necessity of all three pieces of information by considering extreme cases. In a (large) barrel in which only one egg out of a thousand contains a pearl, knowing that an egg is painted blue slides the probability from 0.1% to 0.3% (instead of sliding the probability from 40% to 67%). Similarly, if 999 out of 1,000 eggs contain pearls, knowing that an egg is blue slides the probability from 99.9% to 99.966%; the probability that the egg does not contain a pearl goes from 1/1,000 to around 1/3,000.
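The pearl-egg arithmetic, including the two extreme barrels, can be sketched in a few lines of Python. The function name `posterior` and its parameter names are my own choices for illustration; the numbers are the ones from the problem.

```python
def posterior(prior, p_obs_given_true, p_obs_given_false):
    """Probability of the hypothesis after seeing the observation.

    prior:             P(pearl)
    p_obs_given_true:  P(blue|pearl)
    p_obs_given_false: P(blue|no pearl)
    """
    true_and_obs = prior * p_obs_given_true            # pearl and blue
    false_and_obs = (1 - prior) * p_obs_given_false    # empty and blue
    return true_and_obs / (true_and_obs + false_and_obs)


# The original barrel, then the two extreme barrels.
for p_pearl in (0.40, 0.001, 0.999):
    p_pearl_given_blue = posterior(p_pearl, 0.30, 0.10)
    print(f"P(pearl) = {p_pearl:.1%} -> P(pearl|blue) = {p_pearl_given_blue:.3%}")
```

The same two conditional probabilities give very different answers depending on the prior: about 67% for the original barrel, about 0.3% when pearls are rare, and about 99.97% when pearls are nearly universal.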

On the pearl-egg problem, most respondents unfamiliar with Bayesian reasoning would probably respond that the probability a blue egg contains a pearl is 30%, or perhaps 20% (the 30% chance of a true positive minus the 10% chance of a false positive). Even if this mental operation seems like a good idea at the time, it makes no sense in terms of the question asked. It’s like the experiment in which you ask a second-grader: “If eighteen people get on a bus, and then seven more people get on the bus, how old is the bus driver?” Many second-graders will respond: “Twenty-five.” They understand when they’re being prompted to carry out a particular mental procedure, but they haven’t quite connected the procedure to reality. Similarly, to find the probability that a woman with a positive mammogram has breast cancer, it makes no sense whatsoever to replace the original probability that the woman has cancer with the probability that a woman with breast cancer gets a positive mammogram. Neither can you subtract the probability of a false positive from the probability of the true positive. These operations are as wildly irrelevant as adding the number of people on the bus to find the age of the bus driver.


A study by Gigerenzer and Hoffrage in 1995 showed that some ways of phrasing story problems are much more evocative of correct Bayesian reasoning.[4] The least evocative phrasing used probabilities. A slightly more evocative phrasing used frequencies instead of probabilities; the problem remained the same, but instead of saying that 1% of women had breast cancer, one would say that 1 out of 100 women had breast cancer, that 80 out of 100 women with breast cancer would get a positive mammogram, and so on. Why did a higher proportion of subjects display Bayesian reasoning on this problem? Probably because saying “1 out of 100 women” encourages you to concretely visualize X women with cancer, leading you to visualize X women with cancer and a positive mammogram, etc.

The most effective presentation found so far is what’s known as natural frequencies—saying that 40 out of 100 eggs contain pearls, 12 out of 40 eggs containing pearls are painted blue, and 6 out of 60 eggs containing nothing are painted blue. A natural frequencies presentation is one in which the information about the prior probability is included in presenting the conditional probabilities. If you were just learning about the eggs’ conditional probabilities through natural experimentation, you would—in the course of cracking open a hundred eggs—crack open around 40 eggs containing pearls, of which 12 eggs would be painted blue, while cracking open 60 eggs containing nothing, of which about 6 would be painted blue. In the course of learning the conditional probabilities, you’d see examples of blue eggs containing pearls about twice as often as you saw examples of blue eggs containing nothing.

Unfortunately, while natural frequencies are a step in the right direction, they probably won’t be enough. When problems are presented in natural frequencies, the proportion of people using Bayesian reasoning rises to around half. A big improvement, but not big enough when you’re talking about real doctors and real patients.


Q. How can I find the priors for a problem?

A. Many commonly used priors are listed in the Handbook of Chemistry and Physics.

Q. Where do priors originally come from?

A. Never ask that question.

Q. Uh huh. Then where do scientists get their priors?

A. Priors for scientific problems are established by annual vote of the AAAS. In recent years the vote has become fractious and controversial, with widespread acrimony, factional polarization, and several outright assassinations. This may be a front for infighting within the Bayes Council, or it may be that the disputants have too much spare time. No one is really sure.

Q. I see. And where does everyone else get their priors?

A. They download their priors from Kazaa.

Q. What if the priors I want aren’t available on Kazaa?

A. There’s a small, cluttered antique shop in a back alley of San Francisco’s Chinatown. Don’t ask about the bronze rat.

Actually, priors are true or false just like the final answer—they reflect reality and can be judged by comparing them against reality. For example, if you think that 920 out of 10,000 women in a sample have breast cancer, and the actual number is 100 out of 10,000, then your priors are wrong. For our particular problem, the priors might have been established by three studies—a study on the case histories of women with breast cancer to see how many of them get a positive mammogram, a study on women without breast cancer to see how many of them get a positive mammogram, and an epidemiological study on the prevalence of breast cancer in some specific demographic.


The probability P (A, B) is the same as P (B, A), but P (A|B) is not the same thing as P (B|A), and P (A, B) is completely different from P (A|B). It’s a common confusion to mix up some or all of these quantities.

To get acquainted with all the relationships between them, we’ll play “follow the degrees of freedom.” For example, the two quantities P (cancer) and P (¬cancer) have one degree of freedom between them, because of the general law P (A) + P (¬A) = 1. If you know that P (¬cancer) = 0.99, you can obtain P (cancer) = 1 − P (¬cancer) = 0.01.

The quantities P (positive|cancer) and P (¬positive|cancer) also have only one degree of freedom between them; either a woman with breast cancer gets a positive mammogram or she doesn’t. On the other hand, P (positive|cancer) and P (positive|¬cancer) have two degrees of freedom. You can have a mammography that returns positive for 80% of cancer patients and 9.6% of healthy patients, or that returns positive for 70% of cancer patients and 2% of healthy patients, or even a health test that returns “positive” for 30% of cancer patients and 92% of healthy patients. The two quantities, the output of the mammography for cancer patients and the output of the mammography for healthy patients, are in mathematical terms independent; one cannot be obtained from the other in any way, and so they have two degrees of freedom between them.

What about P (positive, cancer), P (positive|cancer), and P (cancer)? Here we have three quantities; how many degrees of freedom are there? In this case the equation that must hold is

P (positive, cancer) = P (positive|cancer) × P (cancer) .

This equality reduces the degrees of freedom by one. If we know the fraction of patients with cancer, and the chance that a cancer patient has a positive mammogram, we can deduce the fraction of patients who have breast cancer and a positive mammogram by multiplying.

Similarly, if we know the number of patients with breast cancer and positive mammograms, and also the number of patients with breast cancer, we can estimate the chance that a woman with breast cancer gets a positive mammogram by dividing: P (positive|cancer) = P (positive, cancer)/P (cancer). In fact, this is exactly how such medical diagnostic tests are calibrated; you do a study on 8,520 women with breast cancer and see that there are 6,816 (or thereabouts) women with breast cancer and positive mammograms, then divide 6,816 by 8,520 to find that 80% of women with breast cancer had positive mammograms. (Incidentally, if you accidentally divide 8,520 by 6,816 instead of the other way around, your calculations will start doing strange things, such as insisting that 125% of women with breast cancer and positive mammograms have breast cancer. This is a common mistake in carrying out Bayesian arithmetic, in my experience.) And finally, if you know P (positive, cancer) and P (positive|cancer), you can deduce how many cancer patients there must have been originally. There are two degrees of freedom shared out among the three quantities; if we know any two, we can deduce the third.
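As a rough illustration of "know any two, deduce the third," here is a small Python sketch using the calibration-study counts from the paragraph above. The 1% prevalence is borrowed from the original mammography problem, and the variable names are mine.

```python
# Calibration-study counts from the text (all of these women have breast cancer).
n_cancer = 8_520
n_cancer_and_positive = 6_816

# Conditional = joint / marginal. Dividing counts is the same as dividing
# the corresponding probabilities, since the study size cancels out.
p_positive_given_cancer = n_cancer_and_positive / n_cancer

# Dividing the wrong way round yields a "probability" above 100%.
inverted = n_cancer / n_cancer_and_positive

# With a 1% prevalence (assumed here, taken from the original problem),
# any two of the three quantities pin down the third.
p_cancer = 0.01
p_joint = p_positive_given_cancer * p_cancer             # P(positive, cancer)
recovered_p_cancer = p_joint / p_positive_given_cancer   # back to P(cancer)

print(p_positive_given_cancer, inverted, p_joint, recovered_p_cancer)
```

A result greater than 1, like the inverted ratio here, is a quick sign that numerator and denominator have been swapped.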

How about P (positive), P (positive, cancer), and P (positive, ¬cancer)? Again there are only two degrees of freedom among these three variables. The equation occupying the extra degree of freedom is

P (positive) = P (positive, cancer) + P (positive, ¬cancer) .

This is how P (positive) is computed to begin with; we figure out the number of women with breast cancer who have positive mammograms, and the number of women without breast cancer who have positive mammograms, then add them together to get the total number of women with positive mammograms. It would be very strange to go out and conduct a study to determine the number of women with positive mammograms (just that one number and nothing else), but in theory you could do so. And if you then conducted another study and found the number of those women who had positive mammograms and breast cancer, you would also know the number of women with positive mammograms and no breast cancer, since either a woman with a positive mammogram has breast cancer or she doesn’t. In general, P (A, B) + P (A, ¬B) = P (A). Symmetrically, P (A, B) + P (¬A, B) = P (B).

What about P (positive, cancer), P (positive, ¬cancer), P (¬positive, cancer), and P (¬positive, ¬cancer)? You might at first be tempted to think that there are only two degrees of freedom for these four quantities—that you can, for example, get P (positive, ¬cancer) by multiplying P (positive) × P(¬cancer), and thus that all four quantities can be found given only the two quantities P(positive) and P(cancer). This is not the case! P (positive, ¬cancer) = P (positive) × P (¬cancer) only if the two probabilities are statistically independent—if the chance that a woman has breast cancer has no bearing on whether she has a positive mammogram. This amounts to requiring that the two conditional probabilities be equal to each other—a requirement which would eliminate one degree of freedom. If you remember that these four quantities are the groups A, B, C, and D, you can look over those four groups and realize that, in theory, you can put any number of people into the four groups. If you start with a group of 80 women with breast cancer and positive mammograms, there’s no reason why you can’t add another group of 500 women with breast cancer and negative mammograms, followed by a group of 3 women without breast cancer and negative mammograms, and so on. So now it seems like the four quantities have four degrees of freedom. And they would, except that in expressing them as probabilities, we need to normalize them to fractions of the complete group, which adds the constraint that P (positive, cancer) + P (positive, ¬cancer) + P (¬positive, cancer) + P (¬positive, ¬cancer) = 1. This equation takes up one degree of freedom, leaving three degrees of freedom among the four quantities. If you specify the fractions of women in groups A, B, and D, you can deduce the fraction of women in group C.

Given the four groups A, B, C, and D, it is very straightforward to compute everything else:

P(cancer) = (A + B) / (A + B + C + D)

P (¬positive|cancer) = B / (A + B)

and so on. Since {A, B, C, D} contains three degrees of freedom, it follows that the entire set of probabilities relating cancer rates to test results contains only three degrees of freedom. Remember that in our problems we always needed three pieces of information—the prior probability and the two conditional probabilities—which, indeed, have three degrees of freedom among them. Actually, for Bayesian problems, any three quantities with three degrees of freedom between them should logically specify the entire problem.
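To make "compute everything else" concrete, here is a hedged Python sketch that takes the four group sizes and derives the other quantities. The function name `summarize` and the dictionary keys are hypothetical labels of mine; the counts are groups A through D from the 10,000-women example.

```python
def summarize(a, b, c, d):
    """Derive the usual probabilities from the four group sizes.

    a: cancer and positive      b: cancer and negative
    c: no cancer and positive   d: no cancer and negative
    """
    total = a + b + c + d
    return {
        "p_cancer": (a + b) / total,
        "p_positive": (a + c) / total,
        "p_positive_given_cancer": a / (a + b),
        "p_negative_given_cancer": b / (a + b),
        "p_positive_given_healthy": c / (c + d),
        "p_cancer_given_positive": a / (a + c),
    }


table = summarize(80, 20, 950, 8_950)
for name, value in table.items():
    print(f"{name:26s} {value:.4f}")
```

Only the ratios between the four inputs matter: scaling all four counts by the same factor leaves every output unchanged, which is the normalization constraint that removes one degree of freedom.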


The probability that a test gives a true positive divided by the probability that a test gives a false positive is known as the likelihood ratio of that test. The likelihood ratio for a positive result summarizes how much a positive result will slide the prior probability. Does the likelihood ratio of a medical test then sum up everything there is to know about the usefulness of the test?

No, it does not! The likelihood ratio sums up everything there is to know about the meaning of a positive result on the medical test, but the meaning of a negative result on the test is not specified, nor is the frequency with which the test is useful. For example, a mammography with a hit rate of 80% for patients with breast cancer and a false positive rate of 9.6% for healthy patients has the same likelihood ratio as a test with an 8% hit rate and a false positive rate of 0.96%. Although these two tests have the same likelihood ratio, the first test is more useful in every way—it detects disease more often, and a negative result is stronger evidence of health.
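A quick Python sketch shows the difference between the two tests. The "negative" ratio computed below, the chance of a negative result with the disease divided by the chance of a negative result without it, is my own addition for illustration; the text only defines the ratio for positive results.

```python
def likelihood_ratios(hit_rate, false_positive_rate):
    """Return (ratio for a positive result, ratio for a negative result)."""
    positive_lr = hit_rate / false_positive_rate
    # Illustrative assumption: the analogous ratio for a negative result.
    negative_lr = (1 - hit_rate) / (1 - false_positive_rate)
    return positive_lr, negative_lr


for hit, fp in [(0.80, 0.096), (0.08, 0.0096)]:
    pos_lr, neg_lr = likelihood_ratios(hit, fp)
    print(f"hit rate {hit:.0%}, false positives {fp:.2%}: "
          f"positive ratio {pos_lr:.2f}, negative ratio {neg_lr:.3f}")
```

Both tests share the same ratio for a positive result, but a negative result on the first test cuts the odds of cancer to roughly a fifth, while a negative result on the second barely moves them.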


Suppose that you apply two tests for breast cancer in succession—say, a standard mammogram and also some other test which is independent of mammography. Since I don’t know of any such test that is independent of mammography, I’ll invent one for the purpose of this problem, and call it the Tams-Braylor Division Test, which checks to see if any cells are dividing more rapidly than other cells. We’ll suppose that the Tams-Braylor gives a true positive for 90% of patients with breast cancer, and gives a false positive for 5% of patients without cancer. Let’s say the prior prevalence of breast cancer is 1%. If a patient gets a positive result on her mammogram and her Tams-Braylor, what is the revised probability she has breast cancer?

One way to solve this problem would be to take the revised probability for a positive mammogram, which we already calculated as 7.8%, and plug that into the Tams-Braylor test as the new prior probability. If we do this, we find that the result comes out to 60%.
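Chaining the two updates looks like this in Python. This is only a sketch that assumes, as the problem does, that the two tests are independent; the helper name `update` is mine.

```python
def update(prior, true_positive_rate, false_positive_rate):
    """Revised probability of cancer after one positive test result."""
    hit = prior * true_positive_rate
    false_alarm = (1 - prior) * false_positive_rate
    return hit / (hit + false_alarm)


# Mammogram first: 80% true positives, 9.6% false positives, 1% prior.
after_mammogram = update(0.01, 0.80, 0.096)

# The posterior becomes the prior for the (fictional) Tams-Braylor test.
after_both = update(after_mammogram, 0.90, 0.05)

print(f"{after_mammogram:.3f} -> {after_both:.3f}")
```

The second number should come out near 0.60, and swapping the order of the two tests gives the same final answer.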

Suppose that the prior prevalence of breast cancer in a demographic is 1%. Suppose that we, as doctors, have a repertoire of three independent tests for breast cancer. Our first test, test A, a mammography, has a likelihood ratio of 80%/9.6% = 8.33. The second test, test B, has a likelihood ratio of 18.0 (for example, from 90% versus 5%); and the third test, test C, has a likelihood ratio of 3.5 (which could be from 70% versus 20%, or from 35% versus 10%; it makes no difference). Suppose a patient gets a positive result on all three tests. What is the probability the patient has breast cancer?

Here’s a fun trick for simplifying the bookkeeping. If the prior prevalence of breast cancer in a demographic is 1%, then 1 out of 100 women have breast cancer, and 99 out of 100 women do not have breast cancer. So if we rewrite the probability of 1% as an odds ratio, the odds are 1:99.

And the likelihood ratios of the three tests A, B, and C are:

8.33 : 1 = 25 : 3

18.0 : 1 = 18 : 1

3.5 : 1 = 7 : 2 .

The odds for women with breast cancer who score positive on all three tests, versus women without breast cancer who score positive on all three tests, will equal:

1 × 25 × 18 × 7 : 99 × 3 × 1 × 2 = 3150 : 594.

To recover the probability from the odds, we just write:

3150/(3150 + 594) = 84% .

This always works regardless of how the odds ratios are written; i.e., 8.33:1 is just the same as 25:3 or 75:9. It doesn’t matter in what order the tests are administered, or in what order the results are computed. The proof is left as an exercise for the reader.
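The odds bookkeeping can be sketched in Python with exact integers, so no rounding creeps in. The variable names are illustrative; the odds and ratios are the ones listed above.

```python
from fractions import Fraction

prior_odds = (1, 99)                          # cancer : no cancer
likelihood_ratios = [(25, 3), (18, 1), (7, 2)]  # tests A, B, C

for_cancer, against = prior_odds
for numerator, denominator in likelihood_ratios:
    for_cancer *= numerator      # multiply the left side of the odds
    against *= denominator       # multiply the right side of the odds

# Odds of x:y correspond to a probability of x / (x + y).
probability = Fraction(for_cancer, for_cancer + against)

print(f"{for_cancer}:{against} -> {float(probability):.0%}")
```

Because the update is plain multiplication, shuffling the list of likelihood ratios cannot change the result, which is one way to see why test order doesn't matter.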


E. T. Jaynes, in Probability Theory With Applications in Science and Engineering, suggests that credibility and evidence should be measured in decibels.[5]

Decibels?

Decibels are used for measuring exponential differences of intensity. For example, if the sound from an automobile horn carries 10,000 times as much energy (per square meter per second) as the sound from an alarm clock, the automobile horn would be 40 decibels louder. The sound of a bird singing might carry 1,000 times less energy than an alarm clock, and hence would be 30 decibels softer. To get the number of decibels, you take the logarithm base 10 and multiply by 10:

decibels = 10log_10(intensity)

intensity = 10^(decibels/10) .

Suppose we start with a prior probability of 1% that a woman has breast cancer, corresponding to an odds ratio of 1:99. And then we administer three tests of likelihood ratios 25:3, 18:1, and 7:2. You could multiply those numbers . . . or you could just add their logarithms:

10log_10(1/99) ≈ −20

10log_10(25/3) ≈ 9

10log_10(18/1) ≈ 13

10log_10(7/2) ≈ 5 .

It starts out as fairly unlikely that a woman has breast cancer—our credibility level is at −20 decibels. Then three test results come in, corresponding to 9, 13, and 5 decibels of evidence. This raises the credibility level by a total of 27 decibels, meaning that the prior credibility of −20 decibels goes to a posterior credibility of 7 decibels. So the odds go from 1:99 to 5:1, and the probability goes from 1% to around 83%.
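Here is the same calculation as a short Python sketch, with helper names of my own choosing. It keeps the unrounded decibel values, so the total lands at about 7.2 decibels and the probability at about 84%, agreeing with the odds calculation; the 83% above comes from rounding each term to a whole decibel.

```python
import math


def to_decibels(odds):
    """Convert an odds or likelihood ratio into decibels."""
    return 10 * math.log10(odds)


def from_decibels(db):
    """Convert decibels back into an odds ratio."""
    return 10 ** (db / 10)


prior_db = to_decibels(1 / 99)
evidence_db = [to_decibels(25 / 3), to_decibels(18 / 1), to_decibels(7 / 2)]

# Adding logarithms replaces multiplying odds.
posterior_db = prior_db + sum(evidence_db)
posterior_odds = from_decibels(posterior_db)
posterior_probability = posterior_odds / (1 + posterior_odds)

print(f"prior {prior_db:.1f} dB, evidence {sum(evidence_db):.1f} dB, "
      f"posterior {posterior_db:.1f} dB, probability {posterior_probability:.1%}")
```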


You are a mechanic for gizmos. When a gizmo stops working, it is due to a blocked hose 30% of the time. If a gizmo’s hose is blocked, there is a 45% probability that prodding the gizmo will produce sparks. If a gizmo’s hose is unblocked, there is only a 5% chance that prodding the gizmo will produce sparks. A customer brings you a malfunctioning gizmo. You prod the gizmo and find that it produces sparks. What is the probability that a spark-producing gizmo has a blocked hose?

What is the sequence of arithmetical operations that you performed to solve this problem?

(45% × 30%)/(45% × 30% + 5% × 70%)
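Spelled out step by step in Python, with variable names that are mine rather than part of the problem, the calculation looks like this:

```python
p_blocked = 0.30                   # prior: the hose is blocked
p_sparks_given_blocked = 0.45
p_sparks_given_unblocked = 0.05

# Fraction of malfunctioning gizmos that are blocked and spark when prodded.
blocked_and_sparks = p_sparks_given_blocked * p_blocked
# Fraction that are unblocked and still spark when prodded.
unblocked_and_sparks = p_sparks_given_unblocked * (1 - p_blocked)

p_blocked_given_sparks = blocked_and_sparks / (blocked_and_sparks + unblocked_and_sparks)

print(f"{p_blocked_given_sparks:.1%}")
```

The result is a little over 79%.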

Similarly, to find the chance that a woman with a positive mammogram has breast cancer, we computed:

[P (positive|cancer) × P (cancer)] / [P (positive|cancer) × P (cancer) + P (positive|¬cancer) × P (¬cancer)]

which is

P (positive, cancer) / [P (positive, cancer) + P (positive, ¬cancer)]

which is

P (positive, cancer) / P (positive)

which is

P(cancer|positive) .

The fully general form of this calculation is known as Bayes’s Theorem or Bayes’s Rule.

Bayes’s Theorem:

P(A|X) = [P(X|A) × P(A)] / [P(X|A) × P(A) + P(X|¬A) × P(¬A)]

When there is some phenomenon A that we want to investigate, and an observation X that is evidence about A—for example, in the previous example, A is breast cancer and X is a positive mammogram—Bayes’s Theorem tells us how we should update our probability of A, given the new evidence X.

By this point, Bayes’s Theorem may seem blatantly obvious or even tautological, rather than exciting and new. If so, this introduction has entirely succeeded in its purpose.


Bayes’s Theorem describes what makes something “evidence” and how much evidence it is. Statistical models are judged by comparison to the Bayesian method because, in statistics, the Bayesian method is as good as it gets—the Bayesian method defines the maximum amount of mileage you can get out of a given piece of evidence, in the same way that thermodynamics defines the maximum amount of work you can get out of a temperature differential. This is why you hear cognitive scientists talking about Bayesian reasoners. In cognitive science, Bayesian reasoner is the technically precise code word that we use to mean rational mind.

There are also a number of general heuristics about human reasoning that you can learn from looking at Bayes’s Theorem.

For example, in many discussions of Bayes’s Theorem, you may hear cognitive psychologists saying that people do not take prior frequencies sufficiently into account, meaning that when people approach a problem where there’s some evidence X indicating that condition A might hold true, they tend to judge A’s likelihood solely by how well the evidence X seems to match A, without taking into account the prior frequency of A. If you think, for example, that under the mammography example, the woman’s chance of having breast cancer is in the range of 70%–80%, then this kind of reasoning is insensitive to the prior frequency given in the problem; it doesn’t notice whether 1% of women or 10% of women start out having breast cancer. “Pay more attention to the prior frequency!” is one of the many things that humans need to bear in mind to partially compensate for our built-in inadequacies.

A related error is to pay too much attention to P(X|A) and not enough to P(X|¬A) when determining how much evidence X is for A. The degree to which a result X is evidence for A depends not only on the strength of the statement we’d expect to see result X if A were true, but also on the strength of the statement we wouldn’t expect to see result X if A weren’t true. For example, if it is raining, this very strongly implies the grass is wet, P(wetgrass|rain) ≈ 1, but seeing that the grass is wet doesn’t necessarily mean that it has just rained; perhaps the sprinkler was turned on, or you’re looking at the early morning dew. Since P(wetgrass|¬rain) is substantially greater than zero, P(rain|wetgrass) is substantially less than one. On the other hand, if the grass was never wet when it wasn’t raining, then knowing that the grass was wet would always show that it was raining, P(rain|wetgrass) ≈ 1, even if P(wetgrass|rain) = 50%; that is, even if the grass only got wet 50% of the times it rained. Evidence is always the result of the differential between the two conditional probabilities. Strong evidence is not the product of a very high probability that A leads to X, but the product of a very low probability that not-A could have led to X.
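The two wet-grass cases can be sketched numerically. The prior probability of rain (20%) and the value P(wetgrass|¬rain) = 30% in the first case are made-up numbers chosen only for illustration; the 50% and 0% in the second case follow the text above.

```python
def p_rain_given_wet(prior_rain, p_wet_given_rain, p_wet_given_dry):
    """Return P(rain | wetgrass) using Bayes's Theorem."""
    wet_and_rain = p_wet_given_rain * prior_rain
    wet_and_dry = p_wet_given_dry * (1.0 - prior_rain)
    return wet_and_rain / (wet_and_rain + wet_and_dry)

PRIOR_RAIN = 0.20  # hypothetical prior, not from the essay

# Case 1: rain almost always wets the grass, but sprinklers and dew do too.
sprinkler_world = p_rain_given_wet(PRIOR_RAIN, 0.99, 0.30)

# Case 2: grass gets wet only half the times it rains, and never otherwise.
no_sprinkler_world = p_rain_given_wet(PRIOR_RAIN, 0.50, 0.0)

print(f"P(rain|wetgrass), wet grass has other causes: {sprinkler_world:.3f}")
print(f"P(rain|wetgrass), wet grass only from rain:   {no_sprinkler_world:.3f}")
```

In the first case the posterior stays well below one even though P(wetgrass|rain) is nearly 1. In the second it reaches 1 even though P(wetgrass|rain) is only 50%.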

The Bayesian revolution in the sciences is fueled, not only by more and more cognitive scientists suddenly noticing that mental phenomena have Bayesian structure in them; not only by scientists in every field learning to judge their statistical methods by comparison with the Bayesian method; but also by the idea that science itself is a special case of Bayes’s Theorem; experimental evidence is Bayesian evidence. The Bayesian revolutionaries hold that when you perform an experiment and get evidence that “confirms” or “disconfirms” your theory, this confirmation and disconfirmation is governed by the Bayesian rules. For example, you have to take into account not only whether your theory predicts the phenomenon, but whether other possible explanations also predict the phenomenon.

Previously, the most popular philosophy of science was probably Karl Popper’s falsificationism; this is the old philosophy that the Bayesian revolution is currently dethroning. Karl Popper’s idea that theories can be definitely falsified, but never definitely confirmed, is yet another special case of the Bayesian rules; if P(X|A) ≈ 1 (if the theory makes a definite prediction), then observing ¬X very strongly falsifies A. On the other hand, if P(X|A) ≈ 1, and we observe X, this doesn’t definitely confirm the theory; there might be some other condition B such that P(X|B) ≈ 1, in which case observing X doesn’t favor A over B. For observing X to definitely confirm A, we would have to know, not that P(X|A) ≈ 1, but that P(X|¬A) ≈ 0, which is something that we can’t know because we can’t range over all possible alternative explanations. For example, when Einstein’s theory of General Relativity toppled Newton’s incredibly well-confirmed theory of gravity, it turned out that all of Newton’s predictions were just a special case of Einstein’s predictions.

You can even formalize Popper’s philosophy mathematically. The likelihood ratio for X, the quantity P(X|A)/P(X|¬A), determines how much observing X slides the probability for A; the likelihood ratio is what says how strong X is as evidence. Well, in your theory A, you can predict X with probability 1, if you like; but you can’t control the denominator of the likelihood ratio, P(X|¬A)—there will always be some alternative theories that also predict X, and while we go with the simplest theory that fits the current evidence, you may someday encounter some evidence that an alternative theory predicts but your theory does not. That’s the hidden gotcha that toppled Newton’s theory of gravity. So there’s a limit on how much mileage you can get from successful predictions; there’s a limit on how high the likelihood ratio goes for confirmatory evidence.
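The way the likelihood ratio “slides” the probability is easiest to see in odds form: posterior odds equal prior odds times the likelihood ratio. The sketch below assumes even prior odds and a hypothetical P(X|¬A) of 50%; neither number comes from the essay.

```python
def update_odds(prior_odds, p_x_given_a, p_x_given_not_a):
    """Multiply prior odds for A by the likelihood ratio P(X|A)/P(X|not-A)."""
    return prior_odds * (p_x_given_a / p_x_given_not_a)

def odds_to_probability(odds):
    """Convert odds in favor of A into a probability of A."""
    return odds / (1.0 + odds)

# Theory A predicts X with certainty, but rival theories predict X half the time.
prior_odds = 1.0  # hypothetical 1:1 prior odds
posterior_odds = update_odds(prior_odds, p_x_given_a=1.0, p_x_given_not_a=0.5)

print(f"likelihood ratio: {posterior_odds / prior_odds:.1f}")
print(f"posterior probability of A: {odds_to_probability(posterior_odds):.3f}")
```

Even with P(X|A) at its maximum of 1, the likelihood ratio cannot exceed 1/P(X|¬A), which is the limit on confirmation described above.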

On the other hand, if you encounter some piece of evidence Y that is definitely not predicted by your theory, this is enormously strong evidence against your theory. If P(Y|A) is infinitesimal, then the likelihood ratio will also be infinitesimal. For example, if P(Y|A) is 0.0001%, and P(Y|¬A) is 1%, then the likelihood ratio P(Y|A)/P(Y|¬A) will be 1:10,000. That’s −40 decibels of evidence! Or, flipping the likelihood ratio, if P(Y|A) is very small, then P(Y|¬A)/P(Y|A) will be very large, meaning that observing Y greatly favors ¬A over A. Falsification is much stronger than confirmation. This is a consequence of the earlier point that very strong evidence is not the product of a very high probability that A leads to X, but the product of a very low probability that not-A could have led to X. This is the precise Bayesian rule that underlies the heuristic value of Popper’s falsificationism.
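The decibel figure can be checked directly, assuming the convention of ten times the base-10 logarithm of the likelihood ratio. The helper name `evidence_db` is illustrative.

```python
import math

def evidence_db(p_given_a, p_given_not_a):
    """Evidence for A, in decibels: 10 * log10 of the likelihood ratio."""
    return 10.0 * math.log10(p_given_a / p_given_not_a)

P_Y_GIVEN_A = 0.000001   # 0.0001%
P_Y_GIVEN_NOT_A = 0.01   # 1%

against_a = evidence_db(P_Y_GIVEN_A, P_Y_GIVEN_NOT_A)      # about -40
for_not_a = evidence_db(P_Y_GIVEN_NOT_A, P_Y_GIVEN_A)      # about +40

print(f"evidence for A after seeing Y:     {against_a:.1f} dB")
print(f"evidence for not-A after seeing Y: {for_not_a:.1f} dB")
```

Flipping the ratio only flips the sign, which is why the same observation counts as strong evidence against A and equally strong evidence for ¬A.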

Similarly, Popper’s dictum that an idea must be falsifiable can be interpreted as a manifestation of the Bayesian conservation-of-probability rule; if a result X is positive evidence for the theory, then the result ¬X would have disconfirmed the theory to some extent. If you try to interpret both X and ¬X as “confirming” the theory, the Bayesian rules say this is impossible! To increase the probability of a theory you must expose it to tests that can potentially decrease its probability; this is not just a rule for detecting would-be cheaters in the social process of science, but a consequence of Bayesian probability theory. On the other hand, Popper’s idea that there is only falsification and no such thing as confirmation turns out to be incorrect. Bayes’s Theorem shows that falsification is very strong evidence compared to confirmation, but falsification is still probabilistic in nature; it is not governed by fundamentally different rules from confirmation, as Popper argued.
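The conservation-of-probability rule can be checked numerically: the prior P(A) is always the average of P(A|X) and P(A|¬X), weighted by P(X) and P(¬X). The three input probabilities below are hypothetical, chosen only to make the check concrete.

```python
def both_posteriors(prior, p_x_given_a, p_x_given_not_a):
    """Return (P(X), P(A|X), P(A|not-X)) for the given prior and likelihoods."""
    p_x = p_x_given_a * prior + p_x_given_not_a * (1.0 - prior)
    p_a_given_x = p_x_given_a * prior / p_x
    p_a_given_not_x = (1.0 - p_x_given_a) * prior / (1.0 - p_x)
    return p_x, p_a_given_x, p_a_given_not_x

PRIOR = 0.30  # hypothetical prior for theory A
p_x, p_a_given_x, p_a_given_not_x = both_posteriors(PRIOR, 0.90, 0.20)

# Expected posterior, averaged over the two possible outcomes of the test.
expected_posterior = p_a_given_x * p_x + p_a_given_not_x * (1.0 - p_x)

print(f"P(A|X)     = {p_a_given_x:.3f}")
print(f"P(A|not-X) = {p_a_given_not_x:.3f}")
print(f"weighted average = {expected_posterior:.3f} (prior was {PRIOR:.3f})")
```

Because the weighted average must equal the prior, one outcome can raise the probability of A only if the other outcome would have lowered it.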

So we find that many phenomena in the cognitive sciences, plus the statistical methods used by scientists, plus the scientific method itself, are all turning out to be special cases of Bayes’s Theorem. Hence the Bayesian revolution.


Having introduced Bayes’s Theorem explicitly, we can explicitly discuss its components.

P(A|X) = [P(X|A) × P(A)] / [ P(X|A) × P(A) + P(X|¬A) × P(¬A) ]

We’ll start with P(A|X). If you ever find yourself getting confused about what’s A and what’s X in Bayes’s Theorem, start with P(A|X) on the left side of the equation; that’s the simplest part to interpret. In P(A|X), A is the thing we want to know about. X is how we’re observing it; X is the evidence we’re using to make inferences about A. Remember that for every expression P(Q|P), we want to know about the probability for Q given P, the degree to which P implies Q; a more sensible notation, which it is now too late to adopt, would be P(Q ← P).

P(Q|P) is closely related to P(Q, P), but they are not identical. Expressed as a probability or a fraction, P(Q, P) is the proportion of things that have property Q and property P among all things; e.g., the proportion of “women with breast cancer and a positive mammogram” within the group of all women. If the total number of women is 10,000, and 80 women have breast cancer and a positive mammogram, then P(Q, P) is 80/10,000 = 0.8%. You might say that the absolute quantity, 80, is being normalized to a probability relative to the group of all women. Or to make it clearer, suppose that there’s a group of 641 women with breast cancer and a positive mammogram within a total sample group of 89,031 women. Six hundred and forty-one is the absolute quantity. If you pick out a random woman from the entire sample, then the probability you’ll pick a woman with breast cancer and a positive mammogram is P(Q, P), or 0.72% (in this example).

On the other hand, P(Q|P) is the proportion of things that have property Q and property P among all things that have P; e.g., the proportion of women with breast cancer and a positive mammogram within the group of all women with positive mammograms. If there are 641 women with breast cancer and positive mammograms, 7,915 women with positive mammograms, and 89,031 women, then P(Q, P) is the probability of getting one of those 641 women if you’re picking at random from the entire group of 89,031, while P(Q|P) is the probability of getting one of those 641 women if you’re picking at random from the smaller group of 7,915.
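The difference between the two quantities comes down to which group you divide by. A short sketch using the counts from the example above (the variable names are illustrative):

```python
CANCER_AND_POSITIVE = 641   # women with breast cancer and a positive mammogram
ALL_POSITIVE = 7_915        # women with a positive mammogram
ALL_WOMEN = 89_031          # the entire sample

# P(Q, P): pick at random from the entire sample.
p_joint = CANCER_AND_POSITIVE / ALL_WOMEN

# P(Q|P): pick at random from the women with positive mammograms only.
p_conditional = CANCER_AND_POSITIVE / ALL_POSITIVE

# The two are linked by P(Q|P) = P(Q, P) / P(P).
p_positive = ALL_POSITIVE / ALL_WOMEN

print(f"P(Q, P) = {p_joint:.2%}")
print(f"P(Q|P)  = {p_conditional:.2%}")
print(f"P(Q, P) / P(P) = {p_joint / p_positive:.2%}")
```

The same 641 women make up about 0.72% of the whole sample but about 8.1% of the women with positive mammograms.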

In a sense, P(Q|P) really means P(Q, P|P), but specifying the extra P all the time would be redundant. You already know it has property P, so the property you’re investigating is Q, even though you’re looking at the size of group (Q, P) within group P, not the size of group Q within group P (which would be nonsense). This is what it means to take the property on the right-hand side as given; it means you know you’re working only within the group of things that have property P. When you constrict your focus of attention to see only this smaller group, many other probabilities change. If you’re taking P as given, then P(Q, P) equals just P(Q), at least relative to the group P. The old P(Q), the frequency of “things that have property Q within the entire sample,” is revised to the new frequency of “things that have property Q within the subsample of things that have property P.” If P is given, if P is our entire world, then looking for (Q, P) is the same as looking for just Q.

If you constrict your focus of attention to only the population of eggs that are painted blue, then suddenly “the probability that an egg contains a pearl” becomes a different number; this proportion is different for the population of blue eggs than the population of all eggs. The given, the property that constricts our focus of attention, is always on the right side of P(Q|P); the P becomes our world, the entire thing we see, and on the other side of the “given,” P always has probability 1; that is what it means to take P as given. So P(Q|P) means “If P has probability 1, what is the probability of Q?” or “If we constrict our attention to only things or events where P is true, what is the probability of Q?” The statement Q, on the other side of the given, is not certain; its probability may be 10% or 90% or any other number. So when you use Bayes’s Theorem, and you write the part on the left side as P(A|X) (how to update the probability of A after seeing X, the new probability of A given that we know X, the degree to which X implies A), you can tell that X is always the observation or the evidence, and A is the property being investigated, the thing you want to know about.
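Constricting attention to the blue eggs can be sketched with exact fractions. The figures are assumed from the egg problem earlier in this essay: 40% of eggs contain pearls, 30% of pearl eggs are painted blue, and 10% of empty eggs are painted blue.

```python
from fractions import Fraction

# Assumed figures from the earlier egg problem.
P_PEARL = Fraction(40, 100)
P_BLUE_GIVEN_PEARL = Fraction(30, 100)
P_BLUE_GIVEN_EMPTY = Fraction(10, 100)

# Shares of ALL eggs that are blue with a pearl, and blue without one.
blue_and_pearl = P_BLUE_GIVEN_PEARL * P_PEARL
blue_and_empty = P_BLUE_GIVEN_EMPTY * (1 - P_PEARL)

# Restrict the world to blue eggs: blue now has probability 1.
p_blue = blue_and_pearl + blue_and_empty
p_pearl_given_blue = blue_and_pearl / p_blue

print(f"P(pearl) among all eggs:   {P_PEARL}")
print(f"P(pearl) among blue eggs:  {p_pearl_given_blue}")
```

Within the world of blue eggs, “contains a pearl” goes from 2/5 to 2/3, while the given property, blue, is simply taken as certain.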


The right side of Bayes’s Theorem is derived from the left side through these steps:

P(A|X) = P(A|X)

P(A|X) = P(X,A) / P(X)

P(A|X) = P(X,A) / [P(X,A) + P(X,¬A)]

P(A|X) = [P(X|A) × P(A)] / [P(X|A) × P(A) + P(X|¬A) × P(¬A)]
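Each step of the derivation can be checked against a table of counts. The sketch below uses a hypothetical group of 10,000 women: the 80 women with cancer and a positive mammogram match the earlier example, and the other three cells are assumed values roughly consistent with the mammography problem.

```python
from fractions import Fraction

# Hypothetical counts per 10,000 women (A = cancer, X = positive mammogram).
A_AND_X = 80           # cancer, positive
A_AND_NOT_X = 20       # cancer, negative
NOT_A_AND_X = 950      # no cancer, positive
NOT_A_AND_NOT_X = 8950 # no cancer, negative
TOTAL = A_AND_X + A_AND_NOT_X + NOT_A_AND_X + NOT_A_AND_NOT_X

# Step 2: P(A|X) = P(X,A) / P(X)
p_x_a = Fraction(A_AND_X, TOTAL)
p_x = Fraction(A_AND_X + NOT_A_AND_X, TOTAL)
step2 = p_x_a / p_x

# Step 3: P(X) = P(X,A) + P(X,not-A)
p_x_not_a = Fraction(NOT_A_AND_X, TOTAL)
step3 = p_x_a / (p_x_a + p_x_not_a)

# Step 4: each joint probability is a conditional times a prior.
p_a = Fraction(A_AND_X + A_AND_NOT_X, TOTAL)
p_not_a = 1 - p_a
p_x_given_a = Fraction(A_AND_X, A_AND_X + A_AND_NOT_X)
p_x_given_not_a = Fraction(NOT_A_AND_X, NOT_A_AND_X + NOT_A_AND_NOT_X)
step4 = (p_x_given_a * p_a) / (p_x_given_a * p_a + p_x_given_not_a * p_not_a)

print(step2, step3, step4)
```

All three expressions reduce to the same fraction, 8/103, which is just the 80 women with cancer out of the 1,030 women with positive mammograms.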

Once the derivation is finished, all the implications on the right side of the equation are of the form P(X|A) or P(X|¬A), while the implication on the left side is P(A|X). The symmetry arises because the elementary causal relations are generally implications from facts to observations, e.g., from breast cancer to positive mammogram. The elementary steps in reasoning are generally implications from observations to facts, e.g., from a positive mammogram to breast cancer. The left side of Bayes’s Theorem is an elementary inferential step from the observation of positive mammogram to the conclusion of an increased probability of breast cancer. Implication is written right-to-left, so we write P(cancer|positive) on the left side of the equation. The right side of Bayes’s Theorem describes the elementary causal steps (for example, from breast cancer to a positive mammogram), and so the implications on the right side of Bayes’s Theorem take the form P(positive|cancer) or P(positive|¬cancer).

And that’s Bayes’s Theorem. Rational inference on the left end, physical causality on the right end; an equation with mind on one side and reality on the other. Remember how the scientific method turned out to be a special case of Bayes’s Theorem? If you wanted to put it poetically, you could say that Bayes’s Theorem binds reasoning into the physical universe.

Okay, we’re done.


Reverend Bayes says:

You are now an initiate of the Bayesian Conspiracy.


1. Ward Casscells, Arno Schoenberger, and Thomas Graboys, “Interpretation by Physicians of Clinical Laboratory Results,” New England Journal of Medicine 299 (1978): 999–1001.
2. David M. Eddy, “Probabilistic Reasoning in Clinical Medicine: Problems and Opportunities,” in Judgement Under Uncertainty: Heuristics and Biases, ed. Daniel Kahneman, Paul Slovic, and Amos Tversky (Cambridge University Press, 1982).
3. Gerd Gigerenzer and Ulrich Hoffrage, “How to Improve Bayesian Reasoning without Instruction: Frequency Formats,” Psychological Review 102 (1995): 684–704.
4. Ibid.
5. Edwin T. Jaynes, “Probability Theory, with Applications in Science and Engineering,” Unpublished manuscript (1974).

The first publication of this post is here.

10 comments

Comments sorted by oldest first, as this post is from before comment nesting was available (around 2009-02-27).

comment by Erhannis · 2020-09-02T14:34:43.535Z · LW(p) · GW(p)

"When you are finished reading this, you will see Bayesian problems in your dreams."

Whaddaya know; he was right.

Also, yes the other version (on Arbital) is better, with more information - though this one has a point or two that aren't in the other version, like the discussion of degrees of freedom.

comment by Majd · 2021-05-29T03:25:50.324Z · LW(p) · GW(p)

Beautiful Eliezer. Written 18 years ago when I was just 6.

comment by Charles Paul (charles-paul) · 2021-06-24T03:08:14.328Z · LW(p) · GW(p)

I loved the article, the only thing is: would it be possible to move it to the beginning of the sequences? I think it would really help people to better understand things if they started out understanding Bayes

comment by Blacknsilver · 2021-08-29T11:16:47.777Z · LW(p) · GW(p)

I would consider myself smarter than at least 95% of people and I couldn't complete the egg problem even with a piece of paper, much less without. I think Eliezer massively overestimates the ability of the average person to do mental math.

Replies from: Eliyoole
comment by Elias (Eliyoole) · 2022-02-15T17:12:51.894Z · LW(p) · GW(p)

Well, intelligence doesn't equate to skills. It's probably easier to acquire skills (like mental maths) with high intelligence, but no matter the intelligence, you still need to learn it.

P(easy learning | high intelligence) may be higher than P(easy learning | not high intelligence) for a given subject (e.g., mental math), but P(mental math) is not dependent on the ease of learning [P(mental math | no easy learning) would be low] but rather on actually learning/training it: P(mental math | no learning) is pretty low.

So people who learn mental math may have different speeds or difficulty doing so, however I would guess that it is more dependent on educational context, curiosity, or need(,...), rather than ease of learning.

But, if your self-assessment is correct, and my mentioned assumptions are as well, you should be able to remedy the predicament relatively easily ;)

Replies from: None
comment by [deleted] · 2023-03-28T15:18:42.636Z · LW(p) · GW(p)

The individual calculations of the egg question seem within the realm of what most people can do to me. It's possible others might have trouble keeping multiple numbers in mind at once, though.
- multiply 30% with 40% (which can be simplified in a number of ways)
- multiply 10% with 60%
- add 6% + 12%
- divide 12 by 18

"It's probably easier to acquire skills (like mental maths) with high intelligence, but no matter the intelligence, you still need to learn it."

Also, I'm not sure this is fully true. Personally, learning math was enough to automatically start doing it mentally, and I never needed to specifically try to learn it.

comment by tmercer · 2022-06-05T18:30:00.975Z · LW(p) · GW(p)

"Popper’s idea that there is only falsification and no such thing as confirmation turns out to be incorrect."

No, that is still correct. Even if you only probabilistically falsify hypotheses, you can only FAIL TO DISCONFIRM/FALSIFY hypotheses. THERE IS ABSOLUTELY NO SUCH THING AS CONFIRMATION.

You can NEVER, EVER get more certain about something just because you fail to get less certain of it, UNLESS there are a set of MECE hypotheses, in which case getting more certain one is wrong IS getting more certain the other(s) are right.

Replies from: gwern
comment by gwern · 2022-06-05T19:30:08.244Z · LW(p) · GW(p)

Since you don't seem to be getting it from the downvotes but are still leaving comments, I will be blunt: your comments are bad because they consist purely of lazy low-effort assertions that add nothing to anything and convey an attitude of uninterest in becoming less wrong because you are certain you are already wholly right. No one here cares who you are such that your mere assertion is interesting, nor does anyone care what you think "is bullshit", nor is capslock cruise control for cool. You aren't even making particularly novel assertions overall - eg. most people on LW interested in the topic think ego depletion is pretty dubious at this point (as you would know if you had bothered to take a few seconds to search the site instead of reading 13 year old posts).

Replies from: tmercer
comment by tmercer · 2022-06-27T23:22:30.827Z · LW(p) · GW(p)

LOL, I don't care about downvotes, and I capitalize for emphasis, not to appear "cool" to accounts in cyberspace. I don't care if you think my comments are "bad". My comments are either right or wrong. It doesn't matter whether you or anyone thinks they're good/bad. It doesn't matter whether you or anyone upvotes/downvotes them.

My assertions are CORRECT, and counter to Eliezer. That's reason enough to make them.

Not sure why you're mentioning ego depletion, but it's not "dubious", it's something without any substantial evidence it exists. The existence of the celestial teapot isn't "dubious", just like the existence of pseudoscience like willpower depletion isn't "dubious". We (rational minds) have a duty to not even consider not-real things until we get sufficient evidence.

I'm re-reading the whole site. That a post is 13 years old is irrelevant. All the assertions in posts, new and old, need to be true. We're on LW, after all.

comment by gilch · 2023-05-05T17:49:08.839Z · LW(p) · GW(p)

I think the unit should have been decibans, not decibels. The math works out the same, but I've seen some posts use one or the other unit, and I keep forgetting the names, which makes them hard to find. Other related units are the nat, ban, shannon, hartley, and dit.