# Looking for proof of conditional probability

post by DanielLC · 2011-07-28T02:24:00.286Z · score: 1 (7 votes) · LW · GW · Legacy · 33 comments

From what I understand, the Kolmogorov axioms make no mention of conditional probability. It is simply defined. If I really want to show how probability actually works, I'm not going to argue "by definition". Does anyone know a modified form that uses simpler axioms than P(A|B) = P(A∩B)/P(B)?

## 33 comments


Your question doesn't make any sense to me. I don't know what it means to "prove" a definition. Did you mean to ask for an (informal) *argument* that the concept is *useful*?

My confusion is compounded by the fact that I find P(A|B) = P(A∩B)/P(B) pretty self-explanatory. What seems to be the difficulty?

The definition is equivalent to having an axiom that states that P(A|B) = P(A∩B)/P(B). That's not that difficult a concept, but it's still more advanced than axioms tend to be. Compare it to the other three. It's like Euclid's fifth postulate.

But it's not an *axiom*; it's a *definition*.

It bothers me that you seem to be under the impression that the equation represents some kind of substantive *claim*. It doesn't; it's just the establishment of a shorthand notation. (It bothers me even more that other commenters don't seem to be noticing that you're suffering from a confusion about this.)

A reasonable question to ask might be: "why is the quantity P(A∩B)/P(B) interesting enough to be worth having a shorthand notation for?" But that isn't what you asked, and the answer wouldn't consist of a "proof", so despite its being the closest non-confused question to yours I'm not yet sure whether an attempt to answer it would be helpful to you.

If you simply view P(A|B) = P(A∩B)/P(B) as a shorthand, with "P(A|B)" as just an arbitrary symbol, then you're right - it needs no more explanation. But we don't consider P(A|B) to be just an arbitrary symbol - we think it has a specific *meaning*, which is "the probability of A given B". And we think that "P(A∩B)/P(B)" has been chosen to equal "P(A|B)" because it has the properties we feel "the probability of A given B" should have.

I *think* DanielLC is asking why it is specifically P(A∩B)/P(B), and not some other formula, that has been chosen to correspond with the intuitive notion of "the probability of A given B".

In that case, it's no wonder that I'm having trouble relating, because I didn't understand what "the probability of A given B" meant until somebody told me it was P(A∩B)/P(B).

There is a larger point here:

But we don't consider P(A|B) to be just an arbitrary symbol - we think it has a specific meaning, which is "the probability of A given B". And we think that "P(A∩B)/P(B)" has been chosen to equal "P(A|B)" because it has the properties we feel "the probability of A given B" should have.

In my opinion, an important part of learning to think mathematically is learning *not to think like this*. That is, not to think of symbols as having a mysterious "meaning" apart from their formal definitions.

This is what causes some people to have trouble accepting that 0.999.... = 1: they don't understand that the question of what 0.999.... "is" is simply a *matter of definition*, and not some mysterious empirical fact.

Paradoxically, this is a way in which my lack of "mathematical ability" is a kind of mathematical ability in its own right, because I often *don't have* these mysterious "intuitions" that other people seem to, and thus for me it tends to be second nature that the formal definition of something is what the thing *is*. For other people, I suppose, thinking this way is a kind of skill they have to consciously learn.

If I have a set of axioms, and I derive theorems from them, then anything that these axioms are true about, all the theorems are also true about. For example, suppose we took Euclid's first four postulates and derived a bunch of theorems from them. These postulates are true if you use them to describe figures on a plane, so the theorems are also true about those figures. This also works if it's on a sphere. It's not that a "point" means a spot on a plane, or two opposite spots on a sphere, it's just that the reasoning for abstract points applies to physical models.

Statistics isn't just those axioms. You might be able to find something else that those axioms apply to. If you do, every statistical theorem will also apply. It still wouldn't be statistics. Statistics is a specific application. P(A|B) represents something in this application. P(A|B) always equals P(A∩B)/P(B). We can find this out the same way we figured out that P(∅) always equals zero. It's just that the latter is more obvious than the former, and we may be able to derive the former from something else equally obvious.

That is, not to think of symbols as having a mysterious "meaning" apart from their formal definitions.

Pure formalism is useful for developing new math, but math cannot be applied to real problems without the assignment of meaning to the variables and equations. Most people are more interested in *using* math than in what amounts to intellectual play, as enjoyable and potentially useful as that can be. Note that I tend to be more of a formalist myself, which is why I mentioned in an old comment on HN that I tend to learn math concepts fairly easily, but have trouble applying them.

I find this set of answers being top rated quite disturbing to be honest.

There are several people in the same main thread pointing out that

a) There are ways to define it that would obviously violate basic intuition, so disconnecting it from intuition has limits.

b) There are intuitive approaches to the problem that may lead to a proof of it, which would make the whole argument unfounded.

I agree with the OP: simply defining a probability concept doesn't by itself map it to our intuitions about it. For example, if we defined P(A|B) = P(AB) / 2P(B), it wouldn't correspond to our intuitions, and here's why.

Intuitively, P(A|B) is the probability of A happening if we know that B already happened. In other words, the only elementary outcomes we now take into consideration are those that correspond to B. Of those remaining elementary outcomes, the only ones that can lead to A are those that lie in AB. Their measure *in absolute terms* is equal to P(AB); however, their measure *in relation to the elementary outcomes in B* is equal to P(AB)/P(B).

Thus, P(A|B) is P(A) as it would be if the only elementary outcomes in existence were those yielding B. P(B) here is a normalizing coefficient: if we were evaluating the conditional probability of A in relation to a set of exhaustive and mutually exclusive experimental outcomes, as it is done in Bayesian reasoning, dividing by P(B) means renormalizing the elementary outcome space after B is fixed.
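The renormalization reading can be checked on a small example. This is only a sketch: the four outcomes `w1`..`w4`, their weights, and the helpers `prob` and `renormalize` are made up for illustration. It restricts the outcome space to B, rescales, and compares the result with P(AB)/P(B); it also shows why the P(AB)/2P(B) alternative mentioned above feels wrong, since it would give P(B|B) = 1/2.

```python
from fractions import Fraction

# Hypothetical non-uniform distribution over four elementary outcomes.
weights = {
    "w1": Fraction(1, 2),
    "w2": Fraction(1, 4),
    "w3": Fraction(1, 8),
    "w4": Fraction(1, 8),
}

def prob(event, dist):
    """Total measure of the outcomes of `event` under `dist`."""
    return sum((p for w, p in dist.items() if w in event), Fraction(0))

def renormalize(dist, given):
    """Keep only the outcomes in `given` and rescale them to sum to 1."""
    total = prob(given, dist)
    return {w: p / total for w, p in dist.items() if w in given}

A = {"w1", "w2"}
B = {"w2", "w3", "w4"}

restricted = renormalize(weights, B)                    # the outcome space once B is fixed
p_a_in_restricted = prob(A, restricted)                 # P(A) computed inside that space
p_a_given_b = prob(A & B, weights) / prob(B, weights)   # P(AB)/P(B)

# The alternative "definition" P(AB)/2P(B), applied to A = B.
p_b_given_b_halved = prob(B, weights) / (2 * prob(B, weights))

print(p_a_in_restricted, p_a_given_b, p_b_given_b_halved)
```

Both routes give the same number for P(A|B), while the halved formula says B is only half certain once B is known.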

Basically, P(A|B) = 0 when A and B are disjoint, and P(A|C)/P(B|C) = P(A)/P(B) when A and B are subsets of C?

It's better, but it's still not that good. I have a sneaking suspicion that that's the best I can do, though.
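For what it's worth, here is one way those two properties might be turned into the usual formula. It is a sketch, not a settled answer: it additionally assumes that P(·|B) is additive over disjoint events, that P(B|B) = 1, and that P(B) > 0, none of which is stated in the comment above.

```latex
\begin{align*}
P(A \mid B) &= P(A \cap B \mid B) + P(A \setminus B \mid B)
  && \text{additivity (assumed)} \\
&= P(A \cap B \mid B)
  && \text{$A \setminus B$ and $B$ are disjoint} \\
&= \frac{P(A \cap B \mid B)}{P(B \mid B)}
  && P(B \mid B) = 1 \text{ (assumed)} \\
&= \frac{P(A \cap B)}{P(B)}
  && \text{ratio property with } C = B
\end{align*}
```

The last step uses the ratio property on the two sets A∩B and B, both of which are subsets of B.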

Now, a hopefully intuitive explanation of independent events.

By definition, A is independent from B if P(A|B) = P(A), or equivalently P(AB) = P(A)P(B). What does it mean in terms of measures?

It is easy to prove that if A is independent from B, then A is also independent from ~B: P(A|~B) = P(A ~B) / P(~B) = (P(A) - P(AB)) / (1 - P(B)) = (P(A) - P(A)P(B)) / (1 - P(B)) = P(A).

Therefore, A is independent from B iff P(A) = P(AB) / P(B) = P(A ~B) / P(~B), which implies that **P(AB) / P(A ~B) = P(B) / P(~B)**.

Geometrically, it means that A intersects B and ~B with subsets of measures proportionate to the measures of B and ~B. So if P(B) = 1/4, then 1/4 of A lies in B, and the remaining 3/4 in ~B. And if B and ~B are equally likely, then A lies in equal shares of both.

And from an information-theoretic perspective, this geometric interpretation means that knowing whether B or ~B happened gives us no information about the relative likelihood of A, since it will be equally likely to occur in the renormalized outcome space either way.
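A quick numerical check of the geometric picture, using a made-up uniform space of 12 outcomes (the sets `A` and `B` below are chosen by hand so that A is independent of B with P(B) = 1/4; they are not from the discussion above):

```python
from fractions import Fraction

omega = set(range(12))   # hypothetical uniform outcome space
B = {0, 1, 2}            # P(B) = 1/4
A = {0, 3, 4, 5}         # P(A) = 1/3, picked so that P(AB) = P(A)P(B)
not_B = omega - B

def prob(event):
    """Probability of an event under the uniform measure on omega."""
    return Fraction(len(event), len(omega))

def cond(event, given):
    """Conditional probability via P(event ∩ given) / P(given)."""
    return prob(event & given) / prob(given)

# Fraction of A's measure that lies inside B and inside ~B.
share_of_A_in_B = prob(A & B) / prob(A)
share_of_A_in_not_B = prob(A & not_B) / prob(A)

print(share_of_A_in_B, share_of_A_in_not_B, cond(A, B), cond(A, not_B))
```

Here a quarter of A lies in B and three quarters in ~B, matching P(B) and P(~B), and the conditional probability of A is the same on either side.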

I feel like independence really is just a definition, or at least something close to it. I guess P(A|B) = P(A|~B) might be better. Independence is just another way of saying that A is just as likely regardless of B.

P(A|B) = P(A|~B) is equivalent to the classic definition of independence, and intuitively it means that "whether B happens or not, it doesn't affect the likelihood of A happening".

I guess that since other basic probability concepts are defined in terms of set operations (union and intersection), and independence lacks a similar obvious explanation in terms of sets and measure, I wanted to find one.

and P(A|C)/P(B|C) = P(A)/P(B) when A and B are subsets of C?

When A is a subset of C, P(A∩C) = P(A), so P(A|C) = P(A)/P(C).

Syntactically, adding conditional probability doesn't do anything new, besides serving as short-hand for expressions that look like "P(A∩B)/P(B)".

The issue you're recognizing is that, semantically, conditional probabilities should mean something and behave coherently with the system of probability you've built up to this point.

This is not really a question of probability theory but instead a question of interpretations of probability. As such, it has a very philosophical feel and I'm not sure it's going to have the solid answers you're looking for.

Short answer: The Kolmogorov axioms are just mathematical. They have nothing inherently to do with the real world. P(A|B)=P(A∩B)/P(B) is the definition of P(A|B). There is a compelling argument that the beliefs of a rational agent should obey the Kolmogorov axioms, with P(A|B) corresponding to the degree of belief in A when B is known.

Long answer: I have a sequence of posts coming up.

There is a compelling argument that the beliefs of a rational agent should obey the Kolmogorov axioms, with P(A|B) corresponding to the degree of belief in A when B is known.

Are you thinking of this one, or something else?

I was thinking of the Dutch book argument others have mentioned. But I think you may have misunderstood my point. The original poster has summed up what I wanted to say better than I could:

If I have a set of axioms, and I derive theorems from them, then anything that these axioms are true about, all the theorems are also true about. For example, suppose we took Euclid's first four postulates and derived a bunch of theorems from them. These postulates are true if you use them to describe figures on a plane, so the theorems are also true about those figures. This also works if it's on a sphere. It's not that a "point" means a spot on a plane, or two opposite spots on a sphere, it's just that the reasoning for abstract points applies to physical models.

Statistics isn't just those axioms. You might be able to find something else that those axioms apply to. If you do, every statistical theorem will also apply. It still wouldn't be statistics. Statistics is a specific application. P(A|B) represents something in this application. P(A|B) always equals P(A∩B)/P(B). We can find this out the same way we figured out that P(∅) always equals zero. It's just that the latter is more obvious than the former, and we may be able to derive the former from something else equally obvious.

I agree with the first paragraph but the second seems confused. We want to show that P(A|B) defined as P(A∩B)/P(B) tells us how much weight to assign A given B. DanielLC seems to be looking for an *a priori* mathematical proof of this, but this is futile. We're trying to show that there is a correspondence between the laws of probability and something in the real world (the optimal beliefs of agents), so we have to mention properties of the real world in our arguments.

I think you need the common-sense axioms P(B|B) = 1 and P(A|all possibilities) = P(A). Given these, the Venn diagram explanation is pretty straightforward.

Umm...I don't know how rigorous this explanation is, but it might lead you in the right direction...because if you consider the Venn diagram with probability spaces A and B, the probability space of A within B is given by the overlap of the two circles, or P(A∩B). Then you get the probability of landing in that space out of all the space in B...as in, the probability that if you choose circle B, you land in the overlap between A and B.

That's probably not what you were looking for, but hope it helps.

I'm not sure if this is what you're looking for, but I believe one way you can derive it is via a Dutch book argument.

...which is done explicitly in Jay Kadane's free text, starting on page 29.

Snapping pointers: direct link. Commercial use forbidden (probably the reason for the pointer chain).

Ah, thanks. (It's done semi-explicitly right on the wiki page, though, or at least an effectively general example is set up and the form of the proof is described: the only way there wouldn't automatically be a solution to the equations that lets someone Dutch book you is if the system had a determinant of zero, and requiring that forces the standard rule for conditional probability.)
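To make the Dutch book idea concrete, here is a small sketch of one textbook-style construction. It is not necessarily the one used in Kadane's text or on the wiki page, and the function `agent_net_payoff` and its parameters are invented for illustration. The agent posts a price for a $1 bet on B, a $1 bet on A∩B, and a $1 bet on A that is called off (stake refunded) if B fails, and is assumed to be willing to take either side at those prices.

```python
from fractions import Fraction

OUTCOMES = ("A and B", "not-A and B", "not-B")

def agent_net_payoff(p_b, p_ab, q, outcome):
    """Agent's net gain from the bookie's package of bets in one outcome.

    p_b  : agent's price for a $1 bet on B
    p_ab : agent's price for a $1 bet on A∩B
    q    : agent's price for a $1 bet on A, refunded if B does not happen
    """
    b_happens = outcome != "not-B"
    ab_happens = outcome == "A and B"
    # If q is too high the agent buys the conditional bet;
    # if it is too low the bookie takes the opposite side of every bet.
    side = 1 if q > p_ab / p_b else -1

    conditional = (int(ab_happens) - q) if b_happens else 0  # buy the called-off bet
    plain_ab = p_ab - int(ab_happens)                        # sell the bet on A∩B
    hedge_b = q * (int(b_happens) - p_b)                     # buy q units of the bet on B
    return side * (conditional + plain_ab + hedge_b)

p_b, p_ab = Fraction(1, 2), Fraction(1, 4)   # coherent conditional price would be 1/2
for q in (Fraction(3, 4), Fraction(1, 4), Fraction(1, 2)):
    print(q, [agent_net_payoff(p_b, p_ab, q, o) for o in OUTCOMES])
```

In this construction the agent's net payoff works out to ±(P(A∩B) − q·P(B)) in every outcome, so any q other than P(A∩B)/P(B) is a guaranteed loss, and only the standard ratio leaves the bookie with nothing.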

Why is P(A∩B)/P(B) called conditional probability? Or, let's turn it the other way round (which is your question), why would conditional probability be given by P(A∩B)/P(B)? I think I was able to develop a proof, see below. Of course, double-checking by others would be required.

First, I would define conditional probability as the “Probability of A knowing that B occurs”, which is meaningful and I guess everybody would agree on (see also Wikipedia).

Starting from there, “Probability of A knowing that B occurs” means the probability of A in a restricted space where B is known to occur. This restricted space is simply B. Now, in this restricted space (forget about the larger space Ω for a minute), we want to know the probability of A. Well, the occurrence of A in B is given by A∩B. In case of equiprobability of elementary events, the probability of A∩B in B is card(A∩B)/card(B). Now, this can be rewritten as (card(A∩B)/card(Ω)) × (card(Ω)/card(B)) = P(A∩B)/P(B), where P is the probability in the original space Ω. The latter result is Kolmogorov’s definition of conditional probability!

Note that we can also think in terms of areas, writing that the probability of A∩B in B is the area of A∩B divided by the area of B: area(A∩B)/area(B), which can be tweaked the same way, and works without requiring equiprobability (and the probability space could also be non-discrete).
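The counting step can be checked by brute force in the equiprobable case. The two-dice setup and the particular events `A` and `B` below are just an illustrative choice, not part of the argument above:

```python
from fractions import Fraction
from itertools import product

omega = set(product(range(1, 7), repeat=2))        # 36 equiprobable rolls of two dice
A = {roll for roll in omega if sum(roll) == 8}      # the total is 8
B = {roll for roll in omega if roll[0] % 2 == 0}    # the first die is even

# Probability of A inside the restricted space B, by counting.
by_counting = Fraction(len(A & B), len(B))

# The same quantity via probabilities in the original space omega.
by_formula = Fraction(len(A & B), len(omega)) / Fraction(len(B), len(omega))

print(by_counting, by_formula)
```

The card(Ω) factors cancel, which is all the rewriting step amounts to.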

I'm sorry, the comments on this post all seem to miss the point. Bayes' Theorem can be proven from basic logic; look at places like the Khan Academy, or Lukeprog's Intuitive Explanation of Yudkowsky's Intuitive Explanation. Once you understand that, the Kolmogorov axioms will be obvious. It's not assumed,

Even Wikipedia notes that Cox's Theorem makes another approach possible -- that seems like the place to *start* looking if you want a mathematical proof. So I think Larks came close to the right question (though it may or may not address your concerns).

Cox and Jaynes show that we can start by requiring probability or the logic of uncertainty to have certain features. For example, our calculations should have a type of consistency such that it shouldn't matter to our final answer if we write P(A∩B) or P(B∩A). This, together with the other requirements, ultimately tells us that:

P(A∩B) = P(B)P(A|B) = P(A)P(B|A)

Which immediately gives us a possible justification for both the Kolmogorov definition and Bayes' Theorem.
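Spelled out, assuming P(B) > 0 so that the division is allowed, the two consequences of the product rule above would be:

```latex
\begin{align*}
P(A \cap B) &= P(B)\,P(A \mid B) = P(A)\,P(B \mid A) \\
P(A \mid B) &= \frac{P(A \cap B)}{P(B)}
  && \text{divide the first equality by } P(B) \\
P(A \mid B) &= \frac{P(A)\,P(B \mid A)}{P(B)}
  && \text{divide the second equality by } P(B)
\end{align*}
```

The second line has the form of the Kolmogorov definition and the third is Bayes' Theorem.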