Bayesian probability theory as extended logic -- a new result

post by ksvanhorn · 2017-07-06T19:14:32.163Z · LW · GW · Legacy · 40 comments

I have a new paper that strengthens the case for strong Bayesianism, a.k.a. One Magisterium Bayes. The paper is entitled "From propositional logic to plausible reasoning: a uniqueness theorem." (The preceding link will be good for a few weeks, after which only the preprint version will be available for free. I couldn't come up with the $2500 that Elsevier makes you pay to make your paper open-access.)

Some background: E. T. Jaynes took the position that (Bayesian) probability theory is an extension of propositional logic to handle degrees of certainty -- and appealed to Cox's Theorem to argue that probability theory is the only viable such extension, "the unique consistent rules for conducting inference (i.e. plausible reasoning) of any kind." This position is sometimes called strong Bayesianism. In a nutshell, frequentist statistics is fine for reasoning about frequencies of repeated events, but that's a very narrow class of questions; most of the time when researchers appeal to statistics, they want to know what they can conclude with what degree of certainty, and that is an epistemic question for which Bayesian statistics is the right tool, according to Cox's Theorem.

You can find a "guided tour" of Cox's Theorem here (see "Constructing a logic of plausible inference"). Here's a very brief summary. We write A | X for "the reasonable credibility" (plausibility) of proposition A when X is known to be true. Here X represents whatever information we have available. We are not at this point assuming that A | X is any sort of probability. A system of plausible reasoning is a set of rules for evaluating A | X. Cox proposed a handful of intuitively-appealing, qualitative requirements for any system of plausible reasoning, and showed that these requirements imply that any such system is just probability theory in disguise. That is, there necessarily exists an order-preserving isomorphism between plausibilities and probabilities such that A | X, after mapping from plausibilities to probabilities, respects the laws of probability.

Here is one (simplified and not 100% accurate) version of the assumptions required to obtain Cox's result:

 

  1. A | X is a real number.
  2. (A | X) = (B | X) whenever A and B are logically equivalent; furthermore, (A | X) ≤ (B | X) if B is a tautology (an expression that is logically true, such as (a or not a)).
  3. We can obtain (not A | X) from A | X via some non-increasing function S. That is, (not A | X) = S(A | X).
  4. We can obtain (A and B | X) from (B | X) and (A | B and X) via some continuous function F that is strictly increasing in both arguments: (A and B | X) = F((A | B and X), B | X).
  5. The set of triples (x,y,z) such that x = A|X, y = (B | A and X), and z = (C | A and B and X) for some proposition A, proposition B, proposition C, and state of information X, is dense. Loosely speaking, this means that if you give me any (x',y',z') in the appropriate range, I can find an (x,y,z) of the above form that is arbitrarily close to (x',y',z').
The "guided tour" mentioned above gives detailed rationales for all of these requirements.

Not everyone agrees that these assumptions are reasonable. My paper proposes an alternative set of assumptions that are intended to be less disputable, as every one of them is simply a requirement that some property already true of propositional logic continue to be true in our extended logic for plausible reasoning. Here are the alternative requirements:
  1. If X and Y are logically equivalent, and A and B are logically equivalent assuming X, then (A | X) = (B | Y).
  2. We may define a new propositional symbol s without affecting the plausibility of any proposition that does not mention that symbol. Specifically, if s is a propositional symbol not appearing in A, X, or E, then (A | X) = (A | (s ↔ E) and X).
  3. Adding irrelevant background information does not alter plausibilities. Specifically, if Y is a satisfiable propositional formula that uses no propositional symbol occurring in A or X, then (A | X) = (A | Y and X).
  4. The implication ordering is preserved: if A → B is a logical consequence of X, but B → A is not, then (A | X) < (B | X); that is, A is strictly less plausible than B, assuming X.
Note that I do not assume that A | X is a real number. Item 4 above assumes only that there is some partial ordering on plausibility values: in some cases we can say that one plausibility is greater than another.
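
As a sanity check, the four requirements can be spot-checked against the truth-table counting rule that the post arrives at further on (the ratio m / n). The sketch below is illustrative only and is not taken from the paper: the helper `ratio`, the example formulas, and the symbol names a, b, c, d, s are all assumptions, and passing on one example is of course not a proof.

```python
from fractions import Fraction
from itertools import product

def ratio(a, x, symbols):
    """Counting rule m / n: rows satisfying A and X over rows satisfying X."""
    rows = [dict(zip(symbols, v))
            for v in product([False, True], repeat=len(symbols))]
    n = sum(1 for r in rows if x(r))
    m = sum(1 for r in rows if x(r) and a(r))
    return Fraction(m, n)

# Hypothetical example: A = a, X = (a or b).
A = lambda r: r["a"]
X = lambda r: r["a"] or r["b"]
base = ratio(A, X, ["a", "b"])

# Requirement 2: define a new symbol s as (a and b); the value should not move.
X2 = lambda r: (r["s"] == (r["a"] and r["b"])) and X(r)
with_definition = ratio(A, X2, ["a", "b", "s"])

# Requirement 3: add satisfiable, unrelated information Y = (c or d).
X3 = lambda r: (r["c"] or r["d"]) and X(r)
with_irrelevant = ratio(A, X3, ["a", "b", "c", "d"])

# Requirement 4: (a and b) -> a follows from X, but a -> (a and b) does not,
# so (a and b) should come out strictly less plausible than a.
stronger = ratio(lambda r: r["a"] and r["b"], X, ["a", "b"])

print(base, with_definition, with_irrelevant, stronger)
```

In this example the first three values coincide and the fourth is strictly smaller, which is what requirements 2, 3, and 4 demand of any system of plausible reasoning.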

 

I also explicitly take the state of information X to be a propositional formula: all the background knowledge to which we have access is expressed in the form of logical statements. So, for example, if your background information is that you are tossing a six-sided die, you could express this by letting s1 mean "the die comes up 1," s2 mean "the die comes up 2," and so on, and your background information X would be a logical formula stating that exactly one of s1, ..., s6 is true, that is,

(s1 or s2 or s3 or s4 or s5 or s6) and
not (s1 and s2) and not (s1 and s3) and not (s1 and s4) and
not (s1 and s5) and not (s1 and s6) and not (s2 and s3) and
not (s2 and s4) and not (s2 and s5) and not (s2 and s6) and
not (s3 and s4) and not (s3 and s5) and not (s3 and s6) and
not (s4 and s5) and not (s4 and s6) and not (s5 and s6).
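
The "exactly one of s1, ..., s6" premise can also be generated and checked mechanically. The following is a minimal sketch, not something from the paper; `exactly_one` and the other names are illustrative. It builds the constraint as one "at least one" clause plus one "not both" clause for every pair, then enumerates all 64 truth assignments.

```python
from itertools import combinations, product

symbols = ["s1", "s2", "s3", "s4", "s5", "s6"]

def exactly_one(row):
    """Premise X: at least one face is up, and no two faces are up together."""
    at_least_one = any(row[s] for s in symbols)
    no_two = all(not (row[p] and row[q])
                 for p, q in combinations(symbols, 2))
    return at_least_one and no_two

# Number of pairwise "not (si and sj)" clauses in the formula.
pair_count = len(list(combinations(symbols, 2)))

# Enumerate every truth assignment and keep those satisfying X.
rows = [dict(zip(symbols, v)) for v in product([False, True], repeat=6)]
satisfying = [r for r in rows if exactly_one(r)]

print(pair_count, len(satisfying))
```

Only the six assignments with a single true symbol survive, one per face of the die, and the formula needs fifteen pairwise exclusion clauses.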

Just like Cox, I then show that there is an order-preserving isomorphism between plausibilities and probabilities that respects the laws of probability.

My result goes further, however, in that it gives actual numeric values for the probabilities. Imagine creating a truth table containing one row for each possible combination of truth values assigned to each atomic proposition appearing in either A or X. Let n be the number of rows in this table for which X evaluates true. Let m be the number of rows in this table for which both A and X evaluate true. If P is the function that maps plausibilities to probabilities, then P(A | X) = m / n.

For example, suppose that a and b are atomic propositions (not decomposable in terms of more primitive propositions), and suppose that we only know that at least one of them is true; what then is the probability that a is true? Start by enumerating all possible combinations of truth values for a and b:
  1. a false, b false: (a or b) is false, a is false.
  2. a false, b true : (a or b) is true,  a is false.
  3. a true,  b false: (a or b) is true,  a is true.
  4. a true,  b true : (a or b) is true,  a is true.
There are 3 cases (2, 3, and 4) in which (a or b) is true, and in 2 of these cases (3 and 4) a is also true. Therefore,

    P(a | a or b) = 2/3.
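
The same count can be reproduced by brute-force enumeration. This is a minimal sketch of the truth-table rule described above, with A and X represented as Python functions over a row of truth values; the function name `count_rows` and that representation are assumptions made for illustration, not the paper's notation.

```python
from fractions import Fraction
from itertools import product

def count_rows(a, x, symbols):
    """Return (m, n): rows where A and X both hold, and rows where X holds."""
    m = n = 0
    for values in product([False, True], repeat=len(symbols)):
        row = dict(zip(symbols, values))
        if x(row):
            n += 1
            if a(row):
                m += 1
    return m, n

# A = a, X = (a or b), over the atomic propositions a and b.
m, n = count_rows(lambda r: r["a"], lambda r: r["a"] or r["b"], ["a", "b"])
print(Fraction(m, n))  # 2/3
```

The enumeration is exponential in the number of atomic propositions, so it is only practical for small examples like this one.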

This concords with the classical definition of probability, which Laplace expressed as

The probability of an event is the ratio of the number of cases favorable to it, to the number of possible cases, when there is nothing to make us believe that one case should occur rather than any other, so that these cases are, for us, equally possible.

This definition fell out of favor, in part because of its apparent circularity. My result validates the classical definition and sharpens it. We can now say that a “possible case” is simply a truth assignment satisfying the premise X. We can simply drop the problematic phrase “these cases are, for us, equally possible.” The phrase “there is nothing to make us believe that one case should occur rather than any other” means that we possess no additional information that, if added to X, would expand by differing multiplicities the rows of the truth table for which X evaluates true.

For more details, see the paper linked above.

40 comments

Comments sorted by top scores.

comment by JohnnySullivan · 2017-07-28T20:11:13.310Z · LW(p) · GW(p)

Seems to be a typo here:

(s1 or s2 or s3 or s5 or s6) and not (s1 and s2) and not (s1 and s3) and not (s1 and s4) and not (s1 and s5) and not (s1 and s6) and not (s2 and s3) and not (s2 and s4) and not (s2 and s5) and not (s2 and s6) and not (s3 and s4) and not (s3 and s5) and not (s3 and s6) and not (s4 and s5) and not (s4 and s6) and not (s5 and s6).

I think you mean to add "or s4" on the first line:

(s1 or s2 or s3 or s4 or s5 or s6) and not (s1 and s2) and not (s1 and s3) and not (s1 and s4) and not (s1 and s5) and not (s1 and s6) and not (s2 and s3) and not (s2 and s4) and not (s2 and s5) and not (s2 and s6) and not (s3 and s4) and not (s3 and s5) and not (s3 and s6) and not (s4 and s5) and not (s4 and s6) and not (s5 and s6).

comment by MrMind · 2017-07-07T15:55:44.265Z · LW(p) · GW(p)

I'm working on extending probability to predicate calculus and your work will be very precious, thanks!

Replies from: ksvanhorn
comment by ksvanhorn · 2017-07-07T16:15:42.951Z · LW(p) · GW(p)

If you haven't already, I would suggest you read Carnap's book, The Logical Foundations of Probability (there's a PDF of it somewhere online). As I recall, he ran into some issues with universally quantified statements -- they end up having zero probability in his system.

Replies from: MrMind
comment by MrMind · 2017-07-19T15:27:09.013Z · LW(p) · GW(p)

As I recall, he ran into some issues with universally quantified statements -- they end up having zero probability in his system.

Cox's probability is essentially probability defined on a Boolean algebra (the Lindenbaum-Tarski algebra of propositional logic).
Kolmogorov's probability is probability defined on a sigma-complete Boolean algebra.
If I can show that quantifiers are related to sigma-completeness (quantifiers are adjunctions in the proper pair of categories, but I've yet to look into that), then I can probably lift the equivalence via the Loomis-Sikorski theorem back to the original algebras, and work out exactly when a Cox probability can be safely extended to predicate logic.
That's the dream, anyway.

Replies from: ksvanhorn
comment by ksvanhorn · 2017-07-20T19:29:17.838Z · LW(p) · GW(p)

I'd be interested in reading what you come up with once you're ready to share it.

One thing you might consider is whether sigma-completeness is really necessary, or whether a weaker concept will do. One can argue that, from the perspective of constructing a logical system, only computable countable unions are of interest, rather than arbitrary countable unions.

comment by ChristianKl · 2017-07-07T01:29:40.204Z · LW(p) · GW(p)

I don't think that changes much about the core argument. Chapman wrote in Probability theory does not extend logic :

Probability theory can be viewed as an extension of propositional calculus. Propositional calculus is described as “a logic,” for historical reasons, but it is not what is usually meant by “logic.”

[...]

Probability theory by itself cannot express relationships among multiple objects, as predicate calculus (i.e. “logic”) can. The two systems are typically combined in scientific practice.

Replies from: ksvanhorn, SoerenE, Kaj_Sotala, cousin_it
comment by SoerenE · 2017-07-07T09:24:21.949Z · LW(p) · GW(p)

I do not know enough about logic to be able to evaluate the argument. But from the Outside View, I am inclined to be skeptical about David Chapman:

DAVID CHAPMAN

"Describing myself as a Buddhist, engineer, scientist, and businessman (...) and as a pop spiritual philosopher"

Web-book in progress: Meaningness

Tagline: Better ways of thinking, feeling, and acting—around problems of meaning and meaninglessness; self and society; ethics, purpose, and value.

EDWIN THOMPSON JAYNES

Professor of Physics at Washington University

Most cited works:

Information theory and statistical mechanics - 10K citations

Probability theory: The logic of science - 5K citations

The tone of David Chapman's refutation:

E. T. Jaynes (...) was completely confused about the relationship between probability theory and logic. (...) He got confused by the word “Aristotelian”—or more exactly by the word “non-Aristotelian.” (...) Jaynes is just saying “I don’t understand this, so it must all be nonsense.”

Replies from: TheAncientGeek, ChristianKl, TheAncientGeek
comment by TheAncientGeek · 2017-07-11T15:17:14.217Z · LW(p) · GW(p)

Something you are not taking into account is that Chapman was born a lot later. Any undergraduate physicist can tell you where Newton went wrong.

Replies from: SoerenE
comment by SoerenE · 2017-07-11T19:22:20.572Z · LW(p) · GW(p)

I think difference in date of birth (1922 vs ~1960) is less important than difference of date of publication (2003 vs ~2015).

On the Outside View, is criticism 12 years after publication more likely to be valid than criticism levelled immediately? I do not know. On one hand, science generally improves over time. On the other hand, if a particular work gets its first criticism only after many years, it could mean that the work is of higher quality.

comment by ChristianKl · 2017-07-07T19:07:28.324Z · LW(p) · GW(p)

From the outside view, David Chapman is an MIT PhD who published papers on artificial intelligence.

From the outside view, I think AI credentials qualify a person more than physics credentials.

Replies from: SoerenE
comment by SoerenE · 2017-07-07T19:54:37.362Z · LW(p) · GW(p)

Thank you for pointing this out. I did not do my background check far enough back in time. This substantially weakens my case.

I am still inclined to be skeptical, and I have found another red flag. As far as I can tell, E. T. Jaynes is generally very highly regarded, and the only person who is critical of his book is David Chapman. This is just from doing a couple of searches on the Internet.

There are many people studying logic and probability. I would expect some of them would find it worthwhile to comment on this topic if they agreed with David Chapman.

Replies from: ChristianKl, TheAncientGeek
comment by ChristianKl · 2017-07-08T16:14:05.321Z · LW(p) · GW(p)

As far as I can tell, E. T. Jaynes is generally very highly regarded, and the only person who is critical of his book is David Chapman.

I don't think it's a good sign for a book if there isn't anybody to be found that criticizes it.

ksvanhorn's response that defends Jaynes still grants:

I agree with Chapman that probability theory does not extend the predicate calculus. I had thought this too obvious to mention, but perhaps it needs emphasizing for people who haven’t studied mathematical logic. Jaynes, in particular, was not versed in mathematical logic, so when he wrote about “probability theory as extended logic” he failed to properly identify which logic it extended.

[...]

My view is that the role of the predicate calculus in rationality is in model building. It gives us the tools to create mathematical models of various aspects of our world, and to reason about the properties of these models. The predicate calculus is indispensable for doing mathematics.

I think the view that Eliezer argues is that you can basically do all relevant reasoning with Bayes, not that you can't reason well about the properties of mathematical models with Bayes.

Replies from: Oscar_Cunningham, SoerenE
comment by Oscar_Cunningham · 2017-07-09T02:09:55.174Z · LW(p) · GW(p)

FWIW Loads of people criticise Jaynes' book all the time.

Replies from: ChristianKl, SoerenE
comment by ChristianKl · 2017-07-09T17:43:02.888Z · LW(p) · GW(p)

It's still a bad argument to judge a book based on the fact that one is unable to find criticism.

comment by SoerenE · 2017-07-10T07:47:20.416Z · LW(p) · GW(p)

Could you post a link to a criticism similar to David Chapman's?

The primary criticism I could find was the errata. From the Outside View, the errata looks like a number of mathematically minded people found it to be worth their time to submit corrections. If they had thought that E. T. Jaynes was hopelessly confused, they would not have submitted corrections of this kind.

Replies from: Oscar_Cunningham
comment by Oscar_Cunningham · 2017-07-10T21:31:22.298Z · LW(p) · GW(p)

I can't link to a criticism that makes the same points as Chapman, but my favourite criticism of Jaynes is the paper "Jaynes's maximum entropy prescription and probability theory" by Friedman and Shimony, criticising the MAXENT rule. It's behind a paywall, but there's an (actually much better) description of the same result in Section 5 of "The constraint rule of the maximum entropy principle" by Uffink. (It actually came out before PT:TLOS was published, but Jaynes' description of MAXENT doesn't change so the criticism still applies).

Replies from: SoerenE
comment by SoerenE · 2017-07-11T07:49:52.916Z · LW(p) · GW(p)

Yes! From the Outside View, this is exactly what I would expect substantial, well-researched criticism to look like. Appears very scientific, contains plenty of references, is peer-reviewed and published in "Journal of Statistical Physics" and has 29 citations.

Friedman and Shimony's criticism of MAXENT is in stark contrast to David Chapman's criticism of "Probability Theory".

Replies from: Oscar_Cunningham, TheAncientGeek
comment by Oscar_Cunningham · 2017-07-11T09:57:37.308Z · LW(p) · GW(p)

FWIW I think that David Chapman's criticism is correct as far as it goes, but I don't think that it's very damning. Propositional logic is indeed a "logic" and it's worthwhile enough for probability theory to extend it. Trying to look at predicate logic probabilistically would be interesting but it's not necessary.

comment by TheAncientGeek · 2017-07-11T09:45:50.055Z · LW(p) · GW(p)

Chapman wasn't even attempting to write an original paper, and in fact points out early on that he is repeating well known (outside LW) facts.

comment by SoerenE · 2017-07-08T19:32:04.227Z · LW(p) · GW(p)

I don't think it's a good sign for a book if there isn't anybody to be found that criticizes it.

I think it is a good sign for a Mathematics book that there isn't anybody to be found that criticizes it except people with far inferior credentials.

comment by TheAncientGeek · 2017-07-11T09:52:33.077Z · LW(p) · GW(p)

As far as I can tell, E. T. Jaynes is generally very highly regarded, and the only person who is critical of his book is David Chapman.

Chapman doesn't criticise Jaynes directly, he criticises what he calls Pop Bayesianism.

Replies from: SoerenE
comment by SoerenE · 2017-07-11T12:37:22.521Z · LW(p) · GW(p)

I should clarify that I am referring to the section David Chapman calls: "Historical appendix: Where did the confusion come from?". I read it as a criticism of both Jaynes and his book.

comment by TheAncientGeek · 2017-07-11T09:55:38.586Z · LW(p) · GW(p)

I do not know enough about logic to be able to evaluate the argument.

Chapman's argument? Do you know enough logic to understand Yudkowsky's argument, then?

Replies from: SoerenE
comment by SoerenE · 2017-07-11T12:10:28.282Z · LW(p) · GW(p)

No, I do not know what Yudkowsky's argument is. Truth be told, I probably would be able to evaluate the arguments, but I have not considered it important. Should I look into it?

I care about whether "The Outside View" works as a technique for evaluating such controversies.

comment by Kaj_Sotala · 2017-07-08T20:45:08.250Z · LW(p) · GW(p)

Chapman on Twitter about the original post:

Not relevant to the propositional vs predicate issue I wrote about, but looks like an interesting alternative approach to Cox’s result.

comment by cousin_it · 2017-07-07T07:34:06.771Z · LW(p) · GW(p)

I've seen that article before, but can't quite understand it. Is there really a use for mixed sentences like "the probability that the probability that all ravens are black is 0.5 is 0.5"? It seems like both quantifiers and meta-probabilities are unnecessary, I can say all I want just by having a prior over states of the world with all its ravens. Relationships among multiple objects get folded into that as well.

Replies from: Dagon, TheAncientGeek, Lumifer
comment by Dagon · 2017-07-08T14:42:48.691Z · LW(p) · GW(p)

Sure, but you can't actually hold the probability vector over all states with ravens. So you move up a level and summarize that set of probabilities to a smaller (and less precise) set.

All uncertainty is map, not territory. Anytime you are using probability, you're acknowledging that you're a limited calculator that cannot hold the complete state of the universe. If you could, you wouldn't need probability, you'd actually know the thing.

Meta-models are useful when specific models get cumbersome. Likewise meta-probability.

Replies from: cousin_it
comment by cousin_it · 2017-07-08T23:18:48.467Z · LW(p) · GW(p)

You don't need meta-probability to compress priors. For example, a uniform prior on [0,1] talks about an uncountable set of events, but its description is tiny and doesn't use meta-probabilities.

Replies from: TheAncientGeek
comment by TheAncientGeek · 2017-07-11T15:46:32.705Z · LW(p) · GW(p)

And it's a special case.

comment by TheAncientGeek · 2017-07-07T09:28:01.639Z · LW(p) · GW(p)

How can you "have" an infinitely complex prior?

Replies from: cousin_it
comment by cousin_it · 2017-07-07T09:38:08.962Z · LW(p) · GW(p)

It doesn't have to be infinitely complex. Let's say there are only ten ravens and ten crows, each of which can be black or white. Chapman says I can't talk about them using probability theory because there are two kinds of objects, so I need meta-probabilities and quantifiers and whatnot. But I don't need any of that stuff, it's enough to have a prior over possible worlds, which would be finite and rather small.

Replies from: TheAncientGeek, TheAncientGeek
comment by TheAncientGeek · 2017-07-12T18:24:53.604Z · LW(p) · GW(p)

Only you need to keep switching priors to deal with one finite and small problem after another. Whatever that is, it is not strong Bayes.

comment by TheAncientGeek · 2017-07-09T13:49:41.004Z · LW(p) · GW(p)

That amounts to saying that Bayes works in finite, restricted cases, which no one is disputing. The thing is that your scheme doesn't work in the general case.

comment by Lumifer · 2017-07-07T14:35:55.806Z · LW(p) · GW(p)

I can say all I want just by having a prior over states of the world with all its ravens

No, you can't. Not in practice.

It's the same deal as with AIXI -- quite omnipotent in theory, can't do much of anything in reality. Take a real-life problem and show me your prior over all the states of the world.

Replies from: cousin_it
comment by cousin_it · 2017-07-07T15:35:32.512Z · LW(p) · GW(p)

All priors are over the state of the world, just coarse-grained :-) So any practical application of Bayesian statistics should suffice for your request.

Replies from: Lumifer
comment by Lumifer · 2017-07-07T15:50:07.418Z · LW(p) · GW(p)

So any practical application of Bayesian statistics should suffice for your request.

Any practical application does not give me an opportunity to "say all I want just by having a prior over states of the world" because it doesn't involve such a prior. A practical application sets out a model with some parameters and invites me to specify (preferably in a neat analytical form) the prior for these parameters.

comment by Elo · 2017-07-06T19:27:37.458Z · LW(p) · GW(p)

Suggest the paper be listed on library genesis. Or whichever service you choose.

comment by CronoDAS · 2017-07-07T15:17:39.575Z · LW(p) · GW(p)

Why is #4 above "less than" and not "less than or equal to"?

::thinks a bit::

What this is saying is, if there are logically possible worlds where A is false and B is true, but no logically possible worlds where A is true and B is false, then A is strictly less likely than B - that all logically possible worlds have nonzero probability. This is a pretty strong assumption...

Replies from: ksvanhorn
comment by ksvanhorn · 2017-07-07T15:56:58.041Z · LW(p) · GW(p)

Epistemic probabilities / plausibilities are not properties of the external world; they are properties of the information you have available. Recall that the premise X contains all the information you have available to assess plausibilities. If X does not rule out a possible world, what basis do you have for assigning it 0 probability? Put another way, how do you get to 100% confidence that this possible world is in fact impossible, when you have no information to rule it out?