[Link] Better results by changing Bayes’ theorem
post by XiXiDu · 2012-03-09T19:38:58.325Z · LW · GW · Legacy · 17 comments
If it ever turns out that Bayes fails - receives systematically lower rewards on some problem, relative to a superior alternative, in virtue of its mere decisions - then Bayes has to go out the window.
-- Eliezer Yudkowsky, Newcomb's Problem and Regret of Rationality
Don't worry, we don't have to abandon Bayes’ theorem yet. But changing it slightly seems to be the winning Way given certain circumstances. See below:
In Peter Norvig’s talk The Unreasonable Effectiveness of Data, starting at 37:42, he describes a translation algorithm based on Bayes’ theorem. Pick the English word that has the highest posterior probability as the translation. No surprise here. Then at 38:16 he says something curious.
So this is all nice and theoretical and pure, but as well as being mathematically inclined, we are also realists. So we experimented some, and we found out that when you raise that first factor [in Bayes' theorem] to the 1.5 power, you get a better result.
In other words, if we change Bayes’ theorem (!) we get a better result. He goes on to explain
Link: johndcook.com/blog/2012/03/09/monkeying-with-bayes-theorem/
Peter Norvig - The Unreasonable Effectiveness of Data
17 comments
Comments sorted by top scores.
comment by jimrandomh · 2012-03-09T21:12:53.397Z · LW(p) · GW(p)
A little context: translating foreign (f) to English (e) is finding the most-probable English text e for a given foreign phrase,
argmax[e] Pr(e|f)
= argmax[e] Pr(e)*Pr(f|e) / Pr(f)
= argmax[e] Pr(e)*Pr(f|e)
(By applying Bayes' rule and then dropping the constant Pr(f), which doesn't affect the argmax). They found that, empirically, it worked better to instead find
argmax[e] Pr(e)^1.5 * Pr(f|e)
What this means is that whatever's generating Pr(f|e) is generating overconfident numbers (or equivalently in this case, that whatever generates Pr(e) is generating underconfident numbers). This corrects for that.
It's a little confusing that this was presented as a modification to Bayes' rule, rather than as a calibration factor applied to the underlying estimators, but it's really the latter. The reason for putting it here, rather than there, is probably that if the calibration were done on the original estimates, it would introduce a spurious degree of freedom, since only the relative weights matter.
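In code, the point is that the exponent is just a weight on the language-model term of the decision rule, tunable like any other calibration knob. A minimal sketch, assuming hypothetical `lm_logprob` (log Pr(e)) and `tm_logprob` (log Pr(f|e)) scoring functions:

```python
def best_translation(f, candidates, lm_logprob, tm_logprob, lm_weight=1.5):
    """Pick the candidate e maximizing lm_weight * log Pr(e) + log Pr(f|e),
    i.e. argmax_e Pr(e)^lm_weight * Pr(f|e).

    lm_weight = 1.0 recovers the plain Bayes decision rule (the constant
    Pr(f) drops out of the argmax); lm_weight = 1.5 is the empirical tweak
    described in the talk."""
    return max(candidates,
               key=lambda e: lm_weight * lm_logprob(e) + tm_logprob(f, e))
```

Viewed this way, nothing about probability theory changes; lm_weight is an ordinary parameter tuned on held-out data to compensate for the miscalibration of the two estimators.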
Replies from: DanielVarga
↑ comment by DanielVarga · 2012-03-10T00:11:09.970Z · LW(p) · GW(p)
Excellent explanation. I would add that the source of this overconfidence is not a mystery at all. Models for estimating Pr(f|e) are so ridiculously simplistic that a layperson would laugh us out of the room if we explained them to her in plain English instead of formulas. For example, Pr(f|e) was sometimes defined as the probability that we can produce f from e by first applying a randomly chosen lexicon translation for each word of e, and then doing a random local reordering of words. Here the whole responsibility of finding a random reordering that leads to a grammatical English sentence rests on the shoulders of Pr(e). It's almost like the translation model spits out a bag of words, and the language model has to assemble them into a chain of words. (The above simple example is far from the state of the art, but the actual state of the art is not that much more realistic either.)
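To make the simplicity concrete, here is a deliberately crude sketch of the kind of Pr(f|e) model described above; the lexicon, its numbers, and the function name are invented for illustration. Word order plays no role at all, so Pr(e) carries the whole burden of preferring grammatical orderings.

```python
import math

# Toy lexicon: Pr(foreign word | English word). Numbers are made up.
LEXICON = {
    "the":   {"la": 0.5, "le": 0.5},
    "blue":  {"bleue": 0.6, "bleu": 0.4},
    "house": {"maison": 0.8, "domicile": 0.2},
}

def toy_tm_logprob(f_words, e_words):
    """Crude bag-of-words translation model: each foreign word is generated by
    picking one English word uniformly at random and emitting one of its
    lexicon translations. Grammar is entirely the language model's problem."""
    total = 0.0
    for f in f_words:
        # Mixture over the English words that could have produced f.
        p = sum(LEXICON.get(e, {}).get(f, 0.0) for e in e_words) / len(e_words)
        if p == 0.0:
            return float("-inf")  # f cannot be produced from e under this toy model
        total += math.log(p)
    return total

# Both orderings of the English words get exactly the same score:
print(toy_tm_logprob(["la", "maison", "bleue"], ["the", "blue", "house"]))
print(toy_tm_logprob(["la", "maison", "bleue"], ["the", "house", "blue"]))
```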
comment by orthonormal · 2012-03-10T00:48:17.374Z · LW(p) · GW(p)
Bayes is optimal if you throw all your knowledge into the equation. That's practically infeasible, so this program throws away most of the data first, and then applies a heuristic to the remaining data. There's no guarantee that applying Bayes' Rule directly to the remaining data will outperform another heuristic, just that both would be outperformed by running the ideal version of Bayes (and including everything we know about grammar, among other missing data).
Replies from: endoself
↑ comment by endoself · 2012-03-10T01:38:30.551Z · LW(p) · GW(p)
I don't follow. The heuristic isn't using any of the thrown-away data; it's just using the same data they used Bayes on. That is to say, someone who had only the information that Norvig actually used could either apply Bayes using all their knowledge or use this heuristic with all their knowledge, and the heuristic would come out better.
This could possibly be explained if the heuristic also embodied some background information, not explicitly mentioned in the data, that allows us to correct for overconfidence in P(f|e), as suggested by DanielVarga, or if the heuristic was effectively performing an expected utility calculation, as suggested in my other comment.
Replies from: orthonormal
↑ comment by orthonormal · 2012-03-10T16:01:33.107Z · LW(p) · GW(p)
This could possibly be explained if the heuristic also embodied some background information, not explicitly mentioned in the data, that allows us to correct for overconfidence in P(f|e)
Exactly. If there's some structure to the full dataset that's unrecoverable from the part that's kept, you can code that structure into a heuristic which will outperform Naive Bayes on the remaining data, but an ideal Bayesian reasoner with access to the full dataset would have picked up that structure as well, and you wouldn't be outperforming xer.
So the post is evidence of interesting structure in word frequencies, not a deficiency of Bayes' Rule.
comment by gwern · 2012-03-09T19:48:25.451Z · LW(p) · GW(p)
Why is this interesting?
Replies from: David_Gerard
↑ comment by David_Gerard · 2012-03-09T20:20:53.968Z · LW(p) · GW(p)
Erm, its relation as an example to the quote the article starts with?
Replies from: gwern
↑ comment by gwern · 2012-03-09T20:23:56.874Z · LW(p) · GW(p)
Any other machine learning algorithm is a potential example for the observation 'naive token-by-token Bayes can be outperformed'.
Replies from: XiXiDu
↑ comment by XiXiDu · 2012-03-09T20:35:54.443Z · LW(p) · GW(p)
Any other machine learning algorithm is a potential example for the observation 'naive token-by-token Bayes can be outperformed'.
I see. I didn't know this was the case. I am sorry if this is something really banal; it was new to me. Feel free to downvote it.
comment by gwern · 2012-03-12T18:36:43.891Z · LW(p) · GW(p)
Hacker News discussion: http://news.ycombinator.com/item?id=3693447
Top comment is by srean, which I will shamelessly copy here:
This is more of a tweak of naive Bayes than of Bayes' theorem, and I suspect he is being a bit tongue in cheek and not letting on what's behind the tweak.
I am sure you have heard that naive Bayes makes gratuitous assumptions of independence. What is not mentioned as often is that it also assumes the document has been generated by a memory-less process.
So if I were to generate a document according to the model assumed by naive Bayes, and I want to decide if I should concatenate another "the" in the document, then I don't need to keep track of how many "the"s I have already added to the document. As a result, the probability of n_i occurrences of a word i goes down exponentially, like this:
P(n_i) = p_i^n_i.
Many words do not behave like this. Their probabilities do not go down monotonically from the start; rather, for words like "the", the probability (conditioned on the history) climbs initially and then goes down.
Naive Bayes works surprisingly well in spite of its grossly violated assumptions. There are many explanations for that. However, we usually forget that to make NB work, we have to throw away "stop-words". Among them are those exact "memory-full" words that violate NB's assumptions.
Let's get back to word frequencies: a model that fits word frequencies quite well is the power law (http://en.wikipedia.org/wiki/Power_law). It looks like this:
P(n_i) \propto n_i^c
where c is a constant for that word id i. For English, c usually lies in the ballpark of -1.5 to -2.1.
The tweak that Norvig mentions is not an exact match for power-law assumptions, but it comes very close. It's using a power-law assumption, but with sub-optimal parameters. In fact, using the power-law assumption with its parameters estimated from the data, you could get an even better classifier. Though be careful of the industry of seeing power laws everywhere: log-normal distributions can look deceptively similar to a power law and are more appropriate on many such occasions.
Yeah, naive Bayes has a bad assumption of independence, but there is no reason it has to be memory-less too, and that is partly what the tweak fixes; the theorem isn't really being violated.
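A sketch of the contrast srean is drawing, with made-up numbers (the function names and the example probability are mine): under the memory-less view the probability of seeing a word n times falls exponentially, under the power-law view only polynomially.

```python
import math

def memoryless_logprob(n, p):
    """Memory-less / naive Bayes view: Pr(n_i) = p_i^n_i, so repeated
    occurrences are penalized exponentially."""
    return n * math.log(p)

def power_law_logprob(n, c=-1.7):
    """Power-law view: Pr(n_i) proportional to n_i^c, so repeats are penalized
    only polynomially. c in the -1.5 to -2.1 ballpark for English; the
    normalizing constant is dropped, since only relative scores matter."""
    return c * math.log(n)

# How each model scores seeing the same word 1..5 times (p_i = 0.03 is made up):
for n in range(1, 6):
    print(n, round(memoryless_logprob(n, 0.03), 2), round(power_law_logprob(n), 2))
```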
Replies from: XiXiDu
↑ comment by XiXiDu · 2012-03-12T19:37:48.494Z · LW(p) · GW(p)
Hacker News discussion: http://news.ycombinator.com/item?id=3693447
I should have posted it over at Hacker News and gotten +103 karma instead of -1 :-)
Replies from: gwern, wedrifid
↑ comment by wedrifid · 2012-03-13T06:08:04.023Z · LW(p) · GW(p)
I should have posted it over at Hacker News and gotten +103 karma instead of -1 :-)
You could have gotten positive karma here too if you hadn't presented it incorrectly as a challenge to:
If it ever turns out that Bayes fails - receives systematically lower rewards on some problem, relative to a superior alternative, in virtue of its mere decisions - then Bayes has to go out the window.
-- Eliezer Yudkowsky, Newcomb's Problem and Regret of Rationality
Replies from: XiXiDu
↑ comment by XiXiDu · 2012-03-13T09:17:39.220Z · LW(p) · GW(p)
You could have gotten positive karma here too if you hadn't presented it incorrectly as a challenge to...
As a "challenge to"? I incorrectly assumed that it is a good example in support of something Eliezer said, namely that you choose the winning way and not your favorite heuristic.
Replies from: wedrifid
↑ comment by wedrifid · 2012-03-14T01:24:14.653Z · LW(p) · GW(p)
As a "challenge to"? I incorrectly assumed that it is a good example in support of something Eliezer said, namely that you choose the winning way and not your favorite heuristic.
I suppose it would actually be a support of Eliezer's 'choose the winning way' theme while simultaneously undermining everything he said about probability theory.
Either way, it is the incorrect framing, not just the location, that dampened the reception here.
comment by DanielLC · 2012-03-09T21:10:35.147Z · LW(p) · GW(p)
Two possible explanations come to mind:
1. They are defining "better" as something other than a higher expected number of correctly translated words.
2. The probabilities they're using are biased, and this reverses the bias.
↑ comment by endoself · 2012-03-09T21:56:50.529Z · LW(p) · GW(p)
I haven't watched the video, but are they using expected value at all or are they just using the most likely word? Accidentally using a nonoptimal common word seems like it would produce a better translation than accidentally using a nonoptimal uncommon word, so this effect might just be making their algorithm more like expected utility and less like raw probabilities.
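A minimal sketch of the distinction being asked about, with a hypothetical posterior and utility function (none of this is from the talk): picking the single most probable word versus picking the word with the highest expected utility, which can favor a common word because its mistakes are cheaper.

```python
def map_choice(posterior):
    """Pick the single most probable candidate."""
    return max(posterior, key=posterior.get)

def expected_utility_choice(posterior, utility):
    """Pick the candidate maximizing sum_t Pr(t) * utility(choice, t),
    where t ranges over the candidates treated as the possibly-true word."""
    return max(
        posterior,
        key=lambda choice: sum(p * utility(choice, t) for t, p in posterior.items()),
    )

# Illustrative posterior over candidate translations of one word (made-up numbers).
posterior = {"embankment": 0.40, "bank": 0.35, "shore": 0.25}

# Hypothetical utility: a correct word scores 1; a wrong but common word still
# reads acceptably, a wrong rare word does not.
common = {"bank": True, "shore": True, "embankment": False}
def utility(choice, truth):
    if choice == truth:
        return 1.0
    return 0.3 if common[choice] else 0.0

print(map_choice(posterior))                        # "embankment"
print(expected_utility_choice(posterior, utility))  # "bank" - its mistakes cost less
```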