Eliezer's example on Bayesian statistics is wr... oops!

post by Zane · 2023-10-17T18:38:18.327Z · LW · GW · 13 comments

This post was going to be an explanation of how an example Eliezer Yudkowsky frequently uses in discussion of statistics doesn't actually imply what he thinks it does. In the process of proving his mistake, I found out I was actually wrong. I'm still writing the post up, and it might end up interesting to someone else making the same mistake as me.

The example in question is one that Eliezer often uses when arguing that Bayesian statistics are superior to frequentist ones: here in the Sequences, here on Arbital, here in Glowfic, and a couple of other times I can't find at the moment. A coin is flipped six times; it comes up heads the first five times, and tails the sixth. But we don't know whether the experimenter had A) decided beforehand to flip the coin six times and report what happened, or B) decided beforehand to flip the coin over and over until it came up tails and report however long that took. What information does this give us about the coin's bias, or lack thereof?

According to a frequentist perspective, there's a serious difference in the p-values one would assign in these two different scenarios. In the first case, the result HHHHHT is placed into the class of results that are at least that extreme relative to the "null hypothesis" - of which there are 14: HHHHHH, TTTTTT, HHHHHT, TTTTTH, HHHHTH, TTTTHT, and so on. With 64 possibilities in total, 14/64 = 0.22, which is far above the p=0.05 level and therefore not enough to conclude significance. In the second case, the result HHHHHT is instead placed into a different class of results: 5 heads and then a tail, 6 heads and then a tail, 7 heads and then a tail, and so on forever. The probability of that entire class is 1/32 in total, which is 0.03 - statistically significant!
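The two p-values above can be checked by brute force. The following is a minimal sketch, assuming a fair coin as the null hypothesis and the "5 or more of the same side" class described above; the function names are illustrative and not taken from any of the linked sources.

```python
from fractions import Fraction
from itertools import product


def p_value_fixed_n(n=6, observed_heads=5):
    """Two-sided p-value for the design 'flip exactly n times'.

    Counts the sequences whose majority side appears at least as often
    as in the observed result (here: 5 or more of the same side).
    """
    threshold = max(observed_heads, n - observed_heads)
    extreme = sum(
        1
        for seq in product("HT", repeat=n)
        if max(seq.count("H"), seq.count("T")) >= threshold
    )
    return Fraction(extreme, 2 ** n)


def p_value_until_tails(observed_heads=5):
    """p-value for the design 'flip until the first tails'.

    The extreme class is 'at least observed_heads heads before the first
    tails', which for a fair coin has probability (1/2) ** observed_heads.
    """
    return Fraction(1, 2 ** observed_heads)


print(p_value_fixed_n())      # 7/32, i.e. 14/64
print(p_value_until_tails())  # 1/32
```

The same observed sequence lands on either side of the 0.05 threshold purely because the comparison class changes with the stopping rule.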

Eliezer criticizes this in several ways. To start out with, the part where a frequentist decides to lump in the actual result with a group of results that are similar is subjective enough to allow for a lot of freedom in what the actual result is. Maybe in the first case, instead of choosing the class of results with 5 or more of the same side, you only choose the class of results with 5 or more heads in particular, thus halving the p=0.22 to p=0.11. He also criticizes the very notion of significance being determined by "rejecting the null hypothesis" rather than looking at different theoretical effect sizes and how well they would have predicted the data. Two experiments that are evidence supporting entirely different effect sizes are treated as both "rejecting the null hypothesis" and thus evidence towards the same theory, even if the results are inconsistent with each other.

All of this criticism of frequentist statistics and p-values seemed to be correct. But his analysis of how a Bayesian would update was a different matter.

...a Bayesian looks at the experimental result and says, "I can now calculate the likelihood ratio (evidential flow) between all hypotheses under consideration. Since your state of mind doesn't affect the coin in any way - doesn't change the probability of a fair coin or biased coin producing this exact data - there's no way your private, unobservable state of mind can affect my interpretation of your experimental results."
If you're used to Bayesian methods, it may seem difficult to even imagine that the statistical interpretation of the evidence ought to depend on a factor - namely the experimenter's state of mind - which has no causal connection whatsoever to the experimental result. (Since Bayes says that evidence is about correlation, and no systematic correlation can appear without causal connection; evidence requires entanglement.)

So Eliezer is arguing that the likelihood ratios should obviously be the same in both scenarios, because the only relevant data is what sequence of flips the coin produced. The experimenter's state of mind doesn't change the probability that a coin of a given bias would produce this data, so it's irrelevant.

But the key element Eliezer seems to be missing here is that the total sum of the data is not "The coin came up HHHHHT." The data that we received is, instead, "The experimenter saw the coin come up HHHHHT." And that is the sort of evidence that is causally entangled with the experimenter's state of mind, because the experimenter's state of mind determines in which cases the experimenter will ever see the coin come up HHHHHT. If the real fact of the matter was that the coin really was fair, for example, the experimenter's state of mind following the "flip until you get tails" rule causes it to be less likely that the experimenter will ever get to the point of having a sixth flip in the first place, because there is now a 31/32 chance the experimenter would stop before flip six. Evidence plus the knowledge that the evidence is filtered can often have different properties than the unquoted evidence would have on its own; the experimenter's method of deciding which evidence to search for changes which evidence they are likely to find.

The two possibilities produce very different prior distributions over possible outcomes. Assuming that the actual bias of the coin is such that the theoretical frequency of heads is f: then in the "flip n times" case, the prior is distributed between all possible sequences of length n, with each one having a probability of f^(number of total heads in the sequence) * (1-f)^(number of tails in the sequence). (In the case where f = 1/2 and n = 6, this simply reduces to $(1/2)^{6} = 1/64$ for each sequence.) Meanwhile, in the "flip until you get tails" case, the prior is a $(1 - f)$ chance of T, $f (1 - f)$ chance of HT, $f^{2} (1 - f)$ chance of HHT, $f^{3} (1 - f)$ chance of HHHT, and so on, always ending up at $f^{(l - 1)} (1 - f)$ for a sequence of total length l. These priors both assign very different probabilities to different outcomes - in fact, there are many outcomes permitted by versions of the first that have zero probability on the second (like HTHTHTHT) or permitted by the second but with zero probability on low-n versions of the first (like HHHHHHHHHHHHT if n=6.)
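The two outcome distributions can be written down directly. This is a rough sketch under the definitions in the paragraph above, with sequences represented as strings of "H" and "T"; the function names are made up for illustration.

```python
from itertools import product


def prob_fixed_n(seq, f, n):
    """P(seq | heads frequency f) under the design 'flip exactly n times'."""
    if len(seq) != n:
        return 0.0  # this design can only ever produce sequences of length n
    return f ** seq.count("H") * (1 - f) ** seq.count("T")


def prob_until_tails(seq, f):
    """P(seq | heads frequency f) under the design 'flip until the first tails'."""
    # Valid outcomes are zero or more heads followed by exactly one tails.
    if not seq or seq[-1] != "T" or "T" in seq[:-1]:
        return 0.0
    return f ** (len(seq) - 1) * (1 - f)


# Every length-6 sequence is possible under 'flip 6 times', and they sum to 1.
total = sum(prob_fixed_n("".join(s), 0.5, 6) for s in product("HT", repeat=6))
print(total)                                   # 1.0
print(prob_until_tails("HTHTHTHT", 0.5))       # 0.0
print(prob_fixed_n("HHHHHHHHHHHHT", 0.5, 6))   # 0.0
```

The supports of the two designs barely overlap, which is what makes the question in the next paragraph look plausible at first.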

So the probabilities over what happens during the experiment are very different, depending on whether you're flipping n times or flipping until you get tails. Do those different conditional priors mean that the final likelihood ratios would end up being different, contrary to what Eliezer claimed? That's what I thought.

After all, the way that Bayesian updating works in the first place is that you determine what the probabilities assigned to the experimental result were under different hypotheses (possible values of f), construct a likelihood distribution of what those probabilities were as a function of f, and multiply that by your prior distribution to update it. So the fact that the prior distributions conditional on the two types of experiments were so different would cause the things you were multiplying by to be different, and therefore give you different results.

(Feel free to pause and look for my error here, if you haven't found it yet.)

The error I made was that I was confusing the difference in the experimental designs' probability distributions over experimental results, with a difference in the likelihood distribution over hypotheses about the coin that the experiment would cause you to update towards. It is the case that the different experimental designs cause some experimental results to occur at different frequencies, but that does not automatically imply that the final update about the coin's bias will be different.

Whether the update about the coin is different depends only on the probabilities assigned to the result that actually happened. It doesn't matter if they assign wildly different probabilities to results like HHT and HTHTTH, if the experiment turns up HHHHHT and they assign the same probability to that. Which is, in fact, the case. In fact, while the distributions of experimental results look quite different at most values, they will always happen to cross at the exact value that actually turns out to be the result of the experiment, by some amazing not-really-a-coincidence.

This happens because, while the "flip until you get tails" is truly a constant probability distribution for any given value of f, the "flip n times" also depends on the value of n, making it really a family of separate distributions, one for each n, that happen to be similar to each other. If it takes 7 flips rather than 6 to get tails in the "flip until you get tails" experiment, that doesn't mean that you have suddenly gotten a result that would have been impossible on any "flip n times" distribution; it just means that you move to the "flip 7 times" distribution rather than "flip 6 times", and the probability of HHHHHHT on that distribution will end up matching the probability as assigned by the "flip until you get tails" distribution.

(The algebra here is simple enough. As said before, the probability of getting a given sequence for "flip n times" is f^(number of total heads in the sequence) * (1-f)^(number of tails in the sequence). But assuming that the sequence is one in which every value is heads, except for the last, which is tails, this reduces to $f^{(n - 1)} (1 - f)^{1}$ . This is identical to the previously described function for "flip until you get tails.")
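For readers who prefer to see the update carried out, here is a small sketch that computes a posterior over f on a grid under both designs. The uniform prior and the grid resolution are assumptions made for illustration only; the post itself does not fix a prior.

```python
def posterior_on_grid(likelihood, grid_size=1001):
    """Normalised posterior over a grid of f values, assuming a uniform prior."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    weights = [likelihood(f) for f in grid]
    total = sum(weights)
    return grid, [w / total for w in weights]


def lik_fixed_six(f):
    # Design 'flip exactly 6 times', result HHHHHT: 5 heads and 1 tails.
    return f ** 5 * (1 - f) ** 1


def lik_until_tails(f):
    # Design 'flip until tails', result of total length 6: f^(l - 1) * (1 - f).
    return f ** (6 - 1) * (1 - f)


grid, post_fixed = posterior_on_grid(lik_fixed_six)
_, post_until = posterior_on_grid(lik_until_tails)

# The two posteriors agree at every grid point, so the update is the same.
max_gap = max(abs(a - b) for a, b in zip(post_fixed, post_until))
mode = grid[max(range(len(grid)), key=post_fixed.__getitem__)]
print(max_gap, mode)
```

With this prior the posterior peaks near f = 5/6 whichever stopping rule was used, since the likelihood functions are the same function of f.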

See here for what the likelihood distribution over the coin's bias f really does look like after seeing n-1 heads and 1 tails, regardless of which of the two experimental designs is used.

So why am I writing this post, if I turned out to be wrong as a simple question of fact? There are a few things I sure learned here, and it seems possible someone else is also confused and able to learn something from them too.

The first lesson is just to be more careful about checking what precisely a probability distribution is telling you! In my initial calculations, I made a lot of mistakes before I could even start to be sure about what it was that was confusing me, several of which I haven't even mentioned here (like initially modelling something using a binomial distribution that really wasn't applicable there.) Most of these mistakes were of the nature of "I'm looking for a probability distribution over something something conditional on x; this thing here is a probability distribution over something something conditional on x; therefore it is the distribution I'm looking for." There's a difference between the distribution over experimental results given a particular type of experiment, and the likelihood ratio over hypotheses given the observation of a particular result; there's a difference between any particular version of a distribution dependent on one of the function's parameters, and the overall class of distributions formed from all possible values of that parameter.

The second thing I learned is that Bayesian likelihood ratios really do only depend on the probability each hypothesis assigned only to the information that you received, and nothing else. Which I verbally knew before, but I hadn't truly internalized. If two hypotheses assign the same probability to an outcome, and you see that outcome, that tells you nothing about any difference between the hypotheses. If I had ignored trying to quantify over all the possible outcomes, and just asked the comparatively simpler question of "what is the chance of HHHHHT in experiment 1, and in experiment 2," I probably could have solved it a lot more quickly.

And then there's also a possible lesson for me to learn of "see, you really should meta-level trust the reasoning of Eliezer Yudkowsky and other people who have more expertise in a given mathematical domain." I am not sure this is a good lesson to learn. And I'm also not sure that Eliezer actually saw all of the reasoning I went through in this post about why the two experiments assign the same probabilities to the actual result, rather than just guessing and happening to be correct. That being said, it still is the case that I would have previously given this as an example of a situation in which Eliezer Yudkowsky was wrong about basic probability theory (and I also would have said something like "and he probably made this mistake because of motivated reasoning in order to score points against frequentists"). And he turned out to be right all along. This is more likely in worlds where he knows his stuff more, and I have correspondingly updated my beliefs.

(I hope this goes without saying, but I'll say it anyway: a meta-level update towards trusting experts' math, does not mean first-order conforming to their opinions if you don't first-order agree with or understand them. I'll still keep trying to notice and point out when it looks like Eliezer is wrong about something - even if I might not bet as strongly that he really does turn out to be wrong.)

13 comments

Comments sorted by top scores.

comment by Radford Neal · 2023-10-17T22:05:21.821Z · LW(p) · GW(p)

I think you've got this pretty much figured out. But you may be missing an additional subtlety.

You say "Bayesian likelihood ratios really do only depend on the probability each hypothesis assigned only to the information that you received". Which could be interpreted as saying that the "likelihood function" is the probability assigned to the information received, seen as a function of f. But the likelihood function is actually not a function at all, but rather an equivalence class of functions of f that differ only by an overall positive scale factor.

You can see how this matters when comparing a report of an observation of 6 flips, such as HHHHHT, versus a report that gives only the number of tails, which is 1 in this case. The probability of HHHHHT as a function of f is $(1 - f)^{5} f$ , but the probability of 1 tail is $6 (1 - f)^{5} f$ , which is not the same function, but is in the same equivalence class, since it differs only by an overall factor of 6. Of course, this overall scale factor cancels out when looking at likelihood ratios for different values of f.
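The cancellation of the scale factor is easy to confirm numerically. The sketch below uses the comment's expression $6 (1 - f)^{5} f$ as written, together with the same expression without the factor of 6; the helper names are hypothetical.

```python
from math import comb


def sequence_likelihood(f):
    """Likelihood of one specific 6-flip sequence, in the comment's notation."""
    return (1 - f) ** 5 * f


def count_likelihood(f):
    """Likelihood of the summary '1 tail in 6 flips': same shape times C(6, 1) = 6."""
    return comb(6, 1) * (1 - f) ** 5 * f


def likelihood_ratio(likelihood, f1, f2):
    """Ratio of likelihoods at two candidate values of f."""
    return likelihood(f1) / likelihood(f2)


print(count_likelihood(0.3) / sequence_likelihood(0.3))   # the constant factor, about 6
print(likelihood_ratio(sequence_likelihood, 0.2, 0.5))
print(likelihood_ratio(count_likelihood, 0.2, 0.5))       # same ratio as the line above
```

Any two members of the same equivalence class give identical ratios, so they lead to identical posterior updates.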

Replies from: Zane

↑ comment by Zane · 2023-10-18T00:09:37.857Z · LW(p) · GW(p)

Yeah, I discovered that part by accident at one point because I used the binomial distribution equation in a situation where it didn't really apply, but still got the right answer.

I would think the most natural way to write a likelihood function would be to divide by the integral from 0 to 1, so that the total area under the curve is 1. That way the integral from a to b gives the probability the hypothesis assigns to receiving a result between a and b. But all that really matters is the ratios, which stay the same even without that.

Replies from: Radford Neal

↑ comment by Radford Neal · 2023-10-18T01:38:46.060Z · LW(p) · GW(p)

Integrals of the likelihood function aren't really meaningful, even if normalized so the integral is one over the whole range. This is because the result depends on the arbitrary choice of parameterization - e.g., whether you parameterize a probability by p in [0, 1], or by log(p) in [-∞, 0]. In Bayesian inference, one always integrates the likelihood only after multiplying by the prior, which can be seen as a specification of how the integration is to be done.

comment by Algon · 2023-10-17T21:15:51.611Z · LW(p) · GW(p)

Hey, I remember doing that exercise! I did the calculation right but was confused about the interpretation. Then when I thought of how the likelihood was moving as the experimenter followed their procedure, things became clear.

comment by LGS · 2023-10-20T08:38:14.277Z · LW(p) · GW(p)

The narrow point regarding likelihood ratios is correct, but the broader point in Eliezer's posts is arguably wrong. The issue with p-hacking is in large part selectively reporting results, and you don't get out of that by any amount of repeating the word "Bayesian". (For example, if I flip 10 coins but only show you the heads, you'll see HHHH, and no amount of Bayesian-ness will fix the problem; this is how publication bias works.)

Aside from selective reporting, much of the problem with p-values is that there's a specific choice of threshold (usually 0.05). This is a problem with likelihood ratios also. Eliezer says

Therefore likelihood functions can never be p-hacked by any possible clever setup without you outright lying, because you can't have any possible procedure that a Bayesian knows in advance will make them update in a predictable net direction. For every update that we expect to be produced by a piece of evidence $e$ , there's an equal and opposite update that we expect to probably occur from seeing $\neg e$ .

The second sentence is true, but this only implies you cannot be p-hacked in expectation. I can still manipulate the probability that you'll pass any given likelihood, and therefore I can still p-hack to some extent if we are talking about passing a specific threshold (which is, after all, the whole point of the original concept of p-hacking).

Think about it like this: suppose I am gambling in a casino where every bet has expectation 0. Then, on expectation, I can never make money, no matter my strategy. However, suppose that I can get my drug approved by a regulator if I earn 10x my investment in this casino. I can increase my chances of doing this (e.g. I can get the chance up to 10% if I'm willing to lose all my money the rest of the time), or, if I'm stupid, I can play a strategy that never achieves this (e.g. I make some double-or-nothing 50/50 bet). So I still have incentives to "hack", though the returns aren't infinite.

Basically, Eliezer is right that if I have to report all my data, I cannot fool you in expectation. He neglects that I can still manipulate the distribution over the possible likelihood ratios, so I still have some hacking ability. He also neglects the bigger problem, which is that I don't have to report all my data (for example, due to publication bias).
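The casino analogy can be made concrete with exact arithmetic. This is only a sketch of a hypothetical casino that will accept any zero-expectation bet; the two strategies below are illustrative choices, and the second (repeated doubling) is a variant of the single double-or-nothing bet mentioned above.

```python
from fractions import Fraction


def chance_single_long_shot(target_multiple):
    """Stake everything on one fair bet that pays target_multiple times the stake.

    Zero expectation forces the win probability to be 1 / target_multiple.
    """
    return Fraction(1, target_multiple)


def chance_repeated_doubling(target_multiple):
    """Stake everything on fair 50/50 double-or-nothing bets until the target is hit."""
    doublings = 0
    wealth = 1
    while wealth < target_multiple:
        wealth *= 2
        doublings += 1
    # Every doubling must be won; a single loss ends the run at zero.
    return Fraction(1, 2 ** doublings)


print(chance_single_long_shot(10))    # 1/10
print(chance_repeated_doubling(10))   # 1/16, since 4 doublings are needed to pass 10x
```

Both strategies have zero expected profit, yet they give different chances of clearing the 10x threshold, which is the sense in which the distribution of outcomes can still be shaped.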

Replies from: Radford Neal

↑ comment by Radford Neal · 2023-10-20T13:42:02.986Z · LW(p) · GW(p)

If you say that you are reporting all your observations, but actually report only a favourable subset of them, and the Bayesian for some reason assigns low probability to you deceiving them in this way, when actually you are deceiving them, then the Bayesian will come to the wrong conclusion. I don't think this is surprising or controversial.

But I don't see how the Bayesian comes to a wrong conclusion if you truthfully report all your observations, even if they are taken according to some scheme that produces a distribution of likelihood ratios that is supposedly favourable to you. The distribution doesn't matter. Only the observed likelihood ratio matters.

For example, suppose you want to convince the Bayesian that H is true with probability greater than 0.9. Some experiments may never produce data giving a likelihood ratio extreme enough to produce such a high probability. So you don't do such an experiment, and instead do one that could conceivably produce an extreme likelihood ratio. But it probably won't, if H is not actually true. If it does produce strong evidence for H, the Bayesian is right to think that H is probably true, regardless of your motivations (as long as you truthfully report all the data).

Replies from: LGS

↑ comment by LGS · 2023-10-21T05:51:54.260Z · LW(p) · GW(p)

If you say that you are reporting all your observations, but actually report only a favourable subset of them, and the Bayesian for some reason assigns low probability to you deceiving them in this way, when actually you are deceiving them, then the Bayesian will come to the wrong conclusion. I don't think this is surprising or controversial.

OK but please attempt to square this with Yudkowsky's claim:

Therefore likelihood functions can never be p-hacked by any possible clever setup without you outright lying, because you can't have any possible procedure that a Bayesian knows in advance will make them update in a predictable net direction.

I am saying that Yudkowsky is just plain wrong here, because omitting info is not the same as outright lying. And publication bias happens when the person omitting the info is not even the same one as the person publishing the study (null results are often never published).

This is just one way to p-hack a Bayesian; there are plenty of others, including the most common type of p-hack ever, the forking paths (e.g. this xkcd still works the same if you report likelihoods).

But I don't see how the Bayesian comes to a wrong conclusion if you truthfully report all your observations, even if they are taken according to some scheme that produces a distribution of likelihood ratios that is supposedly favourable to you. The distribution doesn't matter. Only the observed likelihood ratio matters.

I'm not sure what you mean by "wrong conclusion" exactly, but I'll note that your statement here is more-or-less also true for p-values. The main difference is that p-values try to only convince you the null hypothesis is false, which is an easier task; the likelihood ratio tries to convince you some specific alternate hypothesis has higher likelihood, which is necessarily a harder task.

Even with Eliezer's original setup, in which the only thing I can control is when to stop the coin flip, it is hard to get p<0.001. Moreover, if I do manage to get p<0.001, that same sequence of coins will have a likelihood ratio of something like 100:1 in favor of the coin having a mild bias, if my calculation is correct. A large part of Eliezer's trick in his program's simulation is that he looked at the likelihood ratio of 50% heads vs 55% heads; such a specific choice of hypotheses is much harder to hack than if you let me choose the hypotheses after I saw the coinflips (I may need to compare the 50% to 60% or to 52% to get an impressive likelihood ratio, depending on the number of coins I flipped before stopping).

For example, suppose you want to convince the Bayesian that H is true with probability greater than 0.9. Some experiments may never produce data giving a likelihood ratio extreme enough to produce such a high probability. So you don't do such an experiment, and instead do one that could conceivably produce an extreme likelihood ratio. But it probably won't, if H is not actually true. If it does produce strong evidence for H, the Bayesian is right to think that H is probably true, regardless of your motivations (as long as you truthfully report all the data).

This is never the scenario, though. It is very easy to tell that the coin is not 90% biased no matter what statistics you use. The scenario is usually that my drug improves outcomes a little bit, and I'm not sure how much exactly. I want to convince you it improves outcomes, but we don't know in advance how much exactly they improve. Perhaps we set a minimum threshold, like the coin needs to be biased at least 55% or else we don't approve the drug, but even then there's no maximum threshold, so there is no fixed likelihood ratio we're computing. Moreover, we agree in advance on some fixed likelihood ratio that you need to reach to approve my drug; let's say 20:1 in favor of some bias larger than 55%. Then I can get a lot of mileage out of designing my experiment very carefully to target that specific threshold (though of course I can never guarantee success, so I have to try multiple colors of jelly beans until I succeed).

Replies from: Radford Neal

↑ comment by Radford Neal · 2023-10-21T13:32:56.909Z · LW(p) · GW(p)

I am saying that Yudkowsky is just plain wrong here, because omitting info is not the same as outright lying.

This is silly. Obviously, Yudkowsky isn't going to go off on a tangent about all the ways people can lie indirectly, and how a Bayesian ought to account for such possibilities - that's not the topic. In a scientific paper, it is implicit that all relevant information must be disclosed - not doing so is lying. Similarly, a scientific journal must ethically publish papers based on quality, not conclusion. They're lying if they don't. As for authors just not submitting papers with undesirable conclusions - well, that's a known phenomenon, that one should account for, along with the possibility that a cosmic ray has flipped a bit in the memory of the computer that you used for data analysis, and the possibility that you misremembered something about one of the studies, and a million other possibilities that one can't possibly discuss in every blog post.

This is never the scenario, though. It is very easy to tell that the coin is not 90% biased no matter what statistics you use.

You misunderstand. H is some hypothesis, not necessarily about coins. Your goal is to convince the Bayesian that H is true with probability greater than 0.9. This has nothing to do with whether some coin lands heads with probability greater than 0.9.

I can get a lot of mileage out of designing my experiment very carefully to target that specific threshold (though of course I can never guarantee success, so I have to try multiple colors of jelly beans until I succeed).

I don't think so, except, as I mentioned, that you obviously will do an experiment that could conceivably give evidence meeting the threshold - I suppose that you can think about exactly which experiment is best very carefully, but that isn't going to lead to anyone making wrong conclusions.

The person evaluating the evidence knows that you're going to try multiple colors. A frequentist would handle this with some sort of p-value correction. A Bayesian handles this by a small prior probability of the drug working, which may partly be based on the knowledge that if drugs of this class (set of colors) had a high probability of working, there would probably already be evidence of this. But this has nothing to do with the point about the stopping rule for coin flips not affecting the likelihood ratio, and hence the Bayesian conclusion, whereas it does affect the p-value.

Replies from: LGS

↑ comment by LGS · 2023-10-21T21:47:47.258Z · LW(p) · GW(p)

This is silly. Obviously, Yudkowsky isn't going to go off on a tangent about all the ways people can lie indirectly, and how a Bayesian ought to account for such possibilities - that's not the topic. In a scientific paper, it is implicit that all relevant information must be disclosed - not doing so is lying. Similarly, a scientific journal must ethically publish papers based on quality, not conclusion. They're lying if they don't.

You're welcome to play semantic games if you wish, but that's not how most people use the word "lying" and not how most people understand Yudkowsky's post.

By this token, p-values also can never be hacked, because doing so is lying. (I can just define lying to be anything that hacks the p-values, which is what you seem to be doing here when you say that not publishing a paper amounts to lying.)

You misunderstand. H is some hypothesis, not necessarily about coins. Your goal is to convince the Bayesian that H is true with probability greater than 0.9. This has nothing to do with whether some coin lands heads with probability greater than 0.9.

You're switching goalposts. Yudkowsky was talking exclusively about how I can affect the likelihood ratio. You're switching to talking about how I can affect your posterior. Obviously, your posterior depends on your prior, so with sufficiently good prior you'll be right about everything. This is why I didn't understand you originally: you (a) used H for "hypothesis" instead of for "heads" as in the main post; and (b) used 0.9 for a posterior probability instead of using 10:1 for a likelihood ratio.

I don't think so, except, as I mentioned, that you obviously will do an experiment that could conceivably give evidence meeting the threshold - I suppose that you can think about exactly which experiment is best very carefully, but that isn't going to lead to anyone making wrong conclusions.

To the extent you're saying something true here, it is also true for p values. To the extent you're saying something that's not true for p values, it's also false for likelihood ratios (if I get to pick the alternate hypothesis).

The person evaluating the evidence knows that you're going to try multiple colors.

No, they don't. That is precisely the point of p-hacking.

But this has nothing to do with the point about the stopping rule for coin flips not affecting the likelihood ratio, and hence the Bayesian conclusion, whereas it does affect the p-value.

The stopping rule is not a central example of p-hacking and never was. But even for the stopping rule for coin flips, if you let me choose the alternate hypothesis instead of keeping it fixed, I can manipulate the likelihood ratio. And note that this is the more realistic scenario in real experiments! If I do an experiment, you generally don't know the precise alternate hypothesis in advance -- you want to test if the coin is fair, but you don't know precisely what bias it will have if it's unfair.

If we fix the two alternate hypotheses in advance, and if I have to report all data, then I'm reduced to only hacking by choosing the experiment that maximizes the chance of luckily passing your threshold via fluke. This is unlikely, as you say, so it's a weak form of "hacking". But this is also what I'm reduced to in the frequentist world! Bayesianism doesn't actually help. The key was (a) you forced me to disclose all data, and (b) we picked the alternate hypothesis in advance instead of only having a null hypothesis.

(In fact I'd argue that likelihood ratios are fundamentally frequentist, philosophically speaking, so long as we have two fixed hypotheses in advance. It only becomes Bayesian once you apply it to your priors.)

Replies from: Radford Neal

↑ comment by Radford Neal · 2023-10-22T00:01:59.812Z · LW(p) · GW(p)

If I do an experiment, you generally don't know the precise alternate hypothesis in advance -- you want to test if the coin is fair, but you don't know precisely what bias it will have if it's unfair.

Yes. But as far as I can see this isn't of any particular importance to this discussion. Why do you think it is?

Actually, a frequentist can just keep collecting more data until they get p<0.05, then declare the null hypothesis to be rejected. No lying or suppression of data required. They can always do this, even if the null hypothesis is true: After collecting $n$ data points, they have a 0.05 chance of seeing p<0.05. If they don't, they then collect $n K$ more data points, where $K$ is big enough that whatever happened with the first $n$ data points makes little difference to the p-value, so there's still about a 0.05 chance that p<0.05. If that doesn't produce a rejection, they collect $n K^{2}$ more data points, and so on until they manage to get p<0.05, which is guaranteed to happen eventually with probability 1.

But they aren't guaranteed to eventually get a Bayesian to think the null hypothesis is likely to be false, when it is actually true.
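A quick simulation gives a feel for how optional stopping inflates the false rejection rate. Note that this is a simplified variant, not the growing-batch scheme described above: it tests after every flip of a fair coin, uses a normal-approximation z-test with the conventional 1.96 cutoff, and caps the number of flips. All parameter values are arbitrary choices for illustration.

```python
import random
from math import sqrt


def ever_significant(max_flips, rng, min_flips=20, z_crit=1.96):
    """Flip a fair coin, testing after every flip; True if |z| ever exceeds z_crit."""
    heads = 0
    for n in range(1, max_flips + 1):
        heads += rng.random() < 0.5
        if n >= min_flips:
            # z-score of the head count under the null hypothesis of a fair coin.
            z = (heads - n / 2) / (sqrt(n) / 2)
            if abs(z) > z_crit:
                return True
    return False


def false_rejection_rate(runs=500, max_flips=500, seed=0):
    """Fraction of fair-coin runs in which the experimenter gets to stop and 'reject'."""
    rng = random.Random(seed)
    return sum(ever_significant(max_flips, rng) for _ in range(runs)) / runs


if __name__ == "__main__":
    print(false_rejection_rate())
```

Even with the cap on flips, the fraction of runs that reach nominal significance at some point comes out well above the nominal 5%, and it keeps growing as the cap is raised.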

Replies from: LGS

↑ comment by LGS · 2023-10-22T03:51:37.360Z · LW(p) · GW(p)

Yes. But as far as I can see this isn't of any particular importance to this discussion. Why do you think it is?

It's the key of my point, but you're right that I should clarify the math here. Consider this part:

Actually, a frequentist can just keep collecting more data until they get p<0.05, then declare the null hypothesis to be rejected. No lying or suppression of data required. They can always do this, even if the null hypothesis is true: After collecting $n$ data points, they have a 0.05 chance of seeing p<0.05. If they don't, they then collect $n K$ more data points, where $K$ is big enough that whatever happened with the first $n$ data points makes little difference to the p-value, so there's still about a 0.05 chance that p<0.05. If that doesn't produce a rejection, they collect $n K^{2}$ more data points, and so on until they manage to get p<0.05, which is guaranteed to happen eventually with probability 1.

This is true for one hypothesis. It is NOT true if you know the alternative hypothesis. That is to say: suppose you are checking the p-value BOTH for the null hypothesis bias=0.5, AND for the alternative hypothesis bias=0.55. You check both p-values and see which is smaller. Now it is no longer true that you can keep collecting more data until your desired hypothesis wins; if the truth is bias=0.5, then after enough flips, the alternative hypothesis will never win again, and will always have an astronomically small p-value.

To repeat: yes, you can disprove bias=0.5 with p<0.05; but at the time this happens, the alternative hypothesis of bias=0.55 might be disproven at $p<10^{-100}$. You are no longer guaranteed to win when there are two hypotheses rather than one.
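As a rough numerical illustration of the two-hypothesis comparison, the sketch below computes a normal-approximation p-value against each bias for one hypothetical data set (5,100 heads in 10,000 flips). The data, the function names, and the choice of a z-test are assumptions made for the example, not something taken from the discussion above.

```python
import math


def p_value(heads, n, bias):
    """Two-sided normal-approximation p-value for H0: P(heads) = bias."""
    z = (heads - n * bias) / math.sqrt(n * bias * (1 - bias))
    return math.erfc(abs(z) / math.sqrt(2))


def compare_hypotheses(heads, n, null_bias=0.50, alt_bias=0.55):
    """Return the p-values against the null bias and the alternative bias."""
    return p_value(heads, n, null_bias), p_value(heads, n, alt_bias)


if __name__ == "__main__":
    # Hypothetical data: just enough excess heads to reject bias=0.5 at 0.05.
    p_null, p_alt = compare_hypotheses(heads=5100, n=10_000)
    print(f"p-value against bias=0.50: {p_null:.4f}")
    print(f"p-value against bias=0.55: {p_alt:.3e}")
```

For this input the p-value against bias=0.5 falls just under 0.05, while the p-value against bias=0.55 is many orders of magnitude smaller, so the alternative does not "win" even though the null is nominally rejected.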

But they aren't guaranteed to eventually get a Bayesian to think the null hypothesis is likely to be false, when it is actually true.

Importantly, this is false! This statement is wrong if you have only one hypothesis rather than two.

More specifically, I claim that if a sequence of coin flip outcomes disproves bias=0.5 at some p-value p, then for the same sequence of coin flips, there exists a bias b such that the likelihood ratio between bias b and bias 0.5 is $O (1 / p) : 1$ . I'm not sure what the exact constant in the big-O notation is (I was trying to calculate it, and I think it's at most 10). Suppose it's 10. Then if you have p=0.001, you'll have likelihood ratio 100:1 for some bias.

Therefore, to get the likelihood ratio as high as you wish, you could employ the following strategy. First, flip coins until the p value is very low, as you described. Then stop, and analyze the sequence of coin flips to determine the special bias b in my claimed theorem above. Then publish a paper claiming "the bias of the coin is b rather than 0.5, here's my super high likelihood ratio". This is guaranteed to work (with enough coinflips).

(Generally, if the number of coin flips is N, the bias b will be on the order of $1/2 \pm O(1/\sqrt{N})$, so it will be pretty close to 1/2; but once again, this is no different from what happens with the frequentist case, because to ensure the p-value is small you'll have to accept the effect size being small.)
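The relationship between the p-value and the best-case likelihood ratio can be explored numerically. The sketch below takes the special bias b to be the maximum-likelihood estimate heads/N (an assumption, since the comment does not say how b is chosen) and uses a normal-approximation p-value; the function names and the example data are hypothetical. It does not settle the constant in the big-O claim, it only lets you compare the two quantities for a given input.

```python
import math


def p_value_vs_half(heads, n):
    """Two-sided normal-approximation p-value for H0: bias = 0.5."""
    z = (2 * heads - n) / math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))


def best_bias_and_log_lr(heads, n):
    """Return (b_hat, log LR), where b_hat = heads / n maximizes the
    likelihood and log LR is the log likelihood ratio of b_hat vs 0.5."""
    b_hat = heads / n

    def log_lik(q):
        # Skip terms with zero count to avoid log(0) at the boundaries.
        total = 0.0
        if heads:
            total += heads * math.log(q)
        if n - heads:
            total += (n - heads) * math.log(1 - q)
        return total

    return b_hat, log_lik(b_hat) - log_lik(0.5)


if __name__ == "__main__":
    heads, n = 5100, 10_000  # hypothetical data
    b_hat, log_lr = best_bias_and_log_lr(heads, n)
    p = p_value_vs_half(heads, n)
    print(f"b_hat = {b_hat:.3f} (1/sqrt(N) = {1 / math.sqrt(n):.3f})")
    print(f"p-value = {p:.4f}, max likelihood ratio = {math.exp(log_lr):.2f}:1")
```

For this input b_hat sits exactly 1/sqrt(N) above 1/2, the p-value is a little under 0.05, and the best-case likelihood ratio is roughly 7:1.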

Replies from: Radford Neal

↑ comment by Radford Neal · 2023-10-22T13:00:45.253Z · LW(p) · GW(p)

OK. I think we may agree on the technical points. The issue may be with the use of the word "Bayesian".

Me: But they aren't guaranteed to eventually get a Bayesian to think the null hypothesis is likely to be false, when it is actually true.

You: Importantly, this is false! This statement is wrong if you have only one hypothesis rather than two.

I'm correct, by the usual definition of "Bayesian", as someone who does inference by combining likelihood and prior. Bayesians always have more than one hypothesis (outside trivial situations where everything is known with certainty), with priors over them. In the example I gave, one can find a b such that the likelihood ratio with 0.5 is large, but the set of such b values will likely have low prior probability, so the Bayesian probably isn't fooled. In contrast, a frequentist "pure significance test" does involve only one explicit hypothesis, though the choice of test statistic must in practice embody some implicit notion of what the alternative might be.
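To make the role of the prior concrete, here is a sketch of a Bayes factor that compares a point null (bias exactly 0.5) against an alternative with a uniform prior over b. The uniform prior is purely an assumption for illustration; no particular prior is specified above, and a different prior would give a different number. Under that assumption the marginal likelihood of a specific sequence with k heads in N flips is the Beta function $B(k+1, N-k+1)$.

```python
import math


def log_bayes_factor_uniform_vs_half(heads, n):
    """Log Bayes factor for 'b ~ Uniform(0, 1)' against 'b = 0.5'.

    Marginal likelihood under the uniform prior:
        integral of b^k (1-b)^(n-k) db = k! (n-k)! / (n+1)!
    Likelihood under the null: 0.5^n.
    """
    log_marginal_alt = (
        math.lgamma(heads + 1)
        + math.lgamma(n - heads + 1)
        - math.lgamma(n + 2)
    )
    log_lik_null = n * math.log(0.5)
    return log_marginal_alt - log_lik_null


if __name__ == "__main__":
    # Hypothetical data that a two-sided test rejects at roughly p = 0.046.
    log_bf = log_bayes_factor_uniform_vs_half(heads=5100, n=10_000)
    print(f"Bayes factor (alternative : null) = {math.exp(log_bf):.3f}")
```

With 5,100 heads in 10,000 flips, this prior gives a Bayes factor of roughly 11 to 1 in favor of the null, even though a likelihood ratio computed at the single best-fitting b favors the alternative. That is the sense in which spreading prior mass over many values of b keeps the Bayesian from being fooled.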

Beyond this, I'm not really interested in debating to what extent Yudkowsky did or did not understand all nuances of this problem.

Replies from: LGS

↑ comment by LGS · 2023-10-22T22:05:20.285Z · LW(p) · GW(p)

A platonically perfect Bayesian given complete information and with accurate priors cannot be substantially fooled. But once again this is true regardless of whether I report p-values or likelihood ratios. p-values are fine.
