# On stopping rules

post by Anders_H · 2015-08-02T21:38:08.617Z · LW · GW · Legacy · 9 comments

(tl;dr: In this post I try to explain why I think the stopping rule of an experiment matters. It is likely that someone will find a flaw in my reasoning. That would be a great outcome as it would help me change my mind.  Heads up: If you read this looking for new insight you may be disappointed to only find my confusion)

(Edited to add: Comments by Manfred and Ike seem to point correctly to the critical flaws in my reasoning. I will try to update my intuition over the next few days)

In the post "Don't You Care If It Works Part 1" on the Main section of this website, Jacobian writes:

> A few weeks ago I started reading beautiful probability and immediately thought that Eliezer is wrong about the stopping rule mattering to inference. I dropped everything and spent the next three hours convincing myself that the stopping rule doesn't matter and I agree with Jaynes and Eliezer. As luck would have it, soon after that the stopping rule question was the topic of discussion at our local LW meetup. A couple people agreed with me and a couple didn't and tried to prove it with math, but most of the room seemed to hold a third opinion: they disagreed but didn't care to find out. I found that position quite mind-boggling. Ostensibly, most people are in that room because we read the sequences and thought that this EWOR (Eliezer's Way Of Rationality) thing is pretty cool. EWOR is an epistemology based on the mathematical rules of probability, and the dude who came up with it apparently does mathematics for a living trying to save the world. It doesn't seem like a stretch to think that if you disagree with Eliezer on a question of probability math, a question that he considers so obvious it requires no explanation, that's a big frickin' deal!

First, I'd like to point out that the mainstream academic term for Eliezer's claim is The Strong Likelihood Principle.  In the comments section, a vigorous discussion of stopping rules ensued.

My own intuition is that the strong likelihood principle is wrong. Moreover, there exist a small number of people whose opinion I give a higher level of credence than Eliezer's, and some of those people also disagree with him. For instance, I've been present in the room when a distinguished Professor of Biostatistics at Harvard stated matter-of-factly that the principle is trivially wrong. I also observed that he was not challenged on this by another full Professor of Biostatistics who is considered an expert on Bayesian inference.

So at best, the fact that Eliezer supports the strong likelihood principle is a single data point, i.e. pretty weak Bayesian evidence. I do however value Eliezer's opinion, and in this case I recognize that I am confused. Being a good rationalist, I'm going to take that as an indication that it is time for The Ritual. Writing this post is part of my "ritual": It is an attempt to clarify exactly why I think the stopping condition matters, and determine whether those reasons are valid. I expect a likely outcome is that someone will identify a flaw in my reasoning. This will be very useful and help improve my map-territory correspondence.

--

Suppose there are two coins in existence, both of which are biased: Coin A comes up heads with probability 2/3 and tails with probability 1/3, whereas Coin B comes up heads with probability 1/3 and tails with probability 2/3. Someone gives me a coin without telling me which one it is, and my goal is to figure out whether it is Coin A or Coin B. My prior is that they are equally likely.

There are two statisticians who both offer to do an experiment:  Statistician 1 says that he will flip the coin 20 times and report the number of heads.    Statistician 2 would really like me to believe that it is Coin B, and says he will terminate the experiment whenever there are more tails than heads. However, since Statistician 2 is kind of lazy and doesn't have infinite time, he also says that if he reaches 20 flips he is going to call it quits and give up.

Both statisticians do the experiment, and both experiments end up with 12 heads and 8 tails. I trust both Statisticians to be honest about the experimental design and the stopping rules.

In the experiment of Statistician 1, the probability of getting this outcome if you have Coin A was 0.1480, whereas the probability of getting this outcome if it was Coin B was 0.0092. The likelihood ratio is therefore exactly 16 (the binomial coefficients cancel, leaving 2^12 / 2^8), and the posterior probability of Coin A (after converting the prior to odds, applying the likelihood ratio and converting back to probability) is 16/17, or about 0.94.
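For anyone who wants to check the arithmetic for Statistician 1, here is a minimal sketch in Python. It assumes nothing beyond the setup above (two coins, a 50/50 prior, 12 heads in 20 flips); the function and variable names are my own and purely illustrative. Working in exact fractions avoids any rounding drift in the ratio.

```python
from fractions import Fraction
from math import comb


def binomial_likelihood(heads, flips, p_heads):
    """P(exactly `heads` heads in `flips` independent flips)."""
    tails = flips - heads
    return comb(flips, heads) * p_heads**heads * (1 - p_heads)**tails


# Heads probabilities for the two coins, as given in the post.
p_a = Fraction(2, 3)
p_b = Fraction(1, 3)

like_a = binomial_likelihood(12, 20, p_a)
like_b = binomial_likelihood(12, 20, p_b)

# Prior odds of 1:1, updated by the likelihood ratio, then converted
# back to a probability.
ratio = like_a / like_b
prior_odds = Fraction(1, 1)
posterior_odds = prior_odds * ratio
posterior_a = posterior_odds / (1 + posterior_odds)

print(float(like_a), float(like_b), ratio, float(posterior_a))
```

Evaluated exactly, the likelihood ratio is 16 and the posterior for Coin A is 16/17.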

In the experiment of Statistician 2, however, I can't just use the binomial distribution because there is an additional data point which is not Bernoulli, namely the number of coin flips. I therefore have to calculate, for both Coin A and Coin B, the probability that he would not terminate the experiment prior to the 20th flip, and that at that stage he would have 12 heads and 8 tails. Since the probability of reaching 20 flips is much higher for Coin A than for Coin B, the likelihood ratio would be much higher than in the experiment of Statistician 1.

This should not be unexpected: If Statistician 2 gives me data that supports the hypothesis which his stopping rule was designed to discredit, then that data is stronger evidence than similar data coming from the neutral Statistician 1.

In other words, the stopping rule matters. Yes, all the evidence in the trial is still in the likelihood ratio, but the likelihood ratio is different because there is an additional data point.   Not considering this additional data point is statistical malpractice.

comment by Manfred · 2015-08-02T23:22:00.630Z · LW(p) · GW(p)

You are a little off :P

Let's do a toy problem. Suppose that Statistician 2 is extra-lazy, and will only flip the coin three times, this time stopping if they ever have more heads than tails. And suppose that, again, they end up using up all the flips and have more tails than heads - in this case, 2 tails and 1 head. Every time, they must get tails first, or else they would immediately stop, and then they either get the next two flips Heads-Tails or Tails-Heads - they can only get the sequences THT or TTH.

So P(THT+TTH | Coin A) = 4/27, while P(THT+TTH | Coin B) = 8/27. So statistician 2 will record twice as many of this result from coin B as from coin A. Thus statistician 2 claims that the probability of it being coin B is 2/3.

Compare this to Statistician 1: P(1 heads and 2 tails | Coin A) = P(HTT+THT+TTH | Coin A) = 6/27, while P(HTT+THT+TTH | Coin B) = 12/27. Thus statistician 1 thinks the probability of it being coin B is 2/3. The two statisticians get the same results!
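The toy problem is small enough to brute-force. The sketch below is one way to do it, assuming the extra-lazy rule described above (stop as soon as heads outnumber tails, at most three flips) and a 50/50 prior; the helper names are made up for illustration.

```python
from fractions import Fraction
from itertools import product

# Heads probabilities for the two coins, as given in the post.
P_HEADS = {"A": Fraction(2, 3), "B": Fraction(1, 3)}


def sequence_probability(seq, p_heads):
    """Probability of one specific ordered sequence of flips."""
    prob = Fraction(1)
    for flip in seq:
        prob *= p_heads if flip == "H" else 1 - p_heads
    return prob


def reaches_last_flip(seq):
    """True if heads never outnumber tails before the final flip,
    i.e. the extra-lazy Statistician 2 would not have stopped early."""
    heads = tails = 0
    for flip in seq[:-1]:
        if flip == "H":
            heads += 1
        else:
            tails += 1
        if heads > tails:
            return False
    return True


def posterior_coin_b(sequences):
    """Posterior for Coin B under a 50/50 prior, given that one of
    `sequences` occurred."""
    like = {coin: sum(sequence_probability(s, p) for s in sequences)
            for coin, p in P_HEADS.items()}
    return like["B"] / (like["A"] + like["B"])


all_sequences = ["".join(flips) for flips in product("HT", repeat=3)]
stat1_sequences = [s for s in all_sequences if s.count("H") == 1]
stat2_sequences = [s for s in stat1_sequences if reaches_last_flip(s)]

print(stat1_sequences, posterior_coin_b(stat1_sequences))
print(stat2_sequences, posterior_coin_b(stat2_sequences))
```

Statistician 2 sees fewer admissible sequences, but every sequence with one head and two tails has the same probability under a given coin, so the posterior is unchanged.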

This is a general pattern - because the trials are independent, when statistician 2 compares how many times they get a result with coin A vs coin B, the ratio (and thus the likelihood ratio) will be the same as for statistician 1 - number 2 just only accepts a smaller number of possible sequences of flips - but all of those sequences have the same ratio of probabilities.

Remember P(A|B) = P(A) * P(B|A) / P(B). Here P(B|A) is the probability of getting some number of heads and tails given a specific coin, and P(B) the probability of getting that result averaged over both coins. The size of P(B|A) itself doesn't matter, only the ratio P(B|A)/P(B).

This is not to say that statistician 2 can't cheat. All they have to do is to not publish results with more tails than heads. Now if you update straightforwardly on the published results, on average statistician 2 has biased you towards coin A. The only way to counteract this is if you know that this is what they're doing, and can update on your observations, not just their published observations.

Replies from: Manfred
comment by Manfred · 2015-08-02T23:45:36.870Z · LW(p) · GW(p)

There's another toy example that might help too. Suppose Statistician 2 is willing to flip the coin 3 times, but gets heads on the first flip and stops there. Surely you can't accept this data, or else you're practically guaranteed to let Statistician 2 manipulate you, right?

Well, P(H | coin A) = 2/3, and P(H | coin B) = 1/3, so clearly "first flip heads" is an event that happens twice as often when it's coin A. What kind of scientist would you be if you couldn't derive evidence from an event that happens twice as often under some conditions?

The weird thing is that even though you can see the event "first flip heads," you'll never see the event "first flip tails." How come these individual data points are still "good bets," even though you'll never see the event of first flip tails? It seems like Statistician 2 has a sure-fire system for "beating the house" and convincing you no matter what.

Why am I suddenly making gambling analogies? Because Statistician 2 is trying to use a Martingale betting system. And at the end of the day, the house always wins - Statistician 2 has a large chance to submit a "biased towards heads" sample, but only at the cost of having their other samples be even more biased towards tails. On average, they are still accurate, just like how on average, you can't win money with a Martingale betting strategy.

In this analogy, publication bias is like running away without paying your gambling debts.
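The Martingale point can be checked exactly rather than taken on faith. The sketch below is only an illustration under stated assumptions: it uses the stopping rule from this comment (stop as soon as heads outnumber tails, or at a flip cap), a 50/50 prior, and invented function names. It walks every way the experiment can end, weights each ending's posterior for coin A by how often that ending occurs, and sums.

```python
from fractions import Fraction

# Heads probabilities for the two coins, as given in the post.
P_HEADS = {"A": Fraction(2, 3), "B": Fraction(1, 3)}
PRIOR_A = Fraction(1, 2)


def terminal_outcomes(max_flips):
    """Yield (heads, tails, like_a, like_b) for every ordered sequence
    at which the experiment ends: heads outnumber tails, or the cap
    on the number of flips is reached."""
    stack = [(0, 0, Fraction(1), Fraction(1))]
    while stack:
        heads, tails, like_a, like_b = stack.pop()
        if heads > tails or heads + tails == max_flips:
            yield heads, tails, like_a, like_b
            continue
        # Next flip is heads.
        stack.append((heads + 1, tails,
                      like_a * P_HEADS["A"], like_b * P_HEADS["B"]))
        # Next flip is tails.
        stack.append((heads, tails + 1,
                      like_a * (1 - P_HEADS["A"]),
                      like_b * (1 - P_HEADS["B"])))


def expected_posterior_a(max_flips):
    """Average posterior for coin A over all possible endings."""
    total = Fraction(0)
    for _, _, like_a, like_b in terminal_outcomes(max_flips):
        marginal = PRIOR_A * like_a + (1 - PRIOR_A) * like_b
        posterior_a = PRIOR_A * like_a / marginal
        total += marginal * posterior_a
    return total


def chance_of_ending_heads_ahead(max_flips):
    """Probability the submitted sample has more heads than tails."""
    total = Fraction(0)
    for heads, tails, like_a, like_b in terminal_outcomes(max_flips):
        if heads > tails:
            total += PRIOR_A * like_a + (1 - PRIOR_A) * like_b
    return total


print(expected_posterior_a(3), chance_of_ending_heads_ahead(3))
```

The expected posterior equals the prior whatever the flip cap is, even though the stopping rule makes "heads ahead" samples more common than they would be with a fixed number of flips.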

comment by IlyaShpitser · 2015-08-03T14:30:51.867Z · LW(p) · GW(p)

Was the professor in question Jamie?

Did you read Jamie's and Larry's counterexample where they construct a case where the propensity score is known exactly but the treatment/baseline/outcome model is too complex to bother w/ likelihood methods?

https://normaldeviate.wordpress.com/2012/08/28/robins-and-wasserman-respond-to-a-nobel-prize-winner/

Couldn't we extend this to longitudinal settings and just say MSMs are better than the parametric g-formula if the models for the latter are too complex? Would this not render the strong likelihood principle false? If you don't think causal inference problems are in the "right magisterium" for the likelihood principle, just consider missing data problems instead (same issues arise, in fact their counterexample is phrased as missing data).

Replies from: snarles, Manfred
comment by snarles · 2015-08-05T16:40:29.159Z · LW(p) · GW(p)

This is an interesting counterexample, and I agree with Larry that using priors which depend on pi(x) is really no Bayesian solution at all. But if this example is really so problematic for Bayesian inference, can one give an explicit example of some function theta(x) for which no reasonable Bayesian prior is consistent? I would guess that only extremely pathological and unrealistic examples theta(x) would cause trouble for Bayesians. What I notice about many of these "Bayesian non-consistency" examples is that they require consistency over very large function classes: hence they shouldn't really scare a subjective Bayesian who knows that any function you might encounter in the real world would be much better behaved.

In terms of practicality, it's certainly inconvenient to have to compute a non-parametric posterior just to do inference on a single real parameter phi. To me, the two practical aspects of actually specifying priors and actually computing the posterior remain the only real weakness of the subjective Bayesian approach (or the Likelihood principle more generally.)

PS: Perhaps it's worth discussing this example as its own thread.

comment by Manfred · 2015-08-03T18:15:20.520Z · LW(p) · GW(p)

It's not obvious to me that they got the Bayesian analysis right in that blog post. If you can have "no observation" for Y, it seems like what we actually observe is some Y' that can take on the values {0,1,null}, and the probability distribution over our observations of the variables (X,R,Y') is p(X) * P(R|X) * P(Y'|X,R).

EDIT: Never mind, it's not a problem. Even if it was, it wouldn't have changed their case that the Bayesian update won't give you this "uniform consistency" property. Which seems like something worth looking into.

As for this "low information" bull-hockey, let us put a MML prior over theta(x) and never speak of it again.

comment by Vaniver · 2015-08-05T18:01:12.542Z · LW(p) · GW(p)

I'm going to approach this from a slightly different perspective than Manfred, which may or may not help at updating your intuitions. (It boils down to the same argument, but with less example and more terminology.)

The stopping rule gives you no information above the full data. That is, not "12 heads and 8 tails", which is just a summary statistic, but a twenty character string of "T"s and "H"s. The stopping rule can give you information about the full data, given the summary statistic. If I know that Statistician 2 got 12 heads and 8 tails, I can rule out some of the sequences of 20 flips that have 12 heads and 8 tails, because I know that statistician 2 never would have gotten them, because they would have stopped midway.

(This is what Manfred illustrated by highlighting that the only possible sequences are THT and TTH, even though HTT has the same summary statistic.)
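The same bookkeeping works for the 20-flip experiment in the original post. As a rough sketch (assuming the post's rule for Statistician 2: stop as soon as tails outnumber heads, give up at 20 flips; the function name is hypothetical), one can count how many orderings of 12 heads and 8 tails survive the stopping rule and then compare likelihoods.

```python
from fractions import Fraction
from math import comb


def count_surviving_paths(heads_target, tails_target):
    """Count orderings that reach the target counts while tails never
    outnumber heads before the final flip."""
    # paths[(h, t)] = number of orderings reaching h heads and t tails
    # without the experiment having stopped earlier.
    paths = {(0, 0): 1}
    for _ in range(heads_target + tails_target):
        next_paths = {}
        for (h, t), n in paths.items():
            if t > h:
                continue  # the experiment would already have stopped
            for dh, dt in ((1, 0), (0, 1)):
                key = (h + dh, t + dt)
                next_paths[key] = next_paths.get(key, 0) + n
        paths = next_paths
    return paths.get((heads_target, tails_target), 0)


p_a, p_b = Fraction(2, 3), Fraction(1, 3)
all_orderings = comb(20, 12)                  # Statistician 1
surviving = count_surviving_paths(12, 8)      # Statistician 2


def likelihood(n_orderings, p_heads):
    """P(some admissible ordering of 12 heads and 8 tails occurs)."""
    return n_orderings * p_heads**12 * (1 - p_heads)**8


ratio_stat1 = likelihood(all_orderings, p_a) / likelihood(all_orderings, p_b)
ratio_stat2 = likelihood(surviving, p_a) / likelihood(surviving, p_b)

print(all_orderings, surviving, ratio_stat1, ratio_stat2)
```

The stopping rule removes many orderings, which lowers the likelihood under both coins by the same factor, so the ratio between them does not move.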

comment by ike · 2015-08-03T00:00:06.128Z · LW(p) · GW(p)

You need to condition on the actual sequence, not just the total number of heads and tails. If you redo the math with a specific sequence, you'll find the stopping rule doesn't matter.

The reason is that one statistician has more possible sequences with the same number of flips than the other, so conditioning only on number of flips gives different inferences.

comment by DanielLC · 2015-08-03T05:44:22.746Z · LW(p) · GW(p)

Here's how I look at it. Suppose you want to prove A, so you look for evidence until either you can prove it for p = 0.05, or it's definitely false. Let E be this experiment proving A, and !E be disproving it. P(A|E) = 0.95, and P(A|!E) = 0. Let's assume the prior for A is P(A) = 0.5.

P(A|E) = 0.95

P(A|!E) = 0

P(A) = 0.5

By conservation of expected evidence, P(A|E)P(E) + P(A|!E)P(!E) = P(A) = 0.5

0.95 P(E) = 0.5

P(E) = 0.526

So the experiment is more likely to succeed than fail. Even though A has even odds of being true, you can prove it more than half the time. It sounds like you're cheating somehow, but the thing to remember is that there are false positives but no false negatives. All you're doing is proving probably A more than definitely not A, and probably A is more likely.

But P(A|E) = 0.95. That was an assumption here. Had the probability been different, P(E) would have been different.
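To make the dependence on that assumption concrete, here is a small sketch that solves the conservation-of-expected-evidence equation for P(E). The function name and the second example value (0.99) are my own additions for illustration, not part of the argument above.

```python
from fractions import Fraction


def probability_of_success(prior_a, posterior_if_success, posterior_if_failure):
    """Solve P(A|E) P(E) + P(A|!E) (1 - P(E)) = P(A) for P(E)."""
    return ((prior_a - posterior_if_failure)
            / (posterior_if_success - posterior_if_failure))


# The numbers used above: P(A) = 0.5, P(A|E) = 0.95, P(A|!E) = 0.
p_e = probability_of_success(Fraction(1, 2), Fraction(95, 100), Fraction(0))

# A stricter (hypothetical) success threshold makes success rarer.
p_e_strict = probability_of_success(Fraction(1, 2), Fraction(99, 100), Fraction(0))

print(float(p_e), float(p_e_strict))
```

The closer P(A|E) is pushed towards 1, the closer P(E) falls towards the prior of 0.5.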

comment by solipsist · 2015-08-03T02:26:41.498Z · LW(p) · GW(p)

Thank you for writing this post. I've had the same intuition as you about stopping rules for over a year now, but have never taken time to sit down and work it out. I look forward to working through the comments!