# Don't You Care If It Works? - Part 1

post by Jacobian · 2015-07-29T14:32:14.045Z · score: 6 (14 votes) · 60 comments

# Part 1 - Epistemic

## Prologue - other people

Psychologists at Harvard showed that most people have implicit biases about several groups. Some other Harvard psychologists were subjects of this study proving that psychologists undervalue CVs with female names. All Harvard psychologists have probably heard about the effect of black names on resumes since even we have. Surely every psychology department in this country starting with Harvard will only review CVs with the names removed? Fat chance.

## Caveat lector et scriptor

A couple weeks ago I wrote a poem that makes aspiring rationalists feel better about themselves. Today I'm going to undo that. Disclaimers:

- This is written with my charity meter set to 5%. Every other paragraph is generalizing from anecdotes and typical-mind-fallacying.
- A lot of the points I make have been made before, and better. You should really close this tab and read those other links instead; I won't judge you.
- I'm not going to write in an academic style with a bibliography at the end; I'm going to write in the sarcastic style my blog would have if I weren't too lazy to start one.
- I'm also not trying to prove any strong empirical claims. This is BYOE: bring your own evidence. Imagine every sentence starting with "I could be totally wrong" if that makes it more digestible.
- Inasmuch as any accusations in this post are applicable, they apply to me as well.

My goal is to get you worried, because I'm worried. If you read this and you're not worried, you should be. If you are, good!

## Disagree to disagree

*Edit: in the next paragraph, "Bob" was originally an investment advisor. My thanks to 2irons and Eliezer who pointed out why this is literally the worst example of a job I could give to argue my point.*

Is 149 a prime? Take as long as you need to convince yourself (by math or by Google) that it is. Is it unreasonable to have 99.9...% confidence with quite a few nines (and an occasional 7) in there? Now let's say that you have a tax accountant, Bob, a decent guy that seems to be doing a decent job filing your taxes. You start chatting with Bob and he reveals that he's pretty sure that 149 isn't a prime. He doesn't know two numbers whose product is 149, it just feels unprimely to him. You try to reason with him, but he just chides you for being so arrogant in your confidence: can't you just agree to disagree on this one? It's not like either of you is a number theorist. His job is to not get you audited by the IRS, which he does, not factorize numbers. Are you a little bit worried about trusting Bob with your taxes? What if he actually claimed to be a mathematician?
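If you would rather convince yourself by code than by Google, trial division settles it in a few lines:

```python
def is_prime(n: int) -> bool:
    """Trial division: n is prime iff no d with 2 <= d*d <= n divides it."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

print(is_prime(149))  # True: no divisor up to 12 divides 149
```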

A few weeks ago I started reading Beautiful Probability and immediately thought that Eliezer is wrong about the stopping rule mattering to inference. I dropped everything and spent the next three hours convincing myself that the stopping rule doesn't matter and I agree with Jaynes and Eliezer. As luck would have it, soon after that the stopping rule question was the topic of discussion at our local LW meetup. A couple people agreed with me and a couple didn't and tried to prove it with math, but most of the room seemed to hold a third opinion: they disagreed but didn't care to find out. I found that position quite mind-boggling. Ostensibly, most people are in that room because we read the sequences and thought that this EWOR (Eliezer's Way Of Rationality) thing is pretty cool. EWOR is an epistemology based on the mathematical rules of probability, and the dude who came up with it apparently does mathematics for a living trying to save the world. *It doesn't seem like a stretch to think that if you disagree with Eliezer on a question of probability math*, a question that he considers so obvious it requires no explanation, *that's a big frickin' deal!*

## Authority screens off that other authority you heard from afterwards

This is a chart that I made because I got excited about learning ggplot2 in R. On the right side of the chart are a lot of bright red dots below the very top: people who believe in MIRI but also read the quantum physics sequence and don't think that MWI is very likely. Some of them understood the question of P(MWI) to be about whether MWI is the one and only exact truth, but I'm sure that several of them read it the way I did, roughly as: 1-P(collapse is true given current evidence). A lot of these people are congratulating themselves on avoiding cultishness. In the comments they mention other bloggers (or maybe even physicists!) who think that collapse is totally Beatles and MWI is Bieber.

Hold on, why did Eliezer even take all this time to write a huge quantum physics sequence? Here's how I see it: It's not to settle a point about some scientific dispute. It's to show that a person with some spare time and a little above average IQ can figure out something beyond people of massive authority, on the very subject of their authority, even if that subject is as scary as QM. You know what the median IQ of the dots on the chart above is? 139 points! 139!!! And you know who has spare time? Someone who read 100% of the sequences!

Why should you even care about this? Because Eliezer does. He's so fanatically confident about MWI not for its own sake, it's because he's saying (at least to my ears): "This is my EWOR epistemology. It may not make you skinny and rich, but if it will do just one thing for you it will prevent you from being very very confident in beliefs that could be proven wrong. So if I'm very very confident about a question of science or math, I'm sticking my neck out and betting my entire epistemology on being right, so you can check it for yourself". In fact, Eliezer will totally judge *your* epistemology based on your belief in MWI. Won't you do the same for him when he asks you to?

I'm a bright red dot on the chart above. Here are two statements I believe from authority because they are beyond my cognitive powers and domain knowledge to assess as arguments: "strong AI will break out of any box humans can design", "CEV is the best basis for programming morality in AI". I believe them because I tested Eliezer on stopping rules and MWI and 20 other things, and came away convinced each time. My confidence of 98% in MWI is an upper bound on my confidence in whatever Eliezer tells me about anything. If I became convinced that collapse is correct or that stopping rules affect Bayesian inference I'll email Nate Soares at once and ask for a refund on my MIRI donation.

By the way, this would be a good time to stop reading and give MIRI some money. That’s OK, I'll wait.

## The good book

A man comes to you with a book and says that it will grant you the only wisdom worth having, and as a side effect it may save your eternal soul. You read the book cover to cover and decide that the ideas you thought were nice are probably true, the ones you didn't aren't, and you really like the bit with horses. Everyone on LW makes fun of you for claiming to take seriously something you don't. Y'all see where this is going, don't you? Yes, it's fun to read the sequences for the "insight porn". It's also fun to read the Old Testament for the porn porn. But, maybe it could be more? Wouldn't it be kinda cool if you could read a book and become an epistemic superman, showing up experts in their own domains and being proven right? Or maybe some important questions are going to come up in your life and you'll need to know the actual true answers? Or at least some questions you can bet $20 on with your friends and win?

Don't you want to know if this thing even works?

## To be continued

Part 2 is here. In it: whining is ceased, arguments are argued about, motivations are explained, love is found, and points are taken.

## 60 comments


It is true that optional stopping won't change Bayes rule updates (which is easy enough to show). It's also true that optional stopping does affect frequentist tests (different sampling distributions). The broader question is "which behavior is better?"
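Both halves of that are easy to check on the classic example of 9 successes and 3 failures (numbers assumed here for illustration): the two designs give proportional likelihoods, so the Bayesian posterior is identical, while the frequentist p-value against p = 0.5 changes with the stopping rule. A sketch:

```python
from math import comb

# Classic example: 9 successes and 3 failures, testing H0: p = 0.5
# for a Bernoulli process. Numbers assumed for illustration.
s = 9  # successes observed; the 12th trial was the 3rd failure

# Bayesian side: either design gives a likelihood proportional to
# p^9 * (1-p)^3, so with the same prior the posteriors are identical.

# Frequentist side: the p-value depends on the design.
# Design A: n = 12 flips fixed in advance; P(>= 9 successes | H0).
p_fixed_n = sum(comb(12, k) for k in range(s, 13)) / 2**12

# Design B: flip until the 3rd failure (which landed on trial 12);
# P(>= 9 successes before the 3rd failure | H0).
p_stop_at_f = 1 - sum(comb(k + 2, 2) * 0.5 ** (k + 3) for k in range(s))

print(round(p_fixed_n, 4), round(p_stop_at_f, 4))  # 0.073 0.0327
```

Same data, same Bayesian update, but the p-value crosses the 0.05 line in one design and not the other.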

p-hacking is when statisticians use optional stopping to make their results look more significant (by not reporting their stopping rule). As it turns out you in fact can "posterior hack" Bayesians - http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2374040

Edit: Also Deborah Mayo's Error Statistics book contains a demonstration that optional stopping can cause a Bayesian to construct confidence intervals that never contain the true parameter value. Weirdly, those Bayesians can be posterior hacked even if you tell them about the stopping rule, because they don't think it matters.

> p-hacking is when statisticians use optional stopping to make their results look more significant (by not reporting their stopping rule). As it turns out you in fact can "posterior hack" Bayesians - http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2374040

That is not my understanding of the term "optional stopping" (nor, more significantly, is it that of Jaynes). Optional stopping is the process of collecting data, computing your preferred measure of resultiness as you go, and stopping the moment it passes your criterion for reporting it, whether that is p<0.05, or a Bayes factor above 3, or anything else. (If it never passes the criterion, you just never report it.) That is but one of the large arsenal of tools available to the p-hacker: computing multiple statistics from the data in the hope of finding one that passes the criterion, thinking up more hypotheses to test, selective inclusion or omission of "outliers", fitting a range of different models, and so on. And of these, optional stopping is surely the least effective, for as Jaynes remarks in "Probability Theory as Logic", it is practically impossible to sample long enough to produce substantial support for a hypothesis deviating substantially from the truth.

All of those other methods of p-hacking involve concealing the real hypothesis, which is the collection of all the hypotheses that were measured against the data. It is like dealing a bridge hand and showing that it supports astoundingly well the hypothesis that *that* bridge hand would be dealt. In machine learning terms, the hypothesis is being covertly trained on the data, then tested on how well it fits the data. No measure of the latter, whether frequentist or Bayesian, is a measure of how well the hypothesis will fit new data.

If you look at the paper, what you call optional stopping is what the authors called "data peeking."

In their simulations, the authors first took a sample of 20 and calculated the Bayes factor, and then could selectively continue to add data up to 30 (stopping when they reach an "effect" or 30 samples). The paper's point is that this does skew the Bayes factor (it doubles the chances of managing to get a Bayes factor > 3).
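A rough simulation of that scheme, using simple point hypotheses (H0: p = 0.5 is true, H1: p = 0.6) rather than the paper's t-test setup, so the numbers are illustrative only:

```python
import random
from math import log

# Peeking rule: look at the Bayes factor after 20 flips and after each
# additional flip up to 30, stopping as soon as it exceeds 3.
# H0 (p = 0.5) is true throughout; H1 says p = 0.6 (assumed hypotheses).
random.seed(1)
LOG3 = log(3)

def log_bf(heads, n):
    # log of P(data | H1) / P(data | H0)
    return heads * log(0.6 / 0.5) + (n - heads) * log(0.4 / 0.5)

trials = 20_000
peek_hits = fixed_hits = 0
for _ in range(trials):
    flips = [random.random() < 0.5 for _ in range(30)]  # H0 is true
    heads = 0
    peeked = False
    for n in range(1, 31):
        heads += flips[n - 1]
        if n >= 20 and log_bf(heads, n) > LOG3:
            peeked = True
            break
    peek_hits += peeked
    fixed_hits += log_bf(sum(flips), 30) > LOG3

print(peek_hits / trials, fixed_hits / trials)  # peeking rate is higher
```

On the same simulated data, the peeking rule reports "evidence for H1" noticeably more often than the fixed n = 30 analysis, even though H0 is true in every run.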

It skews the Bayes factor *when the hypothesis is in fact not true*. The times that the hypothesis is true should balance out to make the calibration correct overall.

In practice what p-hacking is about is convincing the world of an effect, so you are trying to create bias toward any data looking like a novel effect. Stopping rules/data peeking accomplish this just as much for Bayes as for frequentist inference (though if the frequentist knows about the stopping rule they can adjust in a way that bayesians can't), which is my whole point.

Whether or not the Bayesian calibration is overall correct depends not just on the Bayes factor but the prior.

> Whether or not the Bayesian calibration is overall correct depends not just on the Bayes factor but the prior.

It depends *only* on the prior. I consider all these "stopping rule paradoxes" disguised cases where you give the Bayesian a bad prior, and the frequentist formula encodes a better prior.

> In practice what p-hacking is about is convincing the world of an effect, so you are trying to create bias toward any data looking like a novel effect.

You still wouldn't have more chances of showing a novel effect than you thought you would when you went into the experiment, if your priors are correct. If you say "I'll stop when I have a novel effect", do this many times, and then look at all the times you found a novel effect, 95% of the time the effect should actually be true. If this is wrong, you *must* have bad priors.

> It depends only on the prior. I consider all these "stopping rule paradoxes" disguised cases where you give the Bayesian a bad prior, and the frequentist formula encodes a better prior.

Then you are doing a very confusing thing that isn't likely to give much insight. Frequentist inference and Bayesian inference are different and it's useful to at least understand both ideas (even if you reject frequentism).

Frequentists are bounding their error with various forms of the law of large numbers, they aren't coherently integrating evidence. So saying the "frequentist encodes a better prior" is to miss the whole point of how frequentist statistics works.

And the point in the paper I linked has nothing to do with the prior, it's about the bayes factor, which is independent of the prior. Most people who advocate Bayesian statistics in experiments advocate sharing bayes factors, not posteriors in order to abstract away the problem of prior construction.

> And the point in the paper I linked has nothing to do with the prior, it's about the bayes factor, which is independent of the prior.

Let me put it differently. Yes, your chance of getting a Bayes factor of >3 is 1.8% with data peeking, as opposed to 1% without; but your chance of getting a higher factor also goes down, because you stop as soon as you reach 3. Your expected Bayes factor is necessarily 1 weighted over your prior; you expect to find evidence for neither side. Changing the exact distribution of your results won't change that.
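For reference, one identity in this neighborhood that does hold exactly (the thread below hashes out which averages it does and doesn't apply to): the expected Bayes factor for H1 over H0, averaged over data generated under H0, is 1 for any fixed experiment, because the P(data | H0) terms cancel and P(data | H1) sums to 1. A numeric check with assumed hypotheses:

```python
from math import comb

# Check E[BF10 | H0] = 1 for a fixed experiment of n = 10 flips,
# with assumed hypotheses H0: p = 0.5 and H1: p = 0.7.
n, p0, p1 = 10, 0.5, 0.7

def lik(p, k):
    # P(k heads in n flips | coin with bias p)
    return comb(n, k) * p**k * (1 - p) ** (n - k)

expected_bf = sum(lik(p0, k) * (lik(p1, k) / lik(p0, k)) for k in range(n + 1))
print(expected_bf)  # 1.0 up to floating-point error
```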

> Your expected bayes factor is necessarily 1

Should that say, rather, that its expected log is zero? A factor of n being as likely as a factor of 1/n.

My original response to this was wrong and has been deleted

I don't think this has anything to do with logs, but rather that it is about the difference between probabilities and odds. Specifically, the Bayes factor works on the odds scale but the proof for conservation of expected evidence is on the regular probability scale

If you consider the posterior under all possible outcomes of the experiment, the ratio of the posterior probability to the prior probability will on average be 1 (when weighted by the probability of the outcome under your prior). However, the ratio of the posterior probability to the prior probability is not the same thing as the Bayes factor.

If you multiply the Bayes factor by the prior odds, and then transform the resulting quantity (ie the posterior) from the odds scale to a probability, and then divide by the prior probability, the resulting quantity will on average be 1

However, this is too complicated and doesn't seem like a property that gives any additional insight on the Bayes factor.

That's probably a better way of putting it. I'm trying to intuitively capture the idea of "no expected evidence", you can frame that in multiple ways.

Huh? E[X] = 1 and E[log(X)] = 0 are two very different claims; which one are you actually claiming?

Also, what is the expectation with respect to? Your prior or the data distribution or something else?

I'm claiming the second. I was framing it in my mind as "on average, the factor will be 1", but on further thought the kind of "average" required is the average of the log. I should probably use log in the future for statements like that.

Also, what is the expectation with respect to?

The prior.

This seems wrong then. Imagine you have two hypotheses, which you place equal probability on, but then will see an observation that definitively selects one of the two as correct. E[p(x)] = 1/2 both before and after the observation, but E[log p(x)] is -1 vs -infinity.

In that case, your Bayes Factor will be either 2/0, or 0/2.

Log of the first is infinity, log of the second is negative infinity.

The average of those two numbers is *insert handwave here* 0.

(If you use the formula for log of divisions, this actually works).

Replace 1/2 and 1/2 in the prior with 1/3 and 2/3, and I don't think you can make them cancel anymore.

I think we need to use actual limits then, instead of handwaving infinities. So let's say the posterior for the unfavored hypothesis is e->0 (and is the same for both sides). The Bayes factor for the first hypothesis being confirmed is then (1-e)*3/(3e/2), which http://www.wolframalpha.com/input/?i=%281-e%29*3%2F%283e%2F2%29 simplifies to 2/e - 2. The Bayes factor for the second being confirmed is 3e/((1-e)3/2), which is again simplified http://www.wolframalpha.com/input/?i=3e%2F%28%281-e%293%2F2%29 to (2e)/(1-e).

Now, let me digress and derive the probability of finding evidence for each hypothesis; it's almost but not quite 1/3:2/3. There's a prior of 1/3 of the first hypothesis being true; this must equal the weighted expectation of the posteriors, by conservation of evidence. So if we call x the chance of finding evidence for hypothesis one, then x*(1-e)+(1-x)*e must equal 1/3. http://www.wolframalpha.com/input/?i=x*%281-e%29%2B%281-x%29*e%3D1%2F3+solve+for+x solves

x = (1-3 e)/(3-6 e)

which as a sanity check, does in fact head towards 1/3 as e goes towards 0. The corresponding probability of finding evidence for the second hypothesis is 1-x=(2-3 e)/(3-6 e).

Getting back to expected logs of Bayes factors, the chance of getting a bayes factor of 2/e - 2 is (1-3 e)/(3-6 e), while the chance of getting (2e)/(1-e) is (2-3 e)/(3-6 e).

Log of the first, times its probability, plus log of the second, times its probability, is http://www.wolframalpha.com/input/?i=log+%282%2Fx+-+2%29*+%281-3+x%29%2F%283-6+x%29%2Blog%28%282x%29%2F%281-x%29%29*+%282-3+x%29%2F%283-6+x%29%2Cx%3D.01 not zero.
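A quick numeric spot-check of the algebra above, using the same symbols (prior 1/3 vs 2/3, posteriors 1-e and e, with e = 0.01 as in the last link):

```python
from math import log

e = 0.01
x = (1 - 3 * e) / (3 - 6 * e)  # P(evidence favors hypothesis one)

# Conservation of expected posterior: x*(1-e) + (1-x)*e = 1/3
assert abs(x * (1 - e) + (1 - x) * e - 1 / 3) < 1e-12

# Expected log Bayes factor, as computed in the Wolfram Alpha link:
expected_log_bf = x * log(2 / e - 2) + (1 - x) * log(2 * e / (1 - e))
print(expected_log_bf)  # about -0.87: not zero, matching the comment
```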

Hm. I'll need to think this over, this wasn't what I expected. Either I made some mistake, or am misunderstanding something here. Let me think on this for a bit.

Hopefully I'll update this soon with an answer.

I think it's not going to work out. The expected posterior is equal to the prior, but the expected log Bayes factor will have the form p log(K1) + (1-p) log(K2), which for general p is just a mess. Only when p=1/2 does it simplify to log(K1 K2), and when p=1/2, K2=1/K1, so the whole thing is zero.

Okay, so I think I worked out where my failed intuition came from. The Bayes factor is the ratio of posterior to prior for hypothesis A, divided by the same ratio for hypothesis B. The top of that is expected to be 1 (because the expected posterior over the prior is one, and factoring out the prior in each case keeps that fraction constant), and so is the bottom (same argument), but the expected ratio of two numbers each expected to be one is not always one. So my brain turned "denominator and numerator are each expected to be one" into "ratio is expected to be one".

> I think it's not going to work out. The expected posterior is equal to the prior, but the expected log Bayes factor will have the form p log(K1) + (1-p) log(K2), which for general p is just a mess. Only when p=1/2 does it simplify to log(K1 K2), and when p=1/2, K2=1/K1, so the whole thing is zero.

> Your expected bayes factor is necessarily 1 weighted over your prior; you expect to find evidence for neither side.

I think this claim is correct on the natural scale except it should be weighted over probability of the data, not weighted over the prior. The margin of this comment is too small to contain the proof, so I'll put a pdf in my public drop box folder at https://www.dropbox.com/s/vmom25u9ic7redu/Proof.pdf?dl=0

(I am slightly out of my depth here, I am not a mathematician or a Bayesian theorist, so I reserve the right to delete this comment if someone spots a flaw)

> Then you are doing a very confusing thing that isn't likely to give much insight. Frequentist inference and Bayesian inference are different and it's useful to at least understand both ideas (even if you reject frequentism).

I think I understand frequentism. My claim here was that the specific claim of "the stopping rule paradox proves that frequentism does better than Bayes" is wrong, or is no stronger than the standard objection that Bayes relies on having good priors.

> So saying the "frequentist encodes a better prior" is to miss the whole point of how frequentist statistics works.

What I meant is that you can get the same results as the frequentist in the stopping rule case if you adopt a particular prior. I might not be able to show that rigorously, though.

> And the point in the paper I linked has nothing to do with the prior, it's about the bayes factor, which is independent of the prior.

That paper only calculates what happens to the bayes factor when the null is true. There's nothing that implies the inference will be wrong.

There are a couple different version of the stopping rule cases. Some are disguised priors, and some don't affect calibration/inference or any Bayesian metrics.

> That paper only calculates what happens to the bayes factor when the null is true. There's nothing that implies the inference will be wrong.

That is the practical problem for statistics (the null is true, but the experimenter desperately wants it to be false). Everyone wants their experiment to be a success. The goal of this particular form of p-hacking is to increase the chance that you get a publishable result. The goal of the p-hacker is to increase the probability of type 1 error. A publication rule based on Bayes factors instead of p-values is still susceptible to optional stopping.

You seem to be saying that a rule based on posteriors would not be susceptible to such hacking?

> You seem to be saying that a rule based on posteriors would not be susceptible to such hacking?

I'm saying that all inferences are still correct. So if your prior is correct/well calibrated, then your posterior is as well. If you end up with 100 studies that all found an effect for different things at a posterior of 95%, 5% of them should be wrong.
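That calibration claim can be sketched in simulation, with assumed point hypotheses and an aggressive stopping rule: draw the truth from the prior, stop as soon as the posterior for the effect reaches 75% (or give up after 500 flips), and look at how often the effect is real among the runs that "succeeded":

```python
import random
from math import log

random.seed(2)
# Assumed hypotheses: H1 says p = 0.6, H0 says p = 0.5; prior is 50/50.
L_H, L_T, LOG3 = log(0.6 / 0.5), log(0.4 / 0.5), log(3)

stopped = truths = 0
for _ in range(4000):
    h1_true = random.random() < 0.5  # truth drawn from the prior
    p = 0.6 if h1_true else 0.5
    log_odds = 0.0                   # even prior odds
    for _ in range(500):
        log_odds += L_H if random.random() < p else L_T
        if log_odds >= LOG3:         # posterior for H1 >= 75%: publish!
            stopped += 1
            truths += h1_true
            break

print(truths / stopped)  # close to (a bit above) 0.75, not 0.5
```

Despite every run peeking after every single flip, the reported 75% posteriors are not overconfident; under a correct prior, optional stopping changes which runs stop, not the calibration.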

> The goal of the p-hacker is to increase the probability of type 1 error.

So what I should say is that the Bayesian doesn't care about the frequency of type 1 errors. If you're going to criticise that, you can do so without regard to stopping rules. I gave an example in a different reply of hacking bayes factors, now I'll give one with hacking posteriors:

Two kinds of coins: one fair, one 10%H/90%T. There are 1 billion of the fair ones, and 1 of the other kind. You take a coin, flip it 10 times, then say which coin you think it is. The Bayesian gets the biased coin, and no matter what he flips, will conclude that the coin is fair with overwhelming probability. The frequentist gets the coin, get ~9 tails, and says "no way is this fair". There, the frequentist does better because the Bayesian's prior is bad (I said there are a billion fair ones and only one biased one, but only looked at the biased ones).
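The arithmetic for that example, assuming the ten flips came up 9 tails and 1 head:

```python
# 1e9 fair coins vs 1 biased (10% heads) coin; data: 9 tails, 1 head.
prior_odds_fair = 1e9
lik_fair = 0.5 ** 10         # P(this sequence | fair coin)
lik_biased = 0.9 ** 9 * 0.1  # P(this sequence | 10%H / 90%T coin)

posterior_odds_fair = prior_odds_fair * lik_fair / lik_biased
print(f"{posterior_odds_fair:.2e}")  # ~2.5e7 : 1, still in favor of fair
```

The data favor the biased coin by a factor of about 40, but that is nowhere near enough to overcome billion-to-one prior odds.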

It doesn't matter if you always conclude with 95% posterior that the null is false when it is true, as long as you have 20 times as many cases that the null is actually false. Yes, this opens you up to being tricked; but if you're worried about deliberate deception, you should include a prior over that. If you're worried about publication bias when reading other studies, include a prior over that, etc.

> I'm saying that all inferences are still correct. So if your prior is correct/well calibrated, then your posterior is as well. If you end up with 100 studies that all found an effect for different things at a posterior of 95%, 5% of them should be wrong.

But that is based on the posterior.

When I ask for clarification, you seem to be doing two things:

- changing the subject to posteriors
- asserting that a perfect prior leads to a perfect posterior.

I think 2 is uncontroversial, other than if you have a perfect prior why do any experiment at all? But it is also not what is being discussed. The issue is that with optional stopping you bias the Bayes factor.

As another poster mentioned, expected evidence is conserved. So let's think of this like a frequentist who has a laboratory full of bayesians in cages. Each Bayesian gets one set of data collected via a standard protocol. Without optional stopping, most of the Bayesians get similar evidence, and they all do roughly the same updates.

With optional stopping, you'll create either short sets of stopped data that support the favored hypothesis or very long sets of data that fail to support the favored hypothesis. So you might be able to create a rule that fools 99 out of the 100 Bayesians, but the remaining Baysian is going to be very strongly convinced of the disfavored hypothesis.

Where the Bayesian wins over the frequentist is that if you let the Bayesians out of the cages to talk, and they share their likelihood ratios, they can coherently combine evidence and the 1 correct Bayesian will convince all the incorrect Bayesians of the proper update. With frequentists, fewer will be fooled, but there isn't a coherent way to combine the confidence intervals.

So the issue for scientists writing papers is that if you are a Bayesian who adopts the second, optional-stopped experimental protocol (let's say it really can fool 99 out of 100 Bayesians), then at least 99 out of 100 of the experiments you run will be a success (some of the effects really will be real). The 1/100 that fails miserably doesn't have to be published.

Even if it is published, if two experimentalists both average to the truth, the one who paints most of his results as experimental successes probably goes further in his career.

> With frequentists, fewer will be fooled, but there isn't a coherent way to combine the confidence intervals.

Can't frequentists just pool their *data* and then generate a new confidence interval from the supersized sample?

> I think 2 is uncontroversial, other than if you have a perfect prior why do any experiment at all?

By perfect I mean well calibrated. I don't see why knowing that your priors in general are well calibrated implies that more information doesn't have positive expected utility.

> The issue is that with optional stopping you bias the Bayes factor.

Only in some cases, and only with regard to someone who knows more than the Bayesian. The Bayesian himself can't predict that the factor will be biased; the expected factor should be 1. It's only someone who knows better that can predict this.

> So let's think of this like a frequentist who has a laboratory full of bayesians in cages.

Before I analyse this case, can you clarify whether the hypothesis happens to be true, false, or chosen at random? Also give these Bayesians' priors, and perhaps an example of the rule you'd use.

> Before I analyse this case, can you clarify whether the hypothesis happens to be true, false, or chosen at random? Also give these Bayesians' priors, and perhaps an example of the rule you'd use.

Again, the prior doesn't matter, they are computing Bayes factors. We are talking about Bayes factors. Bayes factors. Prior doesn't matter. Bayes factors. Prior.Doesn't.Matter. Bayes factors. Prior.Doesn't.Matter. Bayes.factor.

Let's say the null is true, but the frequentist mastermind has devised some data-generating process (let's say he has infinite data at his disposal) that can produce evidence in favor of the competing hypothesis at a Bayes factor of 3, 99% of the time.

> Again, the prior doesn't matter, they are computing Bayes factors.

It matters here, because you said "So you might be able to create a rule that fools 99 out of the 100 Bayesians". The probability of getting data given a certain rule depends on which hypothesis is true, and if we're assuming the hypothesis is like the prior, then we need to know the prior to calculate those numbers.

> Let's say the null is true, but the frequentist mastermind has devised some data-generating process (let's say he has infinite data at his disposal) that can produce evidence in favor of the competing hypothesis at a Bayes factor of 3, 99% of the time.

That's impossible. http://doingbayesiandataanalysis.blogspot.com/2013/11/optional-stopping-in-data-collection-p.html goes through the math.

> Using either Bayesian HDI with ROPE, or a Bayes factor, the false alarm rate asymptotes at a level far less than 100% (e.g., 20-25%). In other words, using Bayesian methods, the null hypothesis is accepted when it is true, even with sequential testing of every datum, perhaps 75-80% of the time.

In fact, you can show easily that this can succeed at most 33% of the time. By definition, the Bayes factor is how likely the data is given one hypothesis, divided by how likely the data is given the other. The data in the class "results in a bayes factor of 3 against the null" has a certain chance of happening given that the null is true, say p. This class of course contains many individual mutually exclusive sets of data, each with a far lower probability, but they sum to p. Now, the chance of this class of possible data sets happening given that the null is not true has an upper bound of 1. Each individual probability (which collectively sum to at most 1) must be 3 times as much as the corresponding probability in the group that sums to p. Ergo, p is upper bounded by 33%.
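That 1/3 bound can be checked exactly (no simulation noise) with a small dynamic program, again using assumed point hypotheses H0: p = 0.5 (true) and H1: p = 0.6, testing after every flip up to a cap of 200:

```python
from math import log

# Exact P(the Bayes factor for H1 ever reaches 3 | H0 true), testing
# after every flip up to a cap. The argument above says this can
# never exceed 1/3, no matter how aggressive the stopping rule is.
L_H, L_T, LOG3 = log(0.6 / 0.5), log(0.4 / 0.5), log(3)

alive = {0: 1.0}  # heads so far -> probability of paths not yet stopped
p_hit = 0.0
for n in range(1, 201):
    nxt = {}
    for h, p in alive.items():
        nxt[h] = nxt.get(h, 0.0) + 0.5 * p          # next flip: tails
        nxt[h + 1] = nxt.get(h + 1, 0.0) + 0.5 * p  # next flip: heads
    alive = {}
    for h, p in nxt.items():
        if h * L_H + (n - h) * L_T >= LOG3:
            p_hit += p  # stopped: reports "evidence for H1"
        else:
            alive[h] = p

print(p_hit)  # comfortably below 1/3
```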

I think this is problem dependent.

In simulation with a coin flip I start to asymptote around 20%, but estimating the mean of a normal distribution with fixed variance (the null being mean 0), I keep climbing indefinitely. If you are willing to sample literally forever it seems like you'd be able to convince the Bayesian that the mean is not 0 with an arbitrary Bayes factor. So for a large enough N in a sample, I expect you can get a factor of 3 for 99/100 of the Bayesians in cages (so long as that last Bayesian is really, really sure the value is 0).

But it doesn't change the results if we switch and say we fool 33% of the Bayesians with Bayes factor of 3. We are still fooling them.

> If you are willing to sample literally forever it seems like you'd be able to convince the Bayesian that the mean is not 0 with an arbitrary Bayes factor.

No, there's a limit on that as well. See http://www.ejwagenmakers.com/2007/StoppingRuleAppendix.pdf

> Instead, as pointed out by Edwards et al. (1963, p. 239): "(...) if you set out to collect data until your posterior probability for a hypothesis which unknown to you is true has been reduced to .01, then 99 times out of 100 you will never make it, no matter how many data you, or your children after you, may collect (...)".

If you can generate arbitrarily high Bayes factors, then you can reduce your posterior to .01, which means that it can only happen 1 in 100 times. You can never have a guarantee of always getting strong evidence for a false hypothesis. If you find a case that does, it will be new to me and probably change my mind.

> But it doesn't change the results if we switch and say we fool 33% of the Bayesians with Bayes factor of 3. We are still fooling them.

That doesn't concern me. I'm not going to argue for why, I'll just point out that if it is a problem, it has absolutely nothing to do with optional stopping. The exact same behavior (probability 1/3 of generating a Bayes factor of 3 in favor of a false hypothesis) shows up in the following case: a coin either always lands on heads, or lands on heads 1/3 of the time and tails 2/3 of the time. I flip the coin a single time. Let's say the coin is the second coin. There's a 33% chance of getting heads, which would produce a Bayes factor of 3 in favor of the 100%H coin.
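Spelled out with exact fractions:

```python
from fractions import Fraction

# Two coins: one always lands heads, one lands heads 1/3 of the time.
# A single flip of the second coin happens to land heads.
p_heads_always = Fraction(1)
p_heads_third = Fraction(1, 3)

bayes_factor = p_heads_always / p_heads_third
print(bayes_factor)   # 3: a Bayes factor of 3 for the wrong coin
print(p_heads_third)  # 1/3: how often coin 2 produces this result
```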

If there's something wrong with that, it's a problem with classic Bayes, *not* optional stopping.

It is my thesis that every optional stopping so-called paradox can be converted into a form without optional stopping, and those will be clearer as to whether the problem is real or not.

No, there's a limit on that as well. See http://www.ejwagenmakers.com/2007/StoppingRuleAppendix.pdf

I can check my simulation for bugs. I don't have the referenced textbook to check the result being suggested.

It is my thesis that every optional stopping so-called paradox can be converted into a form without optional stopping, and those will be clearer as to whether the problem is real or not.

The first part of this is trivially true. Replace the original distribution with the sampling distribution from the stopped problem, and it's no longer a stopped problem, it's normal pulls from that sampling distribution.

I'm not sure it's more clear; in fact I think it is not. Your "remapped" problem makes it look like it's a result of low data volume and not a problem of how the sampling distribution was actually constructed.

I don't have the referenced textbook to check the result being suggested.

You can see http://projecteuclid.org/euclid.aoms/1177704038, which proves the result.

Replace the original distribution with the sampling distribution from the stopped problem, and it's no longer a stopped problem, it's normal pulls from that sampling distribution.

How would this affect a frequentist?

I'm not sure it's more clear; in fact I think it is not. Your "remapped" problem makes it look like it's a result of low data volume and not a problem of how the sampling distribution was actually constructed.

I'm giving low data because those are the simplest kinds of cases to think of. If you had lots of data with the same distribution/likelihood, it would be the same. I leave it as an exercise to find a case with lots of data and the same underlying distribution ...

I was mainly trying to convince you that nothing's actually wrong with having 33% false positive rate in contrived cases.

How would this affect a frequentist?

It doesn't; the frequentist is already measuring with the sampling distribution. That is how frequentism works.

I was mainly trying to convince you that nothing's actually wrong with having 33% false positive rate in contrived cases.

I mean, it's not "wrong", but if you care about false positive rates and there is a method that has a 5% false positive rate, wouldn't you want to use that instead?

If for some reason low false positive rates were important, sure. That is, if it's important enough to give up consistency for.

I comment on this discussion here: http://lesswrong.com/r/discussion/lw/mke/on_stopping_rules/

Since I don't want this to spiral into another stopping rule argument, allow me to try and dissolve a confusing point that the discussions get stuck on.

What makes Bayesian "lose" in the cases proposed by Mayo and Simonsohn isn't the inference, it's the scoring rule. A Bayesian scores himself on total calibration, "number of times my 95% confidence interval includes the truth" is just a small part of it. You can generate an experiment that has a high chance (let's say 99%) of making a Bayesian have a 20:1 likelihood ratio in favor of some hypothesis. By conservation of expected evidence, the same experiment might have 1% chance of generating close to a 2000:1 likelihood ratio **against** that same hypothesis. A frequentist could never be as sure of anything, this occasional 2000:1 confidence is the Bayesian's reward. If you rig the rules to view something about 95% confidence intervals as the only measure of success, then the frequentist's *decision* *rule* about accepting hypotheses at a 5% p-value wins, it's not his inference that magically becomes superior.

Allow me to steal an analogy from my friend Simon: I'm running a Bayesian Casino in Vegas. Debrah Mayo comes to my casino every day with $31. She bets $1 on a coin flip, then bets $2 if she loses, then $4 and so on, until she either wins $1 or loses all $31 if 5 flips go against her. I obviously think that by conservation of expected money in a coin flip this deal is fair, but Prof. Mayo tells me that I'm a sucker because I lose more days than I win. I tell her that I care about dollars, not days, but she replies that if she had more money in her pocket, she could make sure I have a losing day with arbitrarily high probability! I smile and ask her if she wants a drink.
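The casino story is easy to check by simulation (a sketch; the fair coin and the $1/$2/$4/$8/$16 doubling are as in the story above):

```python
import random

def martingale_day(rng):
    """One of Mayo's days: bet $1 on a fair flip, double after each loss,
    stop on the first win (net +$1) or after losing all five bets (-$31)."""
    lost = 0
    for stake in (1, 2, 4, 8, 16):
        if rng.random() < 0.5:     # she wins this flip
            return stake - lost    # the win covers prior losses: always +$1
        lost += stake
    return -lost                   # five losses in a row: -$31

rng = random.Random(1)
days = [martingale_day(rng) for _ in range(100_000)]
share_winning_days = sum(d > 0 for d in days) / len(days)  # ~31/32 of days
mean_daily_result = sum(days) / len(days)                  # ~$0: the deal is fair
```

She wins about 31 days out of 32, yet her expected daily result is zero, which is exactly the days-versus-dollars point.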

You can generate an experiment that has a high chance (let's say 99%) of making a Bayesian have a 20:1 likelihood ratio in favor of some hypothesis.

This is wrong, unless I've misunderstood you. Imagine the prior for hypothesis H is p, hence the prior for ~H is 1-p. If you have a 99% chance of generating a 20:1 likelihood for H, then your prior must be bounded below by .99*(20p/(19p+1)). (The second factor is the posterior for H after a 20:1 likelihood ratio.) So we have the inequality p > .99*(20p/(19p+1)), which I was too lazy to solve by hand and used http://www.wolframalpha.com/input/?i=p%3E+.99*%2820p%29%2F%2819p%2B1%29%2C+0%3Cp%3C1 to solve, which tells me that p must be at least 0.989474.

So you can only expect to generate strong evidence for a hypothesis if you're already pretty sure of it, which is just as it should be.

I may have bungled these calculations, doing them quickly, though.
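The bound can also be checked without Wolfram Alpha (a sketch of the same calculation; `posterior_after_lr20` is just Bayes' rule with a 20:1 likelihood ratio):

```python
def posterior_after_lr20(p):
    """Posterior P(H) after a 20:1 likelihood ratio for H, from prior p:
    odds go from p/(1-p) to 20p/(1-p), i.e. probability 20p/(19p + 1)."""
    return 20 * p / (19 * p + 1)

# Conservation of expected evidence: the prior must be at least 0.99 times
# the posterior reached in the 99%-likely branch, p >= 0.99 * 20p/(19p + 1).
# Rearranging 19p + 1 >= 0.99 * 20 gives the lower bound:
lower_bound = (0.99 * 20 - 1) / 19   # = 18.8/19, about 0.989474
```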

Edit: removed for misunderstanding ike's question and giving an irrelevant answer. Huge thanks to ike for teaching me math.

That's exactly what I used it for in my calculation, I didn't misunderstand that. Your computation of "conservation of expected evidence" simply does not work unless your prior is extremely high to begin with. Put simply, you cannot be 99% sure that you'll later change your current belief in H of p to anything greater than 100*p/99, which places a severe lower bound on p for a likelihood ratio of 20:1.

Yes! It worked! I learned something by getting embarrassed online!!!

ike, you're absolutely correct. I applied conservation of expected evidence to likelihood ratios instead of to posterior *probabilities*, and thus didn't realize that the prior puts bounds on expected likelihood ratios. This also means that the numbers I suggested (1% of 1:2000, 99% of 20:1) define the prior precisely at 98.997%.

I'm going to leave the fight to defend the reputation of Bayesian inference to you and go do some math exercises.
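The "prior defined precisely at 98.997%" figure can be verified numerically (a sketch; conservation of expected evidence says the prior equals the probability-weighted average of the two possible posteriors):

```python
def posterior(p, lr):
    """Posterior P(H) from prior p after a likelihood ratio of lr for H."""
    odds = lr * p / (1 - p)
    return odds / (1 + odds)

# Solve p = 0.99 * posterior(p, 20) + 0.01 * posterior(p, 1/2000) by
# bisection: below the root the expected posterior exceeds the prior.
lo, hi = 0.5, 1 - 1e-9
for _ in range(100):
    mid = (lo + hi) / 2
    if 0.99 * posterior(mid, 20) + 0.01 * posterior(mid, 1 / 2000) > mid:
        lo = mid
    else:
        hi = mid
prior = (lo + hi) / 2   # about 0.98997
```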

A Bayesian scores himself on total calibration, "number of times my 95% confidence interval includes the truth" is just a small part of it. You can generate an experiment that has a high chance (let's say 99%) of making a Bayesian have a 20:1 likelihood ratio in favor of some hypothesis. By conservation of expected evidence, the same experiment might have 1% chance of generating close to a 2000:1 likelihood ratio against that same hypothesis. A frequentist could never be as sure of anything, this occasional 2000:1 confidence is the Bayesian's reward.

Hold on. Let's say I hire a Bayesian statistician to produce some estimate for me. I do not care about "scoring" or "reward", all I care about is my estimate and how accurate it is. Now you are going to tell me that in 99% of the cases your estimate will be wrong *and that's fine* because there is a slight chance that you'll be really really sure of the opposite conclusion?

I'm running a Bayesian Casino in Vegas. Debrah Mayo comes to my casino every day with $31.

Why, that's such a *frequentist* approach X-/

Let's change the situation slightly. You are running the Bayesian Casino and Debrah Mayo comes to your casino **once** with, say, $1023 in her pocket. Will I lend you money to bet against her? No, I will not. The distribution matters beyond simple expected means.

Reminds me of this bit from a Wasserman paper: http://ba.stat.cmu.edu/journal/2006/vol01/issue03/wasserman.pdf

van Nostrand: Of course. I remember each problem quite clearly. And I recall that on each occasion I was quite thorough. I interrogated you in detail, determined your model and prior and produced a coherent 95 percent interval for the quantity of interest.

Pennypacker: Yes indeed. We did this many times and I paid you quite handsomely.

van Nostrand: Well earned money I’d say. And it helped win you that Nobel.

Pennypacker: Well they retracted the Nobel and they took away my retirement savings.

... van Nostrand: Whatever are you talking about?

Pennypacker: You see, physics has really advanced. All those quantities I estimated have now been measured to great precision. Of those thousands of 95 percent intervals, only 3 percent contained the true values! They concluded I was a fraud.

van Nostrand: Pennypacker you fool. I never said those intervals would contain the truth 95 percent of the time. I guaranteed coherence not coverage!

Now you are going to tell me that in 99% of the cases your estimate will be wrong

No. Your calibration is still perfect if your priors are perfect. You can only get to that "99% chance of getting strong evidence for the hypothesis" if you're already very sure of that hypothesis (see ike's math above).

What makes Bayesian "lose" in the cases proposed by Mayo and Simonsohn isn't the inference, it's the scoring rule. A Bayesian scores himself on total calibration, "number of times my 95% confidence interval includes the truth" is just a small part of it. You can generate an experiment that has a high chance (let's say 99%) of making a Bayesian have a 20:1 likelihood ratio in favor of some hypothesis. By conservation of expected evidence, the same experiment might have 1% chance of generating close to a 2000:1 likelihood ratio against that same hypothesis. A frequentist could never be as sure of anything, this occasional 2000:1 confidence is the Bayesian's reward. If you rig the rules to view something about 95% confidence intervals as the only measure of success, then the frequentist's decision rule about accepting hypotheses at a 5% p-value wins, it's not his inference that magically becomes superior.

Sometimes we might care about "total calibration" I guess, but sometimes we care about being actually calibrated in the rationalist sense. Sometimes we want a 95% confidence interval to mean that doing this 100 times will include the true value about 95 times.

My point was this idea that the stopping rule doesn't matter is more complicated than calculating a Bayes factor and saying "look, the stopping rule doesn't change the Bayes factor."

My point was this idea that the stopping rule doesn't matter is more complicated than calculating a Bayes factor and saying "look, the stopping rule doesn't change the Bayes factor."

The stopping rule won't change the expectation of the Bayes factor.

Sometimes we want a 95% confidence interval to mean that doing this 100 times will include the true value about 95 times.

If your prior is correct, then your 95% credibility interval will, in fact, be well calibrated and be correct 95% of the time. I argued at length on tumblr that most or all of the force of the stopping rule objection to Bayes is a disguised "you have a bad prior" situation. If you're willing to ask the question that way, you can generate similar cases without stopping rules as well. For instance, imagine there are two kinds of coins: ones that land on heads 100% of the time, and ones that land on heads 20% of the time. (The rest are tails.) You have one flip with the coin. Oh, one more thing: I tell you that there are 1 billion coins of the first kind, and only one of the second kind.

You flip the coin once. It's easy to show that there's an overwhelming probability of getting a 5:1 likelihood ratio in favor of the first coin. Why is this problematic?

I can and have given a similar case for 95% credibility intervals as opposed to Bayes factors, which I'll write out if you're interested.
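Here is the arithmetic for that example (a sketch; note that a single head carries a likelihood ratio of 1/0.2 = 5 in favor of the always-heads type):

```python
# One billion always-heads coins, one coin that lands heads 20% of the time.
n_always, n_rare = 10**9, 1
prior_odds = n_always / n_rare          # 1e9 : 1 for the always-heads type

# A single flip is almost certain to land heads...
p_heads = (n_always * 1.0 + n_rare * 0.2) / (n_always + n_rare)

# ...and that head multiplies the odds by P(heads|always) / P(heads|rare).
likelihood_ratio = 1.0 / 0.2
posterior_odds = prior_odds * likelihood_ratio
```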

"Does this change your confidence in Bob managing your retirement investments?"

- If he never held himself out as a quant-based investor, or as using a method reliant on much analytical or quantitative research, I wouldn't worry about it.

Maybe he's been good at choosing ETFs because he's great at listening to investor and trader chat - he can feel which arguments are about to dominate the market and allocates capital accordingly. Maybe he sits by a high-performing proprietary trading team in his bank and you're piggybacking off all their trades at a fraction of the fee. As a fund manager, I know several other managers who would have no hope of following most of the articles on this website and who misunderstand probability at basic levels (this has been teased out by in-depth conversations on things like card counting, where they are high-conviction yet wrong), yet who I'd still have to concede are likely to continue outperforming me in the market because they are great at the parts that count.

I think this is put best by Nassim Taleb in Anti-Fragile:

“In one of the rare noncharlatanic books in finance, descriptively called What I Learned Losing A Million Dollars, the protagonist makes a big discovery. He remarks that a fellow called Joe Siegel, the most active trader in a commodity called “green lumber” actually thought that it was lumber painted green (rather than freshly cut lumber, called green because it had not been dried). And he made a living, even a fortune trading the stuff! Meanwhile the narrator was into theories of what caused the price of commodities to move and went bust.

The fact is that predicting the order flow in lumber and the price dynamics narrative had little to do with these details —not the same thing. Floor traders are selected in the most nonnarrative manner, just by evolution in the sense that nice arguments don’t make much difference.”

Perhaps I'm being a bit harsh focusing on an analogy but I think there might be a wider point. Producing the right or wrong answers in one domain isn't necessarily a useful predictor of someone's ability to produce the right or wrong answer in another - even when they are highly connected.

"Does somebody being right about X increase your confidence in their ability to earn excess returns on a liquid equity market?" has to be the worst possible question to ask about whether being right in one thing should increase your confidence about them being right elsewhere. Liquid markets are some of the hardest things in the entire world to outguess! Being right about MWI is *enormously* being easier than being right about what Microsoft stock will do relative to the rest of S&P 500 over the next 6 months.

There's a gotcha to the gotcha which is that you have to know from your own strength how hard the two problems are - financial markets are different from, e.g., the hard problem of conscious experience, in that we know exactly why it's hard to predict them, rather than just being confused. Lots of people don't realize that MWI is knowable. Nonetheless, going from MWI to Microsoft stock behavior is like going from 2 + 2 = 4 to MWI.

Lots of people don't realize that MWI is knowable.

What do you mean, "knowable"?

Showing that MWI is correct while other interpretations are not is straight-up Nobel material.

Showing that MWI is correct while other interpretations are not is straight-up Nobel material.

I don't think "MWI" is a useful abbreviation (unless you unpack it as "many world interpretation*s*" it implies it's singular, for example), and I'm not sure "correct" is the right word either. My understanding of Eliezer's argument is like so:

Interpretations of QM differ in how many postulates they have, and how difficult those postulates are to justify, but typically not in verifiable experimental predictions. (An interpretation that gets experimental predictions wrong should be dropped out of the running.)

One should have higher credence in simpler interpretations--that is, ones with fewer and less strange postulates.

The first major interpretation of QM adopted by mainstream science had additional costly postulates relative to later interpretations or the agnostic interpretation. That it is still taken seriously is evidence for the strength of first mover effects / the inadequacy of scientific philosophy in getting answers correct quickly.

We know from superposition that many quantum phenomena are better described by "branch both ways" than "branch one way xor the other," and supposing that the other branch is deleted rather than inaccessible requires an extra postulate, which should lower our credence in interpretations that have only one surviving world.

I don't think this argument implies that there is a unique surviving interpretation--but it will *never* be the case that there is a unique surviving interpretation, because I can take that interpretation and add an unobservable feature to it, creating a new interpretation. The principle by which those interpretations are reduced in prominence, when applied straightforwardly to QM, suggests that it's more likely that other branches of the wavefunction that we are not on continue to exist rather than disappearing because we are not there.

(Personally, I still identify with "shut up and calculate," where one refuses to answer statements about whether or not the other branches are real. But if one must conclude something, it's less jarring to conclude they are real than unreal.)

That all sounds very reasonable -- though I'm not as big a fan of William of Ockham as many people here -- but I still don't understand what Eliezer meant by "knowable".

In the absence of a straightforward empirical interpretation ("knowable" = pokable, measurable, analysable...), I associate this word with mushy theology (along the "Christ is knowable through your heart" lines).

I still don't understand what Eliezer meant by "knowable".

It's one thing to say "there are multiple competing interpretations," another to say "there is a simplest interpretation, and other, more burdensome variants," and still another to be able to point to the calculations that determine which explanation is simplest. I don't poll people on their interpretations of QM, but I am of the impression that many people who have opinions on QM interpretations aren't even aware that there's math that is useful for comparing interpretations. (Hence them not realizing that it's "knowable.")

There's also the point about lightness, and using the most likely explanation as the default explanation, instead of saying "sure, it's more likely, but not so much more likely enough that my pet theory is totally unacceptable in comparison, so I'm sticking with my pet theory."

aren't even aware that there's math that is useful for comparing interpretations

Notably, that "math" (and it's not just math) did not convince large chunks of the physics community -- people who are quite comfortable with numbers and who certainly know the "knowable" issue.

But this is an endless debate and it probably would be best not to step into the morass :-)

Upvoted for criticizing the actual point :)

I agree that Bob is probably a weak analogy because who knows how stock-pickers actually pick stocks? I hope I didn't construct a reverse intuition pump. Still, Eliezer's job is either a research mathematician or a philosopher of epistemology, depending on what matters to you. Both those jobs depend quite a bit on getting the rules of probability right. I think disagreeing with Eliezer on the rules of probability isn't something one can ignore.

You also hit an important point with the question: is Bob a quant researcher? Or more specifically, what's his epistemology in each domain? A few years ago I met Robert Aumann, everyone's favorite theist-mathematician. It's clear to me that Aumann has separate epistemologies for claims about God, Israeli politics and game theory. I read his book about games with incomplete information; he actually goes to the trouble of writing out every single proof, and it never says "God told me this is the upper bound". If Aumann tells me "I've proven with rigorous math that an AI boxing system is possible", I sit up and pay attention. If he tells me "AIs don't have souls so humans will always be superior", I won't take that argument seriously.

Eliezer claims and seems to have a single epistemology (EWOR) that he applies to all domains. Because of that, his accuracy in one domain directly reflects on the dependability of EWOR, and thus on his accuracy in other domains.

Is 149 a prime? Take as long as you need to convince yourself (by math or by Google) that it is. Is it unreasonable to have 99.9...% confidence with quite a few nines (and an occasional 7) in there? Now let's say that you have a tax accountant, Bob, a decent guy who seems to be doing a decent job filing your taxes. You start chatting with Bob and he reveals that he's pretty sure that 149 isn't a prime. He doesn't know two numbers whose product is 149, it just feels unprimely to him. You try to reason with him, but he just chides you for being so arrogant in your confidence: can't you just agree to disagree on this one? It's not like either of you is a number theorist. His job is to not get you audited by the IRS, which he does, not to factorize numbers. Are you a little bit worried about trusting Bob with your taxes? What if he actually claimed to be a mathematician?
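For the record, 149's primality is a few lines of trial division to settle (a sketch):

```python
def is_prime(n):
    """Trial division up to sqrt(n); for n = 149 this tries divisors 2..12."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True
```

`is_prime(149)` returns `True`: no divisor up to 12 works, so no amount of accountant intuition can change that.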

You know, about that.

**[deleted]**· 2015-07-30T15:30:23.470Z · score: 2 (2 votes) · LW · GW

Wouldn't it be kinda cool if you could read a book and become an epistemic superman, showing up experts wrong in their own domains and being proven right?

When has this actually happened? I mean the part where just some guy with an above-average IQ reads the Sequences and starts *out-predicting* professional domain experts.

Also, regarding MWI: I don't regard it as even *meaningful*, preferring to declare that fundamental ontology is forever underdetermined by experimental data, and I should just *shut up and calculate*. There are researchers who try to do experiments that would distinguish one theory of quantum ontology from another -- I don't have a right to an opinion until I've seen their data.

It's to show that a person with some spare time and a little above-average IQ can figure out something beyond people of massive authority, on the very subject of their authority, even if that subject is as scary as QM.

That's one theory. Another is that he read Deutsch and found him persuasive, and read Jaynes and found him persuasive, and noticed a nice consilience between MWI's determinism and Jaynes's probability-is-in-the-mind theory.

By trusting Eliezer on MWI, aren't you trusting both his epistemology and his mathematical intuition?

Eliezer believes that the MWI interpretation allows you to derive quantum physics without any additional hypotheses that add complexity, such as collapse or the laws of movement for Bohm's particles. But this belief is based on mathematical intuition, according to the article on the Born probabilities. Nobody knows *how* to derive the observations without additional hypotheses, but a lot of people such as Eliezer conjecture it's possible. Right?

I feel like this point must have been made many times before, as Eliezer's quantum sequence has been widely discussed, so maybe instead of a response I need a link to a previous conversation or a summary of previous conclusions.

But relating it to the point of your article... If Eliezer is wrong about quantum mechanics, should that lower my probability that his other epistemological views are correct? This is important because it affects whether or not I bother learning those views. The answer is "yes but not extremely", because I think if there's an error, it may be in the mathematical intuition.

To generalize a bit, it's hard to find pure tests of a single ability. Though your example of stopping rules is actually a pretty good one for understanding the meaning of all the probability symbols. But usually we should not be especially confused when someone with expertise is wrong about a single thing, since that single thing is probably not a pure test of that expertise. However, we *should* be confused if on average they are wrong as many times as people without the expertise. Then we have to doubt the expertise or our own judgments of the right and wrong answers.

Nobody knows how to derive the observations without additional hypotheses, but a lot of people such as Eliezer conjecture it's possible. Right?

I would not expect it to be possible to derive the observations without additional postulates; I think that it's possible to do it with any of some partially known set of possible postulates, and the hunt is on for the most palatable postulate. At the time that the QM sequence was written, Eliezer was aware of multiple proposed solutions, none of which he found fully satisfying.

For example, consider this new argument whose additional postulate is a specific version of 'locality.' I don't know whether or not Eliezer finds that one satisfying (note that MrMind has a whole list of limitations associated with that argument!).

Internal monologues set to paper are all well and good, but I got to the end of part I and there is still no point in sight... Of course, maybe I'm just blind.