Bayesianism in the face of unknowns

post by rstarkov · 2011-03-12T20:54:48.629Z · LW · GW · Legacy · 36 comments

Suppose I tell you I have an infinite supply of unfair coins. I pick one at random and flip it, recording the result. I've done this a total of 100 times, and all 100 flips came out heads. I will pay you $1000 if the next throw is heads, and $10 if it's tails. Each unfair coin is otherwise entirely normal: its "heads" follow a binomial distribution with an unknown p. This is all you know. How much would you pay to enter this game?

I suppose another way to phrase this question is "what is your best estimate of your expected winnings?", or, more generally, "how do you choose the maximum price you'll pay to play this game?"

Observe that the only information you have about the distribution from which I'm drawing my coins is those 100 outcomes. Importantly, you don't know the distribution of each coin's p in my supply of unfair coins. Can you reasonably assume a specific distribution for your calculation, and claim that the estimate it yields is better than what any other assumed distribution would give?

Most importantly, can one actually produce a "theoretically sound" expectation here? I.e. one that is calibrated so that if you pay your expected winnings every time and we perform this experiment lots of times then your average winnings will be zero - assuming I'm using the same source of unfair coins each time.

I suspect that the best one can do here is produce a range of values with confidence intervals. So you're 80% confident that the price you should pay to break even in the repeated game is between A80 and B80, 95% confident it's between A95 and B95, etc.

If this is really the best obtainable result, then what is a bayesian to do with such a result to make their decision? Do you pick a price randomly from a specially crafted distribution, one that is 95% likely to produce a value between A95 and B95, etc.? Or is there a more "bayesian" way?

36 comments

Comments sorted by top scores.

comment by Manfred · 2011-03-12T23:30:45.474Z · LW(p) · GW(p)

Since the only thing that matters (to your expected value) is the overall probability of heads vs. tails, the problem is a simple one of parameter estimation. You start with a distribution over p (determined by the principle of maximum entropy, probably just uniform in this case) and then update it with new evidence (for example, P(p=0.01) drops almost to 0 as soon as you see a head). After 100 heads you get a posterior that is very spiky near p=1. And yes, this is optimal.
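
Concretely, with a uniform prior this is just a conjugate Beta update. A minimal sketch (the choice of scipy, and the assumption that each drawn coin is flipped exactly once, are mine):

```python
from scipy import stats

# A uniform prior over p is Beta(1, 1); after observing 100 heads and
# 0 tails, the conjugate update gives Beta(1 + 100, 1 + 0) = Beta(101, 1).
posterior = stats.beta(101, 1)

print(posterior.mean())     # ~0.990: the posterior predictive P(heads) = 101/102
print(posterior.ppf(0.05))  # ~0.971: 95% of the posterior mass lies above this
```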

For any finite amount of data you won't perfectly break even using a bayesian method, but it's better than all the alternatives, as long as you don't leave out some data.

Replies from: Oscar_Cunningham, Timwi
comment by Oscar_Cunningham · 2011-03-13T12:05:16.168Z · LW(p) · GW(p)

For any finite amount of data you won't perfectly break even using a bayesian method, but it's better than all the alternatives, as long as you don't leave out some data.

What!? Are you saying that you can predict in advance that you'll lose money? Surely that can't happen, because you get to choose how much you want to pay, so you can always just pay less. No?

Replies from: Manfred, rstarkov, benelliott
comment by Manfred · 2011-03-14T00:10:01.196Z · LW(p) · GW(p)

For simplicity, let's imagine betting on a single coin with P(heads) = 0.9. You say "how much will you pay to win $1 if it lands heads?" and I say "50 cents," because at the start I am ignorant. You flip it and it lands heads. I just made 40 cents relative to the equilibrium value - the 90-cent fair price I'd pay if I knew the coin.

So it's not predictably losing money. It's predictably being wrong in an unpredictable direction.

comment by rstarkov · 2011-03-13T14:34:12.343Z · LW(p) · GW(p)

I read this as saying that you can't calculate a value that is guaranteed to break even in the long term, because there isn't enough information to do so (which I tend to agree with).

comment by benelliott · 2011-03-13T12:21:06.871Z · LW(p) · GW(p)

Perhaps he means that you break even once opportunity costs are taken into account, that is you won't win as much money as you theoretically could have won.

comment by Timwi · 2011-03-12T23:34:11.066Z · LW(p) · GW(p)

That sounds pretty much the same as what I said above.

Replies from: Manfred
comment by Manfred · 2011-03-13T03:37:38.238Z · LW(p) · GW(p)

Yup. Except maybe with a little more confidence that Bayes' rule applies here in the specific way of altering the probability distribution over p at each point.

comment by Alex Flint (alexflint) · 2011-03-13T12:43:16.318Z · LW(p) · GW(p)

I think the key point here is that there is uncertainty in both the distribution of coins and the outcome of any particular coin flip. The Bayesian answer is that you roll these two sources of uncertainty into a single uncertainty over the outcome of the next coin flip.

If this were a real world situation then your evidence would (ideally) include such unwieldy things as facts about your possible motivations, all the real-world facts about the physics of coin flips, and so on. Bayesianism tells us that there is a unique answer in the form of a probability for the next coin to be heads, but obviously the math to figure that out is enormously complicated.

If you don't want to deal with all this then you can pick some nicer mathematical starting point, such as assuming a uniform distribution over how biased the coins are, or a beta distribution, or some such. In that case, proceed as follows:

  • Let h be some hypothesis about what distribution the unfair coins follow

  • Write out P(h) according to the assumptions you made. A reasonable choice is the maximum entropy distribution (which in this case is uniform over the parameter p)

  • Let D be the data that you observed (100 heads). Write out P(D | h) - this will be a straightforward binomial distribution in this problem.

  • Write out P(h | D) - give a hurrah for Bayes' rule at this point

  • Let T be the event of getting tails on the next flip. Write out P(T | D) by marginalising over the hypotheses: P(T | D) = ∫ P(T, h | D) dh = ∫ P(T | h) P(h | D) dh

That's it!
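
A sketch of those five steps in Python, assuming the uniform maximum-entropy prior and approximating the integrals on a grid over p (the grid is my own device):

```python
import numpy as np

# Hypotheses h: a grid of candidate values for p = P(heads) of a drawn coin.
p = np.linspace(0, 1, 10001)

# P(h): uniform (maximum-entropy) prior over p.
prior = np.full_like(p, 1.0 / p.size)

# P(D | h): likelihood of the data D = 100 heads in 100 single flips.
likelihood = p ** 100

# P(h | D): Bayes' rule, i.e. normalise the product.
posterior = prior * likelihood
posterior /= posterior.sum()

# P(T | D): marginalise P(T | h) = 1 - p over the posterior.
p_tails = np.sum((1 - p) * posterior)
print(p_tails)  # ~0.0098, i.e. 1/102
```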

Replies from: rstarkov
comment by rstarkov · 2011-03-13T14:42:44.601Z · LW(p) · GW(p)

Bayesianism tells us that there is a unique answer in the form of a probability for the next coin to be heads

I'm obviously new to this whole thing, but is this a largely undebated, widely accepted view on probabilities? That there are NO situations in which you can't meaningfully state a probability?

For example, let's say we have observed 100 samples of a real-valued random variable. We can use the maximum entropy principle, and thus assume a normal distribution (which is the maximum-entropy distribution for a real-valued variable with a given mean and variance). We then use standard methods to estimate the population mean, and can even provide a probability that it lies in a certain interval.

But how valid is this result when we knew nothing of the original distribution? What if it was something awkward like the Cauchy distribution? It has no mean, so our interval is meaningless. You can't just say "well, we're 60% certain it's in this interval, which leaves a 40% chance of us being wrong" - because it doesn't: the mean isn't outside the interval either! A complete answer would allow for a third outcome, that the mean isn't defined, but how exactly do you assign a number to that probability?

With this in mind, do we still believe that it's not wrong (or less wrong? :D) to assume a normal distribution, make our calculations and decide how much you'd bet that the mean of the next 100,000 samples is in the range -100..100? (The sample mean of a Cauchy distribution never converges, no matter how many samples you add.)
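
To illustrate that last point, here is a quick simulation (the sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Running means of normal samples settle down; Cauchy running means never do.
normal_means = np.cumsum(rng.normal(size=n)) / np.arange(1, n + 1)
cauchy_means = np.cumsum(rng.standard_cauchy(size=n)) / np.arange(1, n + 1)

print(normal_means[[99, 9_999, 99_999]])  # converging toward 0
print(cauchy_means[[99, 9_999, 99_999]])  # still jumping around arbitrarily
```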

Replies from: Richard_Kennaway, alexflint
comment by Richard_Kennaway · 2011-03-16T12:39:10.227Z · LW(p) · GW(p)

I'm obviously new to this whole thing, but is this a largely undebated, widely accepted view on probabilities? That there are NO situations in which you can't meaningfully state a probability?

It does seem to be widely accepted and largely undebated. However, it is also widely rejected and largely undebated, for example by Andrew Gelman, Cosma Shalizi, Ken Binmore, and Leonard Savage (to name just the people I happen to have seen rejecting it -- I am not a statistician, so I do not know how representative these are of the field in general, or if there has actually been a substantial debate anywhere).

None of them except Ken Binmore actually presents arguments against it in the material I have read; they merely dismiss the idea of a universal prior as absurd. But in mathematics, only one thing is absurd, a contradiction, and by that standard only Ken Binmore has offered any mathematical arguments. He gives two in his book "Rational Decisions": one based on Gödel-style self-reference, and the other based on a formalisation of the concept of "knowing that" as the box operator of S5 modal logic. I haven't studied the first but am not convinced by the second, which fails at the outset by defining "I know that" as an extensional predicate. (He identifies a proposition P with the set of worlds in which it is true, and assumes that "I know that P" is a function of the set representing P, not of the syntactic form of P. Therefore by that definition of knowing, since I know that 2+2=4, I know every true statement of mathematics, since they are all true in all possible worlds.)

(ETA: Binmore's S5 argument can also be found online here.)

(ETA2: For those who don't have a copy of "Rational Decisions" to hand, here's a lengthy and informative review of it.)

These people distinguish "small-world" Bayesianism from "large-world" Bayesianism, they themselves being small-worlders. Large-worlders would include Eliezer, Marcus Hutter, and everyone else who believes in the possibility of a universal prior.

A typical small-world Bayesian argument would be: I hypothesise that a certain variable has a Gaussian distribution with unknown parameters over which I have a prior distribution; I observe some samples; I obtain a posterior distribution for the parameters. A large-world Bayesian also makes arguments of this sort and they both make the same calculations.

Where they part company is when the variable in fact does not have a Gaussian distribution. For example, suppose it is a sum of two widely separated Gaussians. According to small-worlders, the large-world Bayesian is stuck with his prior hypothesis of a single Gaussian, which no quantity of observations will force him to relinquish, since it is his prior. His estimate of the mean of the Gaussian will drift aimlessly up and down like the Flying Dutchman between the two modes of the real distribution, unable to see the world beyond his prior. According to large-worlders, that prior was not the real prior which one started from. That whole calculation was really conditional on the assumption of a Gaussian, and this assumption itself has a certain prior probability less than 1, and was chosen from a space of all possible hypothetical distributions. The small-worlders reply that this is absurd, declare victory, and walk away without listening to the large-worlders explain how to choose universal priors. Instead, small-worlders insist that to rectify the fault of having hypothesised the wrong model, one must engage in a completely different non-Bayesian activity called model-checking. Chapter 6 of Gelman's book "Bayesian Data Analysis" is all about that, but I haven't read it. There is some material in this paper by Gelman and Shalizi.

(ETA: I have now read Gelman ch.6. Model-checking is performed by various means, such as (1) eyeballing visualisations of the real data and simulated data generated by the model, (2) comparing statistics evaluated for both real and simulated data, or (3) seeing if the model predicts things that conflict with whatever other knowledge you have of the phenomenon being studied.)
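
(To make the two-Gaussian example concrete, here is a toy sketch; the particular numbers are mine:)

```python
import numpy as np

rng = np.random.default_rng(0)

# The data really comes from two widely separated Gaussians...
data = np.concatenate([rng.normal(-10, 1, 5000), rng.normal(10, 1, 5000)])

# ...but the small-world model insists on a single Gaussian, whose best-fit
# parameters no quantity of further data from this source will overturn.
mu, sigma = data.mean(), data.std()
print(mu, sigma)  # mu ~ 0, sigma ~ 10

# The fitted model centres itself squarely where the data almost never is:
print(np.mean(np.abs(data - mu) < 5))  # ~0.0: essentially no data near mu
```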

And that's as far as I've read on the subject. Have the small-worlders ever responded to large-worlders' construction of universal priors? Have the large-worlders ever demonstrated that universal priors are more than a theoretical construction without practical application? Has "model checking" ever been analysed in large-world Bayesian terms?

comment by Alex Flint (alexflint) · 2011-03-16T09:15:22.821Z · LW(p) · GW(p)

I'm obviously new to this whole thing, but is this a largely undebated, widely accepted view on probabilities? That there are NO situations in which you can't meaningfully state a probability?

Actually, yes, but you're right to be surprised, because it's (to my mind at least) an incredible result. Cox's theorem establishes this as a mathematical result from the assumption that you want to reason quantitatively and consistently. Jaynes gives a great explanation of this in chapters 1 and 2 of his book "Probability Theory: The Logic of Science".

But how valid is this result when we knew nothing of the original distribution?

The short answer is that a probability always reflects your current state of knowledge. If I told you absolutely nothing about the coin or the distribution, then you would be entirely justified in assigning 50% probability to heads (on the basis of symmetry). If I told you the exact distribution over p then you would be justified in assigning a different probability to heads. But in both cases I carried out the same experiment -- it's just that you had different information in the two trials. You are justified in assigning different probabilities because Probability is in the mind. The knowledge you have about the distribution over p is just one more piece of information to roll into your probability.

With this in mind, do we still believe that it's not wrong (or less wrong? :D) to assume a normal distribution, make our calculations and decide how much you'd bet that the mean of the next 100,000 samples is in the range -100..100?

That depends on the probability that the coin flipper chooses a Cauchy distribution. If this were a real experiment then you'd have to take into account unwieldy facts about human psychology, physics of coin flips, and so on. Cox's theorem tells us that in this case there is a unique answer in the form of a probability, but it doesn't guarantee that we have time, resources, or inclination to actually calculate it. If you want to avoid all those kinds of complicated facts then you can start from some reasonable mathematical assumptions such as a normal distribution over p - but if your assumptions are wrong then don't be surprised when your conclusions turn out wrong.

Replies from: rstarkov, alexflint
comment by rstarkov · 2011-03-24T14:11:37.131Z · LW(p) · GW(p)

Thanks for this, it really helped.

it doesn't guarantee that we have time, resources, or inclination to actually calculate it

Here's how I understand this point, that finally made things clearer:

Yes, there exists a more accurate answer, and we might even be able to discover it by investing some time. But until we do, the fact that such an answer exists is completely irrelevant. It is orthogonal to the problem.

In other words, doing the calculations would give us more information to base our prediction on, but knowing that we can do the calculation doesn't change it in the slightest.

Thus, we are justified in treating this as "don't know at all", even though it seems that we do know something.

Probability is in the mind

Great read, and I think things have finally fit into the right places in my head. Now I just need to learn to guesstimate what the maximum entropy distribution might look like for a given set of facts :)

Well, that and how to actually churn out confidence intervals and expected values for experiments like this one, so that I know how much to bet given a particular set of knowledge.

Replies from: alexflint
comment by Alex Flint (alexflint) · 2011-03-24T15:19:35.030Z · LW(p) · GW(p)

Cool, glad it was helpful :)

Here is one interesting post about how to encourage our brains to output specific probabilities: http://lesswrong.com/lw/3m6/techniques_for_probability_estimates/

comment by Alex Flint (alexflint) · 2011-03-16T09:17:32.688Z · LW(p) · GW(p)

Actually I didn't explain that middle bit well at all. Just see http://lesswrong.com/lw/oi/mind_projection_fallacy/

comment by cousin_it · 2011-03-13T01:46:59.070Z · LW(p) · GW(p)

I.e. one that is calibrated so that if you pay your expected winnings every time and we perform this experiment lots of times then your average winnings will be zero - assuming I'm using the same source of unfair coins each time.

This may be tangential, but how do you run this experiment lots of times? If you just abort the runs where you get any tails among the initial 100 throws, then I think seeing 100 heads in a row doesn't mean anything much.

Replies from: alexflint, rstarkov, endoself
comment by Alex Flint (alexflint) · 2011-03-13T12:25:41.791Z · LW(p) · GW(p)

I see the point you're making about observation selection effects but surely in this case it doesn't flatten the posterior very much. Of all the times you see a coin come up heads 100 times in a row, most of them will be for coins with p(heads) close to 1, even if you are discarding all other runs. That's assuming you select coins independently for each run.

Replies from: alexflint
comment by Alex Flint (alexflint) · 2011-03-13T12:28:00.477Z · LW(p) · GW(p)

Hmm perhaps I mis-read the post. I was assuming he was picking a single coin and flipping it 100 times.

Replies from: othercriteria
comment by othercriteria · 2011-03-13T14:38:09.046Z · LW(p) · GW(p)

The description of the coin flips having a Binomial(n=?,p) distribution, instead of a Bernoulli(p) distribution, might be a cause of the mis-reading.

Replies from: rstarkov
comment by rstarkov · 2011-03-13T20:07:10.558Z · LW(p) · GW(p)

Perhaps - obviously each coin is flipped just once, i.e. Binomial(n=1,p), which is the same thing as Bernoulli(p). I was trying to point out that for any other n it would work the same as a normal coin, if someone were to keep flipping it.

comment by rstarkov · 2011-03-13T11:48:12.242Z · LW(p) · GW(p)

You take the evidence, and you decide that you pay X. Then we run it lots of times. You pay X, I pick a random coin and flip it. I pay your winnings. You pay X again, I pick again, etc. X is fixed.

comment by endoself · 2011-03-13T03:14:12.999Z · LW(p) · GW(p)

Your memory could be wiped of just the throw that you were asked to bet on, so that the 100 throws of heads do not have to be repeated. Equivalently, you could place all bets before any of them are evaluated.

comment by Timwi · 2011-03-12T22:46:27.705Z · LW(p) · GW(p)

My take on it is basically this: average over all possible distributions until further evidence comes in. (Preferably, let other people play the game first to gather the evidence at no cost to myself.)

If someone tells me a coin has an unknown binomial distribution, and we really genuinely don’t know anything about this distribution (not even the distribution of possible distributions), I take the set of all possible distributions and assume they are all equally likely. Since this set is symmetric (every head-biased coin is matched by an equally likely tail-biased one), the average is a 50:50 fair coin.

In your example, you throw not just one coin, but a different one each time. Put differently, the sequence of coins is a random variable whose distribution we don’t know. I therefore begin by assuming it is the average over all possible distributions; within that average distribution, the distribution of first coins is symmetric, so its average is 50:50, so for the first coin toss I’ll assume that it’s a 50:50.

Say the first coin toss now comes out tails. This provides me with a slight piece of evidence that your set of unfair coins may be biased towards containing more tail-preferring coins than head-preferring coins. Therefore, I will assume that the second coin is more likely to be a tail-preferring coin than a head-preferring one. I’d have to do the maths to find out exactly how much my ante would change.

With each new coin toss, I learn more about the likelihood of each distribution of coins within the set of all possible distributions.

I suspect that the level of indirection can actually be safely removed and we can regard the whole thing as a single random binary digit generator with a single distribution, about which we initially don’t know anything and gradually build up evidence. I further suspect that if the distribution of coins is uniformly distributed, and therefore symmetric, the sequence of coin tosses is equivalent to a single, fair coin. Once again, someone would have to do the maths to prove whether this is actually equivalent.
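
The second suspicion, at least, is easy to check by simulation (this sketch, with its arbitrary sample size, is just one way):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Draw a fresh coin p ~ Uniform(0, 1) for every toss and flip it once.
p = rng.uniform(size=n)
flips = rng.uniform(size=n) < p

print(flips.mean())                      # ~0.5, like a fair coin
print(np.mean(flips[:-1] == flips[1:]))  # ~0.5: consecutive tosses agree no
                                         # more often than chance
```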

Replies from: rstarkov
comment by rstarkov · 2011-03-12T23:10:15.962Z · LW(p) · GW(p)

Preferably, let other people play the game first to gather the evidence at no cost to myself.

For the record, this is not permitted.

My take on it is basically this: average over all possible distributions

It's easy to say this but I don't think this works when you start doing the maths to get actual numbers out. Additionally, if you really take ALL possible distributions then you're already in trouble, because some of them are pretty weird - e.g. the Cauchy distribution doesn't have a mean or a variance.

distribution about which we initially don’t know anything and gradually build up evidence

I'd love to know if there are established formal approaches to this. The only parts of statistics that I'm familiar with assume known distributions and work from there. Anyone?

Replies from: saturn, Perplexed
comment by saturn · 2011-03-13T03:37:26.845Z · LW(p) · GW(p)

If I'm not mistaken, the Cauchy distribution wouldn't be included because it's not supported on a bounded interval.

comment by Perplexed · 2011-03-13T04:09:13.797Z · LW(p) · GW(p)

I'd love to know if there are established formal approaches to this.

You should probably look at Jaynes's book "Probability Theory: The Logic of Science". In particular, I think that the discussion there dealing with the Widget Problem and with Laplace's Rule of Succession may be relevant to your question.

Replies from: rstarkov
comment by rstarkov · 2011-03-13T15:26:03.175Z · LW(p) · GW(p)

And just as it gets really interesting, that chapter ends. There is no solution provided for stage 4 :/

Replies from: Perplexed
comment by Perplexed · 2011-03-13T17:23:13.962Z · LW(p) · GW(p)

Odd. The printed book has another page and a half for that chapter, including the solution to Stage 4. (No surprises in the solution - same as stage 3 except you start with 40 fewer Green widgets.)

comment by Dorikka · 2011-03-12T21:14:50.178Z · LW(p) · GW(p)

I figure that people usually offer bets to make money, so if I assume that you are a selfish actor, your coins will come up tails practically all the time -- you have no incentive that I can think of to ever make them come out heads. Expected payout is $10 each time, so I will pay a maximum of $10 to play the game each time.

Or is your intent for this to be an "Omega tells you..." thought-experiment?

Replies from: rstarkov
comment by rstarkov · 2011-03-12T21:17:23.430Z · LW(p) · GW(p)

I have clarified my post to specify that for each flip, I pick a coin from this infinite pool at random. Suppose you also magically know with absolute certainty that these givens are true. Still $10?

Replies from: Dorikka
comment by Dorikka · 2011-03-12T22:34:24.183Z · LW(p) · GW(p)

You still decide what coins are in the infinite pool, right?

Replies from: rstarkov
comment by rstarkov · 2011-03-12T22:36:22.736Z · LW(p) · GW(p)

The properties of the pool are unknown to you, so you have to take into account the possibility that I've tuned them somehow. But you do know that the 100 coins I drew from that pool were drawn fairly and randomly.

Replies from: Dorikka, endoself
comment by Dorikka · 2011-03-13T03:20:49.452Z · LW(p) · GW(p)

Okay. I retract my permanent answer of $10.

comment by endoself · 2011-03-13T03:15:21.358Z · LW(p) · GW(p)

Would you have tried to bet us anyways if you had not landed heads 100 times?

Replies from: rstarkov
comment by rstarkov · 2011-03-13T11:53:40.925Z · LW(p) · GW(p)

If I were trying to make a profit then I'd need to know how much to charge for entry. If I could calculate that then yes, I'd offer the bet regardless of how many heads came out of 100 trials.

But this is entirely beside the point; the purpose of this thought experiment is for me to show which parts of bayesianism I don't understand and solicit some feedback on those parts.

In particular, a procedure that I could use to actually pick a break-even price of entry would be very helpful.

Replies from: endoself
comment by endoself · 2011-03-13T20:48:54.561Z · LW(p) · GW(p)

This is actually essential to the problem. If you would only bet me if the coins came up heads, you could make all the coins heavily biased toward tails. In the rare scenario that 100 coins happened to come up heads anyway, you could show this to me to try to trick me into accepting, when you know that the next coin, like all the others, is guaranteed to be biased toward tails.

For actually solving the problem, I am no expert, but I think Laplace's law of succession applies here. Laplace's law states that when the only information you have is that there are only two possible results of a process (such as heads and tails), plus the results of a number of trials that have already been done, the probability that the next result turns out a specific way is (s+1)/(n+2), where s is the number of times it happened that way in the past and n is the total number of trials so far. I am not sure it applies exactly here, because we may be working with a bit more information than this, but it might be correct.

In this case:

P(heads) = 101/102

EV = (101/102)*$1000 + (1/102)*$10 = $101010/102 = $990 5/17 ≈ $990.29
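
(A quick sanity check of this arithmetic in code, using exact fractions:)

```python
from fractions import Fraction

# Laplace's rule of succession with s = n = 100:
p_heads = Fraction(100 + 1, 100 + 2)  # 101/102

ev = p_heads * 1000 + (1 - p_heads) * 10
print(ev, float(ev))  # 16835/17 ~990.29, i.e. $990 5/17
```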

You can read more about Laplace's law, including a derivation, in chapter 6 of Probability Theory: The Logic of Science by E. T. Jaynes.
