# Applying Bayes to an incompletely specified sample space

post by abstractapplic · 2018-07-29T17:33:53.978Z · LW · GW · 5 comments

## Contents

Introduction
Unworkable worked example
Workable worked example
Various nuances


Epistemic Status: Novel mathematics from an amateur, presented with unjustified confidence. Submitted here and not in a mathematical journal because I think LW will be better and faster at explaining where I’ve gone wrong and/or what common approach I’ve reinvented.

## Introduction

Bayesian reasoning is pretty great, but it has its limits. One of these is that it cannot be applied to an incomplete sample space; in other words, if you assigned 66% of your prior to possibility A, 33% of your prior to possibility B, and the remaining 1% to ??? (i.e. Everything Else), the techniques we know and love fundamentally fail to function.

Or so people seem to think. I think otherwise: in this post, I will present a new (as far as I know) method for applying Bayes, which can be made to work even if you haven’t fully specified the distribution you’re applying it to.

## Unworkable worked example

Let’s say you assign your prior as above: 66% to A, 33% to B, 1% to ???. Then, you get some evidence: an event occurs which would have had a 1/11 chance of happening if A were true, and a 6/11 chance of happening if B were true.

So you multiply out your probabilities:

For A: 0.66*1/11 = 0.06

For B: 0.33*6/11 = 0.18

and for ??? . . .

Well, since we don’t know what ??? is, we can’t say exactly how it updates. But we can impose limits. The probability of the event given ??? cannot exceed 100%, and cannot be lower than 0%. As a result, we can produce upper and lower bounds.

For ???: 0.01*[0 to 1] = [0 to 0.01]

Then, we renormalise for upper and lower bounds on ???:

P(A): 0.06 / (0.06+0.18+[0 to 0.01])

P(B): 0.18 / (0.06+0.18+[0 to 0.01])

P(???): [0 to 0.01] / (0.06 + 0.18 + [0 to 0.01])

So:

P(A): 0.06 / [0.24 to 0.25]

P(B): 0.18 / [0.24 to 0.25]

P(???): [0 to 0.01] / [0.24 to 0.25]

Producing final probabilities:

P(A): 25% to 24%

P(B): 75% to 72%

P(???): 0% to 4%
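The bounding procedure above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `interval_update` and its argument names are invented here, and the sketch assumes the known hypotheses are mutually exclusive and do not overlap with ???.

```python
def interval_update(priors, likelihoods, p_unknown):
    """Bayes update when part of the prior sits on an unspecified '???'.

    priors:      known hypothesis -> prior probability
    likelihoods: known hypothesis -> P(evidence | hypothesis)
    p_unknown:   prior probability assigned to '???'

    Returns hypothesis -> (posterior if P(evidence | ???) = 0,
                           posterior if P(evidence | ???) = 1).
    """
    weights = {h: priors[h] * likelihoods[h] for h in priors}
    known_total = sum(weights.values())
    low_total = known_total               # ??? contributes nothing
    high_total = known_total + p_unknown  # ??? contributes its whole prior
    bounds = {h: (w / low_total, w / high_total) for h, w in weights.items()}
    bounds["???"] = (0.0, p_unknown / high_total)
    return bounds


# The example from the text: 66% on A, 33% on B, 1% on ???.
posterior = interval_update(
    priors={"A": 0.66, "B": 0.33},
    likelihoods={"A": 1 / 11, "B": 6 / 11},
    p_unknown=0.01,
)
for name, (at_zero, at_one) in posterior.items():
    print(f"P({name}): {at_zero:.0%} to {at_one:.0%}")
```

The two numbers in each pair are the endpoints of the range. For A and B the first is the larger one, because giving ??? more weight can only shrink their share.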

“But wait!”, you may say. “This calculation is completely useless! You may be able to get an ‘answer’ out of it, but in the real world, probability exists to support decisions; and your ‘answer’ will throw positive and negative infinities if you try to multiply it to get Expected Value.”

There are two situations where this criticism doesn’t apply. The first is when we are only interested in the ratios between known parts of the distribution: the ratio between the probabilities of A and B remains constant regardless of how ??? behaves. So if you’re contemplating a bet that pays one way if A turns out to be true, another way if B is true, and leaves you no better or worse off in all other cases, then this method can help you plan for the future.

This first situation is artificial, and rests on fragile assumptions. For one thing, it requires that ??? not contain any possibility of A or B, which is odd if you think about it: we don’t know anything about the contents of ???, except that they don’t overlap with defined parts of the sample space? For another, it discounts the possibility that ??? contains outcomes close enough to A or B that they convince whoever is judging your bet, or extreme enough that they render the bet irrelevant.

The second situation is more interesting: some domains have hard upper or lower bounds on the effects of outcomes.

## Workable worked example

Amy is a proud holder of several thousand units of the cryptocurrency ExampleCoin, currently valued at $12 apiece. This is much less than it was last month: a possible bug in its security has been found and publicised, but not fully investigated, and as a result the price is in flux. She is trying to determine whether to cut her losses and sell now.

Amy is risk-neutral, willing to put aside sunk costs, and sufficiently smart and well-positioned that she can reasonably expect to outperform the market at valuing her holdings.

Amy assigns a 70% probability to the security concern being straightforwardly invalid (in which case she expects the value to rise to about $20), a 25% probability to the concern being straightforwardly valid (in which case she expects the value to drop to about $4), and a 5% probability to ???.

Amy doesn’t know what’s hiding inside ???, but she knows she can’t lose more than all the money she has invested. The worst case scenario – given that she chooses not to sell – is that ??? is entirely composed of outcomes which would cause the selling price to drop to zero; but even in that case, the Expected Value of her refusing to sell is 0.7*$20 + 0.25*$4 + 0.05*$0 = $15 per coin. This beats the current selling price: therefore, absent further information, she can be certain she should hold.

Then, Amy runs a meaningful-but-imperfect test on the network. The result she gets would have had a 4/5 chance of occurring if the concern were valid, but only a 2/7 chance if it were not. Her new probability ratios are thus:

Valid : Invalid : ???

0.25*4/5 : 0.7*2/7 : 0.05*[0 to 1]

0.2 : 0.2 : [0 to 0.05]

This normalises to:

P(Valid): 50% to 44.44%

P(Invalid): 50% to 44.44%

P(???): 0% to 11.11%

Now Amy is less than certain about how to proceed. The worst case scenario is that everything in ??? necessarily implied her result, and would reduce the selling price to $0: in this case, the Expected Value of holding is 0.4444*$4 + 0.4444*$20 + 0.1111*$0, which works out to about $10.67, less than the current selling price. But on the other hand, P(???) may have left the update with less than the full 11.11%, and/or it may have an EV contribution much greater than zero. For example, ??? could be composed entirely of events where the security concern was valid but easily patched (let’s say that outcome causes an average price of $18), and could have updated such that we have a 45%/45%/10% Valid/Invalid/??? posterior: in that case, the EV-maximising choice would be to hold.
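Amy's worst-case check, before and after her first test, can be written out as a short Python sketch. The helper name `worst_case_hold_value` and its parameters are hypothetical, introduced only for illustration; the sketch assumes the lowest possible price under ??? is a known floor ($0 here).

```python
def worst_case_hold_value(weights, values, unknown_max, unknown_floor=0.0):
    """Lowest expected price per coin if Amy holds.

    weights:       unnormalised weight (prior * likelihood) of each known outcome
    values:        expected price under each known outcome
    unknown_max:   largest weight ??? can carry (its likelihood taken as 1)
    unknown_floor: lowest price that any outcome inside ??? could produce
    """
    known_weight = sum(weights.values())
    known_value = sum(weights[k] * values[k] for k in weights)
    # The expected value is monotonic in the weight given to ???, so the
    # minimum sits at one end of the range [0, unknown_max].
    at_zero = known_value / known_weight
    at_max = (known_value + unknown_max * unknown_floor) / (known_weight + unknown_max)
    return min(at_zero, at_max)


values = {"valid": 4.0, "invalid": 20.0}
price_now = 12.0

before_test = worst_case_hold_value({"valid": 0.25, "invalid": 0.70}, values, 0.05)
after_test = worst_case_hold_value(
    {"valid": 0.25 * 4 / 5, "invalid": 0.70 * 2 / 7}, values, 0.05
)

print(f"worst case before the test: ${before_test:.2f} (sell price ${price_now:.2f})")
print(f"worst case after the test:  ${after_test:.2f} (sell price ${price_now:.2f})")
```

Before the test the worst case still beats the sell price, so holding is safe. After the test it falls below the sell price, which is why the method no longer gives a definite answer.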

Luckily, Amy discovers a second piece of evidence: an event which would have happened with 1/5 probability if the concern were valid, but 4/5 probability if it were not. For convenience’s sake, she starts from the pre-normalised posterior from the last update.

Valid : Invalid : ???

0.2*1/5 : 0.2*4/5 : [0 to 0.05]*[0 to 1]

0.04 : 0.16 : [0 to 0.05]

Which normalises to:

P(Valid): 20% to 16%

P(Invalid): 80% to 64%

P(???): 0% to 20%
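Carrying the pre-normalised weights from one update to the next, as Amy does, might look like this in Python. The function names are made up for this sketch, and it takes the ceiling case in which ??? predicted every observed result with certainty, so its maximum weight never shrinks.

```python
def apply_evidence(weights, likelihoods):
    """Multiply each known outcome's unnormalised weight by its likelihood."""
    return {k: weights[k] * likelihoods[k] for k in weights}


def normalise(weights, unknown_max):
    """Turn unnormalised weights into (likelihood-0, likelihood-1) ranges."""
    known = sum(weights.values())
    ranges = {k: (w / known, w / (known + unknown_max)) for k, w in weights.items()}
    ranges["???"] = (0.0, unknown_max / (known + unknown_max))
    return ranges


weights = {"valid": 0.25, "invalid": 0.70}  # Amy's priors on the known outcomes
unknown_max = 0.05                          # prior on ???, never multiplied down

evidence = [
    {"valid": 4 / 5, "invalid": 2 / 7},  # first test
    {"valid": 1 / 5, "invalid": 4 / 5},  # second piece of evidence
]
for likelihoods in evidence:
    weights = apply_evidence(weights, likelihoods)

ranges = normalise(weights, unknown_max)
for name, (at_zero, at_one) in ranges.items():
    print(f"P({name}): {at_zero:.0%} to {at_one:.0%}")
```

Normalising only at the end avoids having to carry interval-valued probabilities through each intermediate step.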

Even if ??? predicted a 100% chance of the results of both experiments, and predicted the value of the coin would drop to 0, Amy could still be confident she should hold: 0.16*$4 + 0.64*$20 + 0.2*$0 = $13.44, which is again greater than the $12 she could sell for.

Note the asymmetry: in the presence of limited downside and unlimited upside, the method can tell you which risks you should definitely take, but never which you should definitely eschew. (Similarly, in a domain with limited upside and unlimited downside, the method would only be able to tell you what you definitely shouldn’t do.)

Even if Amy’s first experiment had produced results that had a 90% chance of happening if the flaw were straightforwardly valid, and a 10% chance if it were not, she still couldn’t be certain she should sell.

Valid : Invalid : ???

0.25*0.9 : 0.7*0.1 : 0.05*[0 to 1]

0.225 : 0.07 : [0 to 0.05]

Normalises to:

P(Valid): 76.27% to 65.22%

P(Invalid): 23.73% to 20.29%

P(???): 0% to 14.49%

In this scenario, it is obvious that the sensible answer is to sell. However, Amy cannot be certain of this, since ??? could have strictly implied the results of the test, and (for all we know) it could represent a tenfold increase in price. The best she can do after crunching the numbers is to work backwards, deriving the minimum value of ??? that would be needed for her to hold, and then see if it complies with common sense.

If P(???) reaches its maximum of 0.1449, then the Expected Value of holding is 0.6522*$4 + 0.2029*$20 + 0.1449*V(???) per coin, where V(???) is the expected value of the coin given that ??? occurs. For this to exceed $12 (i.e., for it to become sensible to hold), V(???) would have to exceed about $36.80, which is nearly double the expected value of the coin assuming the security concern is straightforwardly invalid. Amy could justifiably decide that this is implausible, and therefore decide to sell.
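The working-backwards step can be sketched as a small Python function. The name `break_even_unknown_value` is hypothetical. As an illustration, the example call uses the unnormalised weights from Amy's actual first test (0.2 : 0.2 : [0 to 0.05]), the case where the worst-case check left her undecided, with ??? taken at its maximum weight.

```python
def break_even_unknown_value(weights, values, unknown_weight, price_now):
    """Average price under ??? at which holding is worth exactly price_now.

    Solves (sum of weight * value over known outcomes + unknown_weight * V)
           / (total weight) = price_now
    for V.
    """
    total_weight = sum(weights.values()) + unknown_weight
    known_value = sum(weights[k] * values[k] for k in weights)
    return (price_now * total_weight - known_value) / unknown_weight


threshold = break_even_unknown_value(
    weights={"valid": 0.2, "invalid": 0.2},
    values={"valid": 4.0, "invalid": 20.0},
    unknown_weight=0.05,  # ??? at its maximum weight
    price_now=12.0,
)
print(f"holding beats selling only if V(???) exceeds ${threshold:.2f}")
```

Under these assumptions the threshold comes out at the current price of $12. The $18 "valid but easily patched" scenario would clear it, while a ??? made up of price collapses would not.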

## Various nuances

There are several nontrivial implications of this method, which I briefly explain below.

Too much uncertainty breaks the math. All else being equal, you’re more likely to get a definite answer if you end with a small upper bound for P(???); and you’re more likely to end that way if you start with a small P(???).

Too much evidence breaks the math. You may have noticed how the upper bound of P(???) continued to creep up as Amy applied successive updates. I’m sorry to say this is an inherent consequence of the method.

Highly-specific evidence breaks the math. Relatedly: if Amy’s tests had been more specific, and produced outcomes which happen with O(0.05%) probability instead of O(50%) probability, P(???) would instantly have grown to dominate the distribution.
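The three points so far can be checked numerically with a rough Python sketch. It makes a simplifying assumption that is not in the text: every known hypothesis assigns the same probability to each observed result, so the known part of the distribution shrinks by that factor on every update while ??? is allowed to keep its full prior. The function name `unknown_ceiling` is invented for this illustration.

```python
def unknown_ceiling(p_unknown, evidence_probs):
    """Upper bound on P(???) after a sequence of updates.

    p_unknown:      prior probability of ???
    evidence_probs: probability each observed result had under the known
                    hypotheses (assumed equal across them for simplicity)
    """
    known = 1.0 - p_unknown
    for p in evidence_probs:
        known *= p  # the known mass shrinks, the ceiling for ??? does not
    return p_unknown / (known + p_unknown)


# Successive 50%-probability results push the ceiling up step by step.
for n in range(4):
    print(f"{n} updates: P(???) up to {unknown_ceiling(0.05, [0.5] * n):.1%}")

# A single very specific result (0.05% probability) lets ??? dominate at once.
print(f"one specific result: P(???) up to {unknown_ceiling(0.05, [0.0005]):.1%}")
```

When the known hypotheses disagree about the evidence, the shrink factor is their prior-weighted average likelihood rather than a single number, but the ceiling rises in the same way.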

You need to be able to decide a priori what counts as evidence. Standard Bayesianism is robust to useless evidence. If you roll a die and flip a coin, but only look at the die, you can update your guess about the coin if you want to: P(Heads|6) = P(Heads) * P(6|Heads) / P(6). But per the first three points, every update grows the upper bound of P(???); and since ??? is undefined, it’s hard to say what is and isn’t irrelevant to a distribution containing it.

You probably need upper and/or lower bounds. Ideally both; without them, you need to rely on post-hoc wrangling of V(???).

You probably need to regularly get close to those bounds. I didn’t make Amy a crypto investor to be topical: she needed to be in a high-volatility field to produce a meaningful answer. If she’d been a regular investor, weighing a 61% chance of her stock going down by 0.4% vs a 37% chance of her stock going up by 0.8% vs a 2% chance of ???, the potential effects of ??? would have dominated, and the theoretical bound of “you can’t lose more money than you invest” would have been useless.
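The low-volatility case can be worked through in Python to see how badly the bound performs there. The figures are the ones from the paragraph above, applied to a hypothetical stake of 100. The variable names are invented for this sketch, and the only bound on ??? is the floor of losing the whole stake.

```python
stake = 100.0
probs = {"down": 0.61, "up": 0.37}
values = {"down": stake * (1 - 0.004), "up": stake * (1 + 0.008)}
p_unknown = 0.02
unknown_floor = 0.0  # "you can't lose more money than you invest"

known_value = sum(probs[k] * values[k] for k in probs)

# Expected value if ??? is ignored and the known outcomes are renormalised.
ev_ignoring_unknown = known_value / sum(probs.values())

# Guaranteed lower bound if everything in ??? wipes out the stake.
ev_worst_case = known_value + p_unknown * unknown_floor

edge = ev_ignoring_unknown - stake
worst_case_loss = stake - ev_worst_case

print(f"edge from the known outcomes: {edge:+.3f}")
print(f"worst-case expected loss:     {-worst_case_loss:+.3f}")
```

The edge from the known outcomes is a few hundredths of a percent of the stake, while the worst-case loss allowed by the bound is close to 2%. The bound is too loose to settle the decision either way.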

To sum up: the method allows us to overcome limited uncertainty when presented with limited evidence. It works best in contexts where outcomes have strong upper and lower bounds, and where the stakes are as high as they could plausibly be given those bounds. Obvious areas for application are high-uncertainty business decisions, existential risk, medicine and warfare.

With that in mind, I present the final limitation of the method:

It’s just something I made up and haven’t tried to use in real life yet. I’m presenting this as a curiosity and possible starting point, not a tried-and-tested strategy. Please don’t use my math to kill people or go bankrupt.

Comments sorted by top scores.

comment by ryan_b · 2018-07-30T20:39:54.813Z · LW(p) · GW(p)

I don't have any comments about the math, but your post makes me think of Taleb. His whole shtick is how we (collectively) think very badly about unlimited-upside or unlimited-downside asymmetric problems, even with modern statistical tools. I have recently started poking around again here.

There are of course also his famous books, like Black Swan or Antifragile. I found his style interfered with the message, so I abandoned ship and went for the papers instead - but if you have a strong stomach for intellectual preening, it might be worth reading for you. He wrote a few essays available on Edge, too.

comment by dvasya · 2018-07-30T21:30:30.981Z · LW(p) · GW(p)

Check the chapter on the A_p distribution in Jaynes' book.

Replies from: Oscar_Cunningham
comment by Oscar_Cunningham · 2018-07-30T22:01:44.566Z · LW(p) · GW(p)

I've always thought that chapter was a weak point in the book. Jaynes doesn't treat probabilities of probabilities in quite the right way (for one thing they're really probabilities of frequencies). So take it with a grain of salt.

Replies from: dvasya
comment by dvasya · 2018-07-31T04:25:25.705Z · LW(p) · GW(p)

I agree, it did seem like one of the more-unfinished parts. Still, perhaps a better starting point than nothing at all?

comment by jessicata (jessica.liu.taylor) · 2018-07-31T00:46:17.975Z · LW(p) · GW(p)

I think the proper generalization of this is convex sets of probability distributions over the union of the original sample space and .