Uncertainty

post by Vaniver · 2011-11-29T23:12:11.341Z · LW · GW · Legacy · 15 comments

Contents

  Relevance 
  Continuous Distributions 
  Updating 
  Conjugate Priors

This is part of a sequence on decision analysis.

Decision-making under certainty is pretty boring. You know exactly what each choice will do, and so you order the outcomes based on your preferences, and pick the action that leads to the best outcome.

Human decision-making, though, happens in the presence of uncertainty. Decision analysis - careful decision-making - is all about coping with the existence of uncertainty.

Some terminology: a distinction is something uncertain; an event is one of the possible outcomes of that distinction; a prospect is an event that you have a personal stake in; and a deal is a distinction over prospects. This post will focus on distinctions and events. If you're comfortable with probability, just jump to the four bolded questions and make sure you get the answers right. Deals are the interesting part, but require this background.

I should say from the very start that I am quantifying uncertainty as "probability." There is only one 800th digit of Pi (in base 10), other people already know it, and it's not going to change. I don't know what it is, though, and so when I talk about the probability that the 800th digit of Pi is a particular number, what I'm describing is what's going on in my head. Right now, my map is mostly blank (I assign probability .1 to each of the digits 0 through 9); once I look it up, the map will change but the territory will not. I'll use uncertainty and probability interchangeably throughout this post.

The 800th digit of Pi (in base 10) is a distinction with 10 possible events, 0 through 9. To be sensible, distinctions should be clear and unambiguous. A distinction like "the temperature tomorrow" is unclear- the temperature where, and at what time tomorrow? A distinction like "the maximum temperature recorded by the National Weather Service at the Austin-Bergstrom International Airport in the 24 hours before midnight (EST) on 11/30/2011" is unambiguous. Think of it like PredictionBook- you want to be able to create this distinction such that anyone could come across it and know what you're referring to.

Possibilities can be discrete or continuous. There are only a finite number of possible digits for the 800th digit of Pi, but the temperature is continuous and unbounded.1 A biased coin has a continuous parameter p that refers to how likely it is to land on heads in certain conditions; while that's bounded by 0 and 1, there are an infinite number of possibilities in between.

For now, let's focus on distinctions with discrete possibilities. Suppose we have four cards- two blue and two red. We shuffle the cards and draw two of them. What is the probability that both drawn cards will be red? (answer below)

This is a simple problem, but one that many people get wrong, so let's step through it as carefully as possible. There are two distinctions here- the color of the first drawn card, and the color of the second drawn card. For each distinction, the possible events are blue (B) and red (R). The probability that the first card is red we'll express as P(R|&). That should be read as "probability of drawing a red card given background knowledge." The "&" refers to all the knowledge the problem has given us; sometimes it's left off and we just talk about P(R). There are four possible cards, two of which are red, and so P(R|&)=2/4=1/2.

Now we need to figure out the probability that the second card is red. We'll express that as P(R|R&), which means "the probability of drawing a red card given background knowledge and a drawn red card." There are three cards left, one of which is red, and so the probability is now 1/3.

But what we're really interested in is P(RR|&), "the probability of drawing two red cards given background knowledge." We can decompose this joint event into the two distinctions: P(RR|&)=P(R|&)*P(R|R&)=1/2*1/3=1/6. Probabilities are conjoined by multiplication.

Notice that, for the first two cards drawn, there are four events: RR, RB, BR, and BB. Those events have different probabilities: 1/6, 1/3, 1/3, and 1/6. Those represent the joint probability distribution of the first two cards, and the joint probability distribution contains all the information we need. If you're interested in the chance that the second card is blue with no information about the first (P(*B|&)), you add up RB and BB to get 1/3+1/6=1/2 (which is what you should have expected it to be).

Bayes' Rule, by the way, is easy to see when discussing events. If I wanted to figure out P(RB|*B&), what I want to do is take the event RB (probability 1/3) and make it more likely by dividing out the probability of my current state of knowledge (that the second card was blue, probability 1/2). Alternatively, I could consider the event RB as a fraction of the set of events that fit my knowledge, which is both RB and BB- (1/3)/(1/3+1/6)=2/3.
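If you'd rather check all of this by brute force, here's a quick sketch in Python (the names and structure are my own, just for illustration) that enumerates the twelve equally likely ordered draws and recovers each of these numbers:

```python
from fractions import Fraction
from itertools import permutations

# Four cards: two red, two blue. All ordered two-card draws are equally likely.
cards = ['R', 'R', 'B', 'B']
draws = list(permutations(cards, 2))  # 4*3 = 12 ordered draws

def prob(event):
    """Fraction of the equally likely draws satisfying `event`."""
    return Fraction(sum(1 for d in draws if event(d)), len(draws))

print(prob(lambda d: d == ('R', 'R')))   # P(RR|&) = 1/6
print(prob(lambda d: d[1] == 'B'))       # P(*B|&) = 1/2
# Bayes' Rule: P(RB|*B&) = P(RB|&) / P(*B|&)
print(prob(lambda d: d == ('R', 'B')) / prob(lambda d: d[1] == 'B'))  # 2/3
```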

Relevance

Most people who get the question about cards wrong get it wrong because they square 1/2 to get 1/4, forgetting that the second card depends on the first. Since there's a limited supply of cards, as soon as you draw one you can be more certain that the next card isn't that color.

Dependence is distinct from causality. If I hear the weatherman claim that it will rain with 50% probability, that will adjust my certainty that it will rain, even though the weatherman can't directly influence whether or not it will rain. Some people use the word relevance instead, as it's natural to think that the weatherman's prediction is relevant to the likelihood of rain but may not be natural to think that the chance of rain depends on the weatherman's prediction.

Relevance goes both ways. If the weatherman's prediction gives me knowledge about whether or not it will rain, then knowing whether or not it rained gives me knowledge about what the weatherman's prediction was. Bayes' Rule is critical for maneuvering through relevant distinctions. Suppose the weatherman could give only two predictions: Sunny or Rainy. If he predicts Sunny, it will rain with 10% probability. If he predicts Rainy, it will rain with 50% probability. If it rains 20% of the time, how often does he predict Rainy? (answer below)

Suppose it rains. What's the chance that the weatherman predicted Rainy? (answer below)

For the first question: if he predicts Rainy a fraction q of the time, then .5q+.1(1-q)=.2, so q=.25. The second is a simple application of Bayes' Rule: P(Rainy|Rain)=P(Rain|Rainy)P(Rainy)/P(Rain)=(.5)(.25)/(.2)=5/8.

Alternatively, we can figure out the probabilities of the four elementary events: P(Rainy,Rain)=.125, P(Rainy,Sun)=.125, P(Sunny,Rain)=.075, P(Sunny,Sun)=.675. If we know it rained and want to know if he predicted Rainy, we care about P(Rainy,Rain)/(P(Rainy,Rain)+P(Sunny,Rain)).

This can get very complicated if there are a large number of events or relevant distinctions, but software exists to solve that problem.
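As a small illustration of that kind of bookkeeping, here's a sketch in Python (the probabilities come from the problem above; the naming is my own):

```python
# Joint distribution over (prediction, weather). P(Rainy) = .25 solves
# .5*q + .1*(1 - q) = .2, the constraint that it rains 20% of the time.
joint = {
    ('Rainy', 'Rain'): 0.25 * 0.5,   # 0.125
    ('Rainy', 'Sun'):  0.25 * 0.5,   # 0.125
    ('Sunny', 'Rain'): 0.75 * 0.1,   # 0.075
    ('Sunny', 'Sun'):  0.75 * 0.9,   # 0.675
}

# Condition on Rain: keep the matching cells and renormalize.
p_rain = sum(p for (pred, wx), p in joint.items() if wx == 'Rain')
print(joint[('Rainy', 'Rain')] / p_rain)  # 0.625 = 5/8
```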

Continuous Distributions

Suppose, though, that you don't have just two events to assign probability to. Instead of being uncertain about whether or not it will rain, I might be uncertain about how much it will rain, conditioned on it raining.2 If I try to elicit a probability for every possible amount, that'll take me a long time (unless I bin the heights, making it discrete, which still might take far longer or be far harder than I can deal with, if there are lots of bins).

In that case, I would express my uncertainty as a probability density function (pdf) or cumulative distribution function (cdf). The first is the probability density at a particular value, whereas the second is the density integrated from the beginning of the domain to that value. To get a probability from a density, you have to integrate. A pdf can take any non-negative value and any shape over the domain, though it has to integrate to 1, while a cdf has a minimum of 0, a maximum of 1, and is non-decreasing.

Let's take the example of the biased coin. To make it more precise, since coin flips are messy and physical, suppose I have some random number generator that uniformly generates any real number between 0 and 1, and a device hooked up to it with an unknown threshold value p between 0 and 1.3 When I press a button, the generator generates a random number, hands it to the device, which then shows a picture of heads if the number is below or equal to the threshold and a picture of tails if the number is above the threshold. I don't get to see the number that was generated- just a head or tail every time I press the button.

I begin by being uncertain about the threshold value, except knowing its domain. I assign a uniform prior- I think every threshold value between 0 and 1 is equally likely. Mathematically, that means my pdf is P(p=x)=1. I can integrate that from 0 to y to get a cdf of C(p≤y)=∫1dx=y. Like we needed, the pdf integrates to 1, and the cdf has a minimum of 0, a maximum of 1, and is non-decreasing. From those, we can calculate my certainty that the threshold value is in a particular range (by integrating the pdf over that range) or at any particular point (0, because it's an integral of zero width).
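If you'd rather check those properties numerically than by integration, here's a sketch (the grid size and names are arbitrary choices of mine):

```python
import numpy as np

# Discretize the threshold's domain [0, 1] into a fine grid.
x = np.linspace(0, 1, 1001)
dx = x[1] - x[0]
pdf = np.ones_like(x)        # uniform prior: P(p=x) = 1

print(pdf.sum() * dx)        # ~1: the pdf integrates to 1
cdf = np.cumsum(pdf) * dx    # C(p<=y) = y, up to discretization error
print(cdf[0], cdf[-1])       # ~0 and ~1, and cdf is non-decreasing

# Certainty that the threshold lies in a particular range, e.g. [0.2, 0.5]:
mask = (x >= 0.2) & (x <= 0.5)
print(pdf[mask].sum() * dx)  # ~0.3
```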

Updating

Now we press the button, see something, and need to update our uncertainty (probability distribution). How should we do that?

Well, by Bayes' rule of course! But I'll do it in a somewhat roundabout way, to give you some more intuition for why the rule works. Suppose we saw heads. For each possible threshold value x, we know how likely heads was: x, the threshold value itself. We can now compute the joint probability density of (heads) and (p=x) by multiplying the likelihood and the prior together, and x times 1 = x. So my pdf is now P(p=x)=x and my cdf is C(p≤y)=.5y^2.

Well, not quite. My pdf doesn't integrate to 1, and my cdf, while it does have a min at 0, doesn't have a max of 1. I need to renormalize- that is, divide by the chance that I saw heads in the first place. That was 1/2, and so I get P(p=x)=2x and C(p≤y)=y^2 and everything works out. If I saw tails, my likelihood is instead 1-p, and that propagates through to P(p=x)=2-2x and C(p≤y)=2y-y^2.
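The same update can be done numerically, which is a useful sanity check once the integrals get less friendly. A sketch (again with my own names, on the grid from before):

```python
import numpy as np

x = np.linspace(0, 1, 1001)
dx = x[1] - x[0]
prior = np.ones_like(x)            # uniform prior over the threshold

def update(pdf, saw_heads):
    """Multiply by the likelihood of the observation, then renormalize."""
    likelihood = x if saw_heads else 1 - x
    posterior = likelihood * pdf
    return posterior / (posterior.sum() * dx)

after_heads = update(prior, True)   # ~2x
after_tails = update(prior, False)  # ~2-2x
print(after_heads[500], after_tails[500])  # both ~1.0 at x = 0.5
```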

Suppose my setup were even less helpful. Instead of showing heads or tails, it instead generates two numbers, computes heads or tails for each number separately, and then prints out either "S" if both results were the same or "D" if the results were different. If I start with a uniform prior, what will my pdf and cdf on the threshold value p be after I see S? If I saw D instead? (If you don't know calculus, don't worry- most of the rest of this sequence will deal with discrete events.)

I recommend giving it a try before checking, but the pdf for S is (3/2)(1 - 2x + 2x^2) and for D is 6(x - x^2). (cdfs: x^3 - (3/2)x^2 + (3/2)x for S, and 3x^2 - 2x^3 for D.)
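You can verify those answers with the same numerical machinery; the likelihood of S at threshold x is x^2 + (1-x)^2, and of D is 2x(1-x):

```python
import numpy as np

x = np.linspace(0, 1, 1001)
dx = x[1] - x[0]

like_S = x**2 + (1 - x)**2     # both flips heads, or both tails
like_D = 2 * x * (1 - x)       # one of each, in either order

# Posterior from a uniform prior is just the renormalized likelihood.
post_S = like_S / (like_S.sum() * dx)
post_D = like_D / (like_D.sum() * dx)

print(np.allclose(post_S, 1.5 * (1 - 2*x + 2*x**2), atol=1e-2))  # True
print(np.allclose(post_D, 6 * (x - x**2), atol=1e-2))            # True
```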

Conjugate Priors

That's a lot of work to do every time you get information, though. If you pick what's called a conjugate prior, updating is simple, whereas it requires multiplication and integration for an arbitrary prior. The uniform prior is a conjugate prior for the simple biased coin problem, because uniform is a special case of the Beta distribution. You can use Beta(heads+1, tails+1) as your posterior distribution for any number of heads and tails that you see, and the math is already done for you. Conjugate priors are a big part of doing continuous Bayesian analysis in practice, but won't be too relevant to the rest of this sequence.
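For the coin device, that looks something like the following sketch, using scipy's Beta distribution (the counts here are made up for illustration):

```python
from scipy.stats import beta

# Uniform prior is Beta(1, 1); after h heads and t tails the posterior
# is Beta(h + 1, t + 1) -- no multiplying and integrating required.
h, t = 7, 3  # illustrative counts, not from the post
posterior = beta(h + 1, t + 1)

print(posterior.pdf(0.5))   # posterior density at p = 0.5
print(posterior.cdf(0.5))   # posterior probability that p <= 0.5
print(posterior.mean())     # (h + 1) / (h + t + 2) = 8/12
```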

 


1. The temperature as recorded by the National Weather Service is not continuous and is, in practice, bounded. (The NWS will only continue existing for some temperature range, and even if a technical error caused the NWS to record a bizarre temperature, they're limited by how their system stores numbers.)

2. I would probably narrow my prediction down to the height of the water in a graduated cylinder set in a representative location.

3. In case you're wondering, this sort of thing is fairly easy to create with a two-level quantum system and thus get "genuine" randomness.

15 comments

Comments sorted by top scores.

comment by AShepard · 2011-12-01T21:03:09.431Z · LW(p) · GW(p)

I'm having difficulties with your terminology. You've given special meanings to "distinction", "prospect", and "deal" that IMO don't bear any obvious relationship to their common usage ("event" makes more sense). Hence, I don't find those terms helpful in evoking the intended concepts. Seeing "A deal is a distinction over prospects" is roughly as useful to me as seeing "A flim is a fnord over grungas". In both cases, I have to keep a cheat-sheet handy to understand what you mean, since I can't rely on an association between word and concept that I've already internalized. Maybe this is accepted terminology that I'm not aware of?

Replies from: Vaniver
comment by Vaniver · 2011-12-01T22:29:42.770Z · LW(p) · GW(p)

I'm having difficulties with your terminology.

I'm not sure yet how much the terminology will pop up in future articles (one of the pitfalls of posting them as you go). I don't think it will matter much, but if future posts are unclear, point out where the language is problematic and I'll try to make things clearer.

comment by Psychohistorian · 2011-11-30T20:43:13.662Z · LW(p) · GW(p)

While the probabilistic reasoning employed in the card question is correct and fits in with your overall point, it's rather labor-intensive to actually think through.

In order to get two red cards, you need to pick the right pair of cards. Only one pair will do. There are six ways to pick a pair of cards out of a group of 4 (when, as here, order doesn't matter). Therefore, the odds are 1/6, as one out of the six possible pairs you'll pick will be the correct pair.

Similarly, we know the weatherperson correctly predicts 12.5% of days that will be rainy. We know that 20% of days will actually be raining. That gives us "12.5/20 = 5/8" pretty quickly. Grinding our way through all the P(X | ~X) representation makes a simple and intuitive calculation look really intimidating.

I'm not entirely sure of your purpose in this sequence, but it seems to be to improve people's probabilistic reasoning. Explaining probabilities through this long and detailed method seems guaranteed to fail. People who are perfectly comfortable with such complex explanations generally already get their application. People who are not so comfortable throw up their hands and stick with their gut. I suspect that a large part of the explanation of mathematical illiteracy is that people aren't actually taught how to apply mathematics in any practical sense; they're given a logically rigorous and formal proof in unnecessary detail which is too complex to use in informal reasoning.

Replies from: malthrin, Vaniver
comment by malthrin · 2011-12-01T03:56:34.013Z · LW(p) · GW(p)

Speaking only for myself, I'm in that awkward middle stage - I understand probability well enough to solve toy problems, and to follow explanations of it in real problems, but not enough to be confident in my own probabilistic interpretation of new problem domains. I'm looking forward to this sequence as part of my education and definitely appreciate seeing the formality behind the applications.

comment by Vaniver · 2011-11-30T22:55:52.053Z · LW(p) · GW(p)

I'm glad this is intuitive for you!

The reason I spotlighted labor-intensive methods is that this post is targeted at people who don't find this intuitive. I'd rather give them a method that can be extended to other situations with low risk (applying Bayes' Rule, imagining the world after receiving an update and calculating new probabilities) than have them identify symmetries in the problems and use those to quickly get answers.

The rest of the sequence uses this as background, but probability calculations play a secondary role. The techniques I'll discuss require a moderate level of comfort with probabilities, but not with probabilistic calculations- those can (and probably should) be offloaded to a calculator. The challenge is setting up the right problem, not solving a problem once you've set it up.

comment by malthrin · 2011-11-30T20:47:39.453Z · LW(p) · GW(p)

Can you elaborate on the calculation for S? I think it should be this, but I'm not confident in my math.

Replies from: Vaniver, cadac
comment by Vaniver · 2011-11-30T22:38:20.970Z · LW(p) · GW(p)

Yours was correct; editing the post. I skipped a step and that made my previous answer wrong.

comment by cadac · 2011-12-11T23:29:32.255Z · LW(p) · GW(p)

Maybe I'm missing something obvious here, but I'm unsure how to calculate P(S). I'd appreciate it if someone could post an explanation.

Replies from: malthrin
comment by malthrin · 2011-12-12T17:00:30.353Z · LW(p) · GW(p)

Sure. S results from HH or from TT, so we'll calculate those independently and add them together at the end. We'll do that by this equation: P(p=x|S) = P(p=x|HH) P(H) + P(p=x|TT) P(T).

We start out with a uniform prior: P(p=x) = 1. After observing one H, by Bayes' rule, P(p=x|H) = P(H|p=x) P(p=x) / P(H). P(H|p=x) is just x. Our prior is 1. P(H) is our prior, multiplied by x, integrated from 0 to 1. That's 1/2. So P(p=x|H) = x*1/(1/2) = 2x.

Apply the same process again for the second H. Bayes' rule: P(p=x|HH) = P(H|p=x,H) P(p=x|H) / P(H|H). The first term is still just x. The second term is our updated belief, 2x. The denominator is our updated belief, multiplied by x, integrated from 0 to 1. That's 2/3 this time. So P(p=x|HH) = x*2x/(2/3) = 3x^2.

Calculating tails is similar, except we update with 1-x instead of x. So our belief goes from 1, to 2-2x, to 3x^2-6x+3. Then substitute both of these into the original equation: (3/2)(x^2) + (3/2)(x^2 - 2x + 1). From there it's just a bit of algebra to get it into the form I linked to.

comment by beoShaffer · 2011-11-30T16:57:19.962Z · LW(p) · GW(p)

I am really happy to see more formal Bayes on LW. Ditto for decision analysis. They get talked about frequently but I don't usually see too much math being used. That said, I was slightly confused: it's pretty clear what cdf and pdf are in terms of how they are derived from probability density, but it's not quite clear what you mean by probability density. Am I overlooking/misunderstanding an explanation, or are we assumed to already know what it is?

Replies from: Vaniver
comment by Vaniver · 2011-11-30T19:19:58.063Z · LW(p) · GW(p)

A probability density is just like any other kind of density; it's the amount of probability per unit volume. (In one-dimension, the 'volume' equivalent is length.) You need it when you have a continuous belief space but not when you have a discrete belief space. If you're doing billiard ball physics with point masses, you don't need mass densities; likewise if you're comparing billiard ball beliefs rather than real ones (the weatherman doesn't say "Rainy" or "Sunny" but expresses a percentage) you don't need probability densities.

Replies from: beoShaffer
comment by beoShaffer · 2011-11-30T19:49:32.782Z · LW(p) · GW(p)

Ok, that makes sense.

comment by isionous · 2013-02-22T13:30:16.794Z · LW(p) · GW(p)

The wolfram alpha links in the article and previous comments seem to be broken in that the parentheses in the mathematical expression are missing, meaning that the links present readers with the wrong answer. It was rather confusing for me for a bit. You might want to update the links to something like this:

S pdf: http://www.wolframalpha.com/input/?i=integrate+3%2F2+*+%281+-+2*x+%2B+2*x^2%29+++from+0+to+1
D pdf: http://www.wolframalpha.com/input/?i=integrate+6+*+%28x+-+x^2%29+++from+0+to+1
comment by beoShaffer · 2011-11-30T19:19:09.489Z · LW(p) · GW(p)

Also, the page says that there are two comments but it is only loading one. Before I posted, it similarly said that there was one comment but displayed no comments. Does anyone know what's going on?

Replies from: Vaniver
comment by Vaniver · 2011-11-30T19:26:21.374Z · LW(p) · GW(p)

I have no clue, but I'm also seeing it as counting a phantom comment.