# Value of Information: Four Examples

post by Vaniver · 2011-11-22T23:02:43.098Z · LW · GW · Legacy · 60 comments

Value of Information (VoI) is a concept from decision analysis: how much answering a question allows a decision-maker to improve its decision. Like opportunity cost, it's easy to define but often hard to internalize; and so instead of belaboring the definition let's look at some examples.

## Gambling with Biased Coins

Normal coins are approximately fair.^{1} Suppose you and your friend want to gamble, and fair coins are boring, so he takes out a quarter and some gum and sticks the gum to the face of the quarter near the edge. He then offers to pay you $24 if the coin lands gum down, so long as you pay him $12 to play the game. Should you take that bet?

First, let's assume risk neutrality for the amount of money you're wagering. Your expected profit is $24*p*-12, where *p* is the probability the coin lands gum down. This is a good deal if *p*>.5, but a bad deal if *p*<.5. So... what's *p*? More importantly, how much should you pay to figure out *p*?

A Bayesian reasoner looking at this problem first tries to put a prior on *p*. An easy choice is a uniform distribution between 0 and 1, but there are a lot of reasons to be uncomfortable with that distribution. It might be that the gum will be more likely to be on the bottom- but it also might be more likely to be on the top. The gum might not skew the results very much- or it might skew them massively. You could choose a different prior, but you'd have trouble justifying it because you don't have any solid evidence to update on yet.^{2}

If you had a uniform prior and no additional evidence, then the deal as offered is neutral. But before you choose to accept or reject, your friend offers you *another* deal- he'll flip the coin once and let you see the result before you choose to take the $12 deal, but you can't win anything on this first flip. How much should you pay to see one flip?

Start by modeling yourself after you see one flip. It'll either land gum down or gum up, and you'll update and produce a posterior for each case. In the first case, your posterior on *p* is P(*p*)=2p; in the second, P(*p*)=2-2p. Your expected profit for playing in the first case is $4;^{3} your expected profit for playing in the second case is **negative** $4. You think there's a half chance it'll land gum side down, and a half chance it'll land gum side up, and if it lands gum side up *you can choose not to play*. There's a half chance you get $4 from seeing the flip, and a half chance you get nothing (because you don't play) from seeing the flip, and so $2 is the **VoI** of seeing one flip of the biased coin, *given your original prior*.
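The one-flip calculation can be sketched in a few lines of Python. This is only an illustration under the assumptions already stated (uniform prior, risk neutrality); the names `posterior_mean`, `expected_profit`, and `value_of_best_choice` are invented for this example. It leans on the standard fact that a uniform prior is a Beta(1, 1) distribution, so the posterior mean of *p* after some flips is (1 + gum-down count) / (2 + total flips).

```python
from fractions import Fraction

STAKE = 12    # dollars you pay to play
PAYOUT = 24   # dollars you receive if the coin lands gum down


def posterior_mean(gum_down: int, gum_up: int) -> Fraction:
    """Posterior mean of p after the observed flips, starting from a uniform prior."""
    return Fraction(1 + gum_down, 2 + gum_down + gum_up)


def expected_profit(mean_p: Fraction) -> Fraction:
    """Expected profit from playing; profit is linear in p, so only the mean matters."""
    return PAYOUT * mean_p - STAKE


def value_of_best_choice(mean_p: Fraction) -> Fraction:
    """You may decline to play, so the best choice is never worth less than 0."""
    return max(expected_profit(mean_p), Fraction(0))


prob_gum_down = posterior_mean(0, 0)  # predictive chance of gum down before any flip

value_without_flip = value_of_best_choice(posterior_mean(0, 0))
value_with_flip = (
    prob_gum_down * value_of_best_choice(posterior_mean(1, 0))
    + (1 - prob_gum_down) * value_of_best_choice(posterior_mean(0, 1))
)
voi_one_flip = value_with_flip - value_without_flip

print(expected_profit(posterior_mean(1, 0)))  # 4
print(expected_profit(posterior_mean(0, 1)))  # -4
print(voi_one_flip)                           # 2
```

Exact fractions are used so the $4, -$4, and $2 figures come out without rounding noise.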

Notice that, even though it'd be impossible to figure out the 'true' chance that the coin will land gum down, you can model how much it would be worth it to you to figure that out. If I were able to tell you *p* directly, then you could choose to gamble only when *p*>.5, and you would earn an average of $3.^{4} One coin flip gives you two thirds of the value that perfect information would give you.
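As a rough check on the $3 figure, here is a small Python sketch that integrates the profit over the region where you would choose to play. It assumes the uniform prior from above; `profit_antiderivative` is a helper name made up for the example, and the $2 one-flip value is simply copied from the text.

```python
from fractions import Fraction


def profit_antiderivative(p: Fraction) -> Fraction:
    """Antiderivative of the profit 24p - 12 with respect to p."""
    return 12 * p * p - 12 * p


# Under a uniform prior the density of p is 1 on [0, 1]. With perfect
# information you play only when p > 1/2, so integrate the profit there.
value_of_perfect_information = (
    profit_antiderivative(Fraction(1)) - profit_antiderivative(Fraction(1, 2))
)

voi_one_flip = Fraction(2)  # the one-flip figure from the text

print(value_of_perfect_information)                 # 3
print(voi_one_flip / value_of_perfect_information)  # 2/3
```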

Also notice that you need to **change your decision** to get any value out of more information. Suppose that, instead of letting you choose whether or not to gamble after seeing a flip, your friend made you decide up front, flipped the coin twice, and then paid you if the second flip landed gum down (provided you had paid him to play). The coin is flipped the same number of times, but you're worse off because you have to decide with less information.

It's also worth noting that multimodal distributions- where there are strong clusters rather than smooth landscapes- tend to have higher VoI. If we knew the biased coin would either always come up heads or always come up tails, and expected each case to be equally likely, then seeing one flip is worth $6, because it's a half chance of a guaranteed $12.
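To see the effect of clustering, here is a sketch that computes the value of one free flip for any prior concentrated on a handful of values of *p*. The function `voi_of_one_flip` and the "mildly biased" comparison prior are assumptions made for illustration; only the two-spike prior comes from the paragraph above.

```python
from fractions import Fraction


def voi_of_one_flip(prior, stake=12, payout=24):
    """VoI of one free flip, for a prior given as {p: probability of that p}."""

    def value_of_best_choice(dist):
        mean_p = sum(p * w for p, w in dist.items())
        return max(payout * mean_p - stake, Fraction(0))  # declining is worth 0

    prob_down = sum(p * w for p, w in prior.items())  # chance the flip lands gum down
    outcomes = [
        (prob_down, lambda p: p),          # flip lands gum down
        (1 - prob_down, lambda p: 1 - p),  # flip lands gum up
    ]

    value_with_flip = Fraction(0)
    for outcome_prob, likelihood in outcomes:
        if outcome_prob == 0:
            continue  # impossible outcome, contributes nothing
        posterior = {p: likelihood(p) * w / outcome_prob for p, w in prior.items()}
        value_with_flip += outcome_prob * value_of_best_choice(posterior)

    return value_with_flip - value_of_best_choice(prior)


half = Fraction(1, 2)
always_or_never = {Fraction(0): half, Fraction(1): half}      # the two-spike prior
mildly_biased = {Fraction(1, 4): half, Fraction(3, 4): half}  # hypothetical comparison

print(voi_of_one_flip(always_or_never))  # 6
print(voi_of_one_flip(mildly_biased))    # 3/2
```

The further apart the clusters sit, the more a single flip tells you about which one you are in, and the more that flip is worth.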

## Choosing where to invest

Here's an example I came across in my research:

Kleinmuntz and Willis were trying to determine the value of doing detailed anti-terrorism assessments in the state of California for the Department of Homeland Security. There are hundreds of critical infrastructure sites across the state, and it's simply not possible to do a detailed analysis of each site. There are terrorism experts, though, who can quickly provide an estimate of the risk to various sites.

They gave a carefully designed survey to those experts, asking them to rate the relative probability that a site would be attacked (conditioned on an attack occurring) and the probability that an attack would succeed on a scale from 0 to 10, and the scale of fatalities and economic loss on a logarithmic scale from 0 to 7. The experts were comfortable with the survey^{5} and able to give meaningful answers.

Now Kleinmuntz and Willis were able to take the elicited vulnerability estimates and come up with an estimated score for each facility. This estimated score gave them a prior over detailed scores for each site- if the experts all agreed that a site was a (0, 1, 2, 3), then that still implies a range over actual values. The economic loss resulting from a successful attack (3) could be anywhere from $100 million to $1 billion. (Notice that having a panel of experts gave them a natural way to determine the spread of the prior beyond the range inherent in their answers- where the experts agreed, they could clump the probability mass together, with only a little on answers the experts didn't give, and where the experts disagreed they knew where to spread the probability out over.) They already had, from another source, data on the effectiveness of the risk reductions available at the various sites and the costs of those reductions.

The highest actual consequence elicited was for $6 billion, assuming a value of $6 million per life. The highest VoI of getting a detailed site analysis, though, was only $1.1 *million*. From the definition, this shouldn't be that surprising- VoI is only large when you would be surprised or uncertainty is high. For some sites, it was obvious that DHS should invest in reducing risk; in others, it was obvious that DHS shouldn't invest in reducing risk. The detailed vulnerability analysis would just tell them what they already knew, and so wouldn't provide any value. Some sites were on the edge- it might be worthwhile to reduce risk, it might not. For those sites, a detailed vulnerability analysis would provide value- but because the site was on the edge, the expected value of learning more was necessarily small!^{6} Remember, for VoI to be positive you have to **change your decision**, and if that doesn't happen there's no VoI.
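A toy version of that argument, with made-up numbers that are not taken from the Kleinmuntz and Willis study: each hypothetical site has two equally likely scenarios for the net benefit of investing, and `voi_of_detailed_analysis` (a name invented here) compares deciding before and after learning which scenario holds.

```python
from fractions import Fraction


def voi_of_detailed_analysis(net_benefit_scenarios):
    """VoI for a yes/no investment, given [(probability, net benefit if you invest)]."""
    # Without the analysis: invest only if the expected net benefit is positive.
    expected = sum(prob * benefit for prob, benefit in net_benefit_scenarios)
    value_without = max(expected, 0)
    # With the analysis: learn the scenario first, then invest only if it pays.
    value_with = sum(prob * max(benefit, 0) for prob, benefit in net_benefit_scenarios)
    return value_with - value_without


half = Fraction(1, 2)
# Hypothetical net benefits of risk reduction, in $ millions.
sites = {
    "obviously invest": [(half, 50), (half, 80)],
    "obviously skip": [(half, -40), (half, -10)],
    "on the edge": [(half, -1), (half, 1)],
}

for name, scenarios in sites.items():
    print(name, voi_of_detailed_analysis(scenarios))
```

Only the edge site gets a nonzero value, and that value is small because the stakes on either side of the threshold are small.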

Distressingly, they went on to consider the case where risk reduction could not be performed without a detailed vulnerability analysis. Then, rather than measuring VoI, they were mostly measuring the value of risk reduction- and the maximum value shot up to $840 million. When Bayesian evidence is good enough, requiring legal evidence can be costly.^{7}

## Medical Testing

About two years ago, I was sitting at my computer and noticed a black dot on my upper arm. I idly scratched it, and then saw its little legs move.

It was a tick engorged on my blood, which I had probably picked up walking through the woods earlier. I removed it, then looked up online the proper way to remove it. (That's the wrong order, by the way: you need the information before you make your decision for it to be of any use. I didn't do it the proper way, and thus increased my risk of disease transmission.)

Some ticks carry Lyme disease, and so I looked into getting tested. I was surprised to learn that if I didn't present any symptoms by 30 days, the recommendation was against testing. After a moment's reflection, this made sense- tests typically have false positive rates. If I didn't have any symptoms after 30 days, even if I took the test and got a positive result the EV could be higher for no treatment than for treatment. In that case, the VoI of the test would be 0- **regardless of its outcome, I would have made the same decision.** If I saw symptoms, though, then the test would be worthwhile, as it could distinguish Lyme disease from an unrelated rash, headache, or fever. "Waiting for symptoms to appear" was the test with positive VoI, not getting a blood test right away.
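Here is a hedged sketch of that reasoning. Every number is invented for illustration (they are not real Lyme disease statistics), the function name `voi_of_test` is hypothetical, and the model makes the simplifying assumption that treatment always works and costs the same whether or not you are infected.

```python
def voi_of_test(prior, sensitivity, false_positive_rate,
                treatment_cost, untreated_disease_cost):
    """Expected reduction in loss from seeing the test result before deciding."""

    def best_expected_loss(prob_disease):
        # Either treat (pay the treatment cost) or don't (risk the untreated disease).
        return min(treatment_cost, prob_disease * untreated_disease_cost)

    prob_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    prob_negative = 1 - prob_positive
    posterior_if_positive = sensitivity * prior / prob_positive
    posterior_if_negative = (1 - sensitivity) * prior / prob_negative

    loss_without_test = best_expected_loss(prior)
    loss_with_test = (prob_positive * best_expected_loss(posterior_if_positive)
                      + prob_negative * best_expected_loss(posterior_if_negative))
    return loss_without_test - loss_with_test


# All numbers below are invented for illustration.
test = dict(sensitivity=0.9, false_positive_rate=0.05,
            treatment_cost=1.0, untreated_disease_cost=5.0)

no_symptoms = voi_of_test(prior=0.01, **test)    # test cannot change the decision
with_symptoms = voi_of_test(prior=0.30, **test)  # test can change the decision
print(round(no_symptoms, 6), round(with_symptoms, 6))
```

With the low prior, even a positive result leaves "don't treat" as the better choice, so the test is worth nothing in this model; with the higher prior, a negative result would switch the decision, and the test has positive value.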

One could argue that the blood test could have "peace of mind" value, but that's distinct from VoI. Even beyond that, it's not clear that you would get positive peace of mind on net. Suppose the test has a 2% false positive rate- what happens when you multiply the peace of mind from a true negative by .98, and subtract the cost of dealing with a false positive multiplied by .02? That could easily be negative.

(I remain symptom-free; either the tick didn't have Lyme disease, didn't transfer it to me, or my immune system managed to destroy it.)

## Choosing a Career

Many careers have significant prerequisites: if you want to be a doctor, you're going to have to go to medical school. People often have to choose where to invest their time with limited knowledge- you can't know what the career prospects will be like when you graduate, how much you'll enjoy your chosen field, and so on. Many people just choose based on accumulated experience- lawyers were high-status and rich before, so they suspect becoming a lawyer now is a good idea.^{8}

Reducing that uncertainty can help you make a better decision, and VoI helps decide what ways to reduce uncertainty are effective. But this example also helps show the limits of VoI: VoI is best suited to situations where you've done the background research and are now considering further experiments. With the biased coin, we started off with a uniform prior; with the defensive investments, we started off with estimated risks. Do we have a comparable springboard for careers?

If we do, it'll take some building. There's a lot of different value functions we could build- it probably ought to include stress, income (both starting and lifetime)^{9}, risk of unemployment, satisfaction, and status. It's not clear how to elicit weights on those, though. There's research on what makes people in general happy, but you might be uncomfortable just using those weights.^{10}

There are also hundreds, if not thousands, of career options available. Prior distributions on income are easy to find, but stress is harder to determine. Unemployment risk is hard to predict over a lifetime, especially as it relies on macroeconomic trends that may be hard to predict. (The BLS predicts employment numbers out 10 years from data that's a few years old. It seems unlikely that they're set up to see crashes coming, though.)

Satisfaction is probably the easiest place to start: there are lots of career aptitude tests out there that can take self-reported personality factors and turn that into a list of careers you might be well-suited for. Now you have a manageable decision problem- probably somewhere between six and twenty options to research in depth.

What does that look like from a VoI framework? You've done a first screening which has identified places where more information might *alter your decision*. If you faint at the sight of blood, it doesn't matter how much surgeons make, and so any time spent looking that up is wasted. If you do a quick scoring of the six value components I listed above (after brainstorming for other things relevant to you), just weighting them with those quick values may give you good preliminary results. Only once you know what comparisons are relevant- "what tradeoff between status and unemployment risk am I willing to make?"- would you spend a long time nailing down your weights.
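A quick scoring pass might look like the sketch below. The careers, scores, and weights are all hypothetical placeholders; the point is only to show how trying a second set of weights reveals which tradeoff is worth nailing down.

```python
COMPONENTS = ["low stress", "starting income", "lifetime income",
              "job security", "satisfaction", "status"]

# Hypothetical 0-10 scores from a quick first pass (higher is better).
CAREERS = {
    "engineer": [6, 7, 7, 7, 6, 6],
    "teacher":  [5, 4, 4, 8, 8, 5],
    "lawyer":   [2, 6, 8, 5, 4, 8],
}


def top_choice(weights):
    """Return the career with the highest weighted score."""
    def score(career):
        return sum(w * s for w, s in zip(weights, CAREERS[career]))
    return max(CAREERS, key=score)


equal_weights = [1, 1, 1, 1, 1, 1]
satisfaction_heavy = [1, 1, 1, 1, 4, 1]

print(top_choice(equal_weights))       # engineer
print(top_choice(satisfaction_heavy))  # teacher
```

In this made-up case the lawyer option never wins under either weighting, so refining its scores would be wasted effort, while the weight on satisfaction decides between the other two and is the one to pin down.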

This is also a decision problem that could take a long, long time. (Even after you've selected a career, the option to switch is always present.) It can be useful to keep upper and lower bounds for your estimates and update those along with your estimates- their current values and their changes with the last few pieces of information you found can give you an idea of how much you can expect to get from more research, and so you can finish researching and make a decision at a carefully chosen time, rather than when you get fatigued.

## Conclusion

Let's take another look at the definition: how much *answering* a question allows a decision-maker to *improve* its *decision*.

The "answering" is important because we need to consider all possible answers.^{11} We're replacing one random variable with two random variables- in the case of the biased coin, it replaced one unknown coin (one flip) with either the lucky coin or the unlucky coin (two flips- one to figure out which coin, one to bet on). When computing VoI, you can't just consider one possible answer, but all possible answers considering their relative likelihood.^{12}

The "improve" is important because VoI isn't about sleeping better at night or covering your ass. If you don't expect to change your decision after receiving this information, or you think that the expected value of the information (the chance you change your decision times the relative value of the decisions) is lower than the cost of the information, just bite the bullet and don't run the test you were considering.
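The rule in that parenthetical can be written down directly. This is a minimal sketch with hypothetical function names, using the rough approximation from the text (chance of changing your decision times the gain from changing it) rather than a full VoI calculation.

```python
def expected_value_of_information(prob_decision_changes, gain_when_changed):
    """Rough VoI: chance the answer changes your decision times what the change is worth."""
    return prob_decision_changes * gain_when_changed


def should_buy_information(prob_decision_changes, gain_when_changed, cost):
    """Buy only if the expected improvement exceeds the price."""
    return expected_value_of_information(prob_decision_changes, gain_when_changed) > cost


# The biased-coin flip: half the time you switch from "don't play" to a bet worth $4.
print(should_buy_information(0.5, 4.0, cost=1.0))  # True
print(should_buy_information(0.5, 4.0, cost=3.0))  # False
```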

The "decision" is important because this isn't just curiosity. Learning facts is often fun, but for it to fit into VoI some decision has to depend on that fact. When watching televised poker, you know what all the hands are- and while that may alter your enjoyment of the hand, it won't affect how any of the players play. You shouldn't pay much for that information, but the players would pay quite a bit for it.^{13}

1. Persi Diaconis predicts most human coin flips are fair to 2 decimals but not 3, and it's possible through training to bias coins you flip. With a machine, you can be precise enough to get the coin to come up the same way every time.

2. There is one thing that isn't coin-related: your friend is offering you this gamble, and probably has information you don't. That suggests the deal favors him- but suppose that you and your friend just thought this up, and so neither of you has more information than the other.

3. Your profit is 24*p*-12; your distribution on *p* is P(*p*)=2p, and so your expected profit is (24p-12)(2p) = 48p^{2}-24p integrated from 0 to 1, which is 16-12 = 4.

4. Again, your profit is 24*p*-12; you have a uniform distribution on what I will tell you about *p*, but you only care about the section where *p*>.5. Integrated from .5 to 1, that's 3.

5. Whenever eliciting information from experts, make sure to repeat back to them what you heard and ensure that they agree with it. You might know decision theory, but the reason you're talking to experts is because they know things you don't. Consistency can take a few iterations, and that's to be expected.

6. A common trope in decision analysis is "if a decision is hard, flip a coin." Most people balk at this because it seems arbitrary (and, more importantly, hard to justify to others)- but if a decision is hard, that typically means both options are roughly equally valuable, and so the loss from the coin flip coming up the wrong value is necessarily small.

7. That said, recommendations for policy-makers are hard to make here. Legal evidence is designed to be hard to game; Bayesian evidence isn't, and so Bayesian evidence is only "good enough" if it's not being gamed. Checking your heuristic (i.e. the experts' estimates) to keep it honest can provide significant value. Performing detailed vulnerability analysis on some (how many?) randomly chosen sites for calibration is often a good choice. Beyond that, I can't do much besides point you to psychology to figure out good ways to diagnose and reduce bias.

8. It doesn't appear that this is the case anymore. The supply of lawyers has dramatically increased, and so wages are declining; as well, law is a pretty soul-crushing field from a stress, work-life balance, and satisfaction perspective. If law looks like the best field for you and you're not in it for the money or status, the advice I hear is to specialize in a niche field that'll put food on the table but stay interesting and tolerably demanding.

9. Both of these capture different information. A job with a high starting salary but no growth prospects might translate into more happiness than a job with a low starting salary but high growth prospects, for example.

10. Most of the happiness/satisfaction literature I've seen has asked people about their attributes and their happiness/satisfaction. That's not a randomized trial, though, and so there could be massive selection effects. If we find that engineers are collectively less happy than waiters, does that mean engineering causes unhappiness, unhappiness causes engineering, that unhappiness and engineering are caused by the same thing, or none of those?

11. Compare this with information theory, where bits are a property of answers, *not* questions. Here, VoI is a property of questions, *not* answers.

12. If you already know the cost of the information, then you can stop computing as soon as you find a positive outcome good enough and likely enough that the VoI so far is higher than the cost.

13. In high-stakes poker games, the VoI can get rather high, and the deceit / reading involved is why poker is a more interesting game than, say, the lottery.

## 60 comments


## comment by Vaniver · 2011-11-21T21:11:20.440Z · LW(p) · GW(p)

Background: lukeprog wrote this post about articles he wouldn't have the time to write, and the first one on the list was something I was confident about, and so I decided to write a post on it. (As a grad student in operations research, practical decision theory is what I spend most of my time thinking about.)

Amusingly enough, I had the most trouble working in his 'classic example.' Decision analysis tends to hinge on Bayesian assumptions often referred to as "small world"- that is, your model is complete and unbiased (if you knew there was a bias in your model, you'd incorporate that into your model and it would be unbiased!). Choosing a career is more of a search problem, though- specifying what options you have is probably more difficult than picking from them. You can still use the VoI concept- but mostly for deciding when to stop accumulating new information. Before you've done your first research, you can't predict the results of your research very well, and so it's rather hard to put a number on how valuable looking into potential careers is.

There seems to be a lot of interest in abstract decision theory, but is there interest in more practical decision analysis? That's the sort of thing I suspect I could write a useful primer on, whereas I find it hard to care about, say, Sleeping Beauty.

## ↑ comment by steven0461 · 2011-11-21T21:42:59.387Z · LW(p) · GW(p)

> There seems to be a lot of interest in abstract decision theory, but is there interest in more practical decision analysis? That's the sort of thing I suspect I could write a useful primer on

Please do! This is exactly the sort of topic that should be LessWrong's specialty.

## ↑ comment by Kaj_Sotala · 2011-11-22T08:36:52.962Z · LW(p) · GW(p)

> There seems to be a lot of interest in abstract decision theory, but is there interest in more practical decision analysis? That's the sort of thing I suspect I could write a useful primer on

My reaction while reading this post was "whoa, this seems really valuable, and the sort of thing that should have been discussed on LW *years* ago". So yes, please write more.

## ↑ comment by Michael_Sullivan · 2011-11-24T02:29:44.232Z · LW(p) · GW(p)

I, too, find it hard to care about Sleeping Beauty, which is perhaps why this post is the first time in years of reading LW, that I've actually dusted off my math spectacles fully and tried to rigorously understand what some of this decision theory notation actually means.

So count me in for a rousing endorsement of interest in more practical decision theory.

## ↑ comment by thejash · 2011-11-24T05:21:39.800Z · LW(p) · GW(p)

Please write an article about "practical decision analysis". I tried to learn about this briefly before, but didn't learn anything useful. I must be missing the right keywords and phrases that are used in the field, so I would definitely appreciate an overview, or anything that helps improve everyday decision making.

## ↑ comment by Oscar_Cunningham · 2011-11-21T22:36:03.725Z · LW(p) · GW(p)

> Bayesian assumptions often referred to as "small world"- that is, your model is complete and unbiased.

Side question: Why are these called "small world" assumptions? I've heard the term before but didn't understand it there either.

## ↑ comment by Vaniver · 2011-11-22T13:51:09.458Z · LW(p) · GW(p)

I was introduced to the term by Binmore's Rational Decisions. Amusingly, he asks what small worlds are on page 2 but doesn't get around to answering the question until page 117.

Essentially, a "small world" is one in which you can "look before you leap." When playing Chess by the rules, you could in theory determine every possible position which could be legally reached from the current position. If you have a sufficiently good model of your opponent and know your decision strategy, you could even assign a probability on every terminal board position in that tree. (This world may not seem very small because there are combinatorially many states!)

A large world is one in which you cannot cross some bridges until you get to them. The example given by Binmore is that, at one time, people thought the world was flat; now, they think it's round. That's a process that *could* be described by Bayesian updating, but it's not clear that's the best way to do things. When I think the world is flat, does it make much sense to enumerate *every possible* way for the world to be non-flat and parcel out a bit of belief to each? I would argue against such an approach. Wait until you discover that the Earth is roughly spherical, then work from there. That is, parcel out some probability to "world is not flat" and then, when you get evidence for that, expand on it. In a "small world," everything is expanded from the beginning.

This happens in many numerical optimization problems. Someone in my department (who defended their PhD yesterday, actually) was working on a decision model for Brazilian hydroelectric plants. They have to decide how much water stored in dams to use every month, and face stochastic water inflows. The model looks ahead by four years to help determine how much water to use this month, but it only tells you how much water to use this month. There's no point in computing a lookup table for next month, because next month you can take the actual measurements for the most recent month (which you have probability ~0 to predict exactly) and solve the model again, looking ahead four years based on the most recent data.

## ↑ comment by Kaj_Sotala · 2011-11-22T08:45:01.578Z · LW(p) · GW(p)

I presume it's because actually having a *complete* model about a problem requires looking at a problem that is small enough that you can actually know all the relevant factors. This is in contrast to e.g. problems in the social sciences, where the amount of things that might possibly affect the result - the size of the world - is large enough that you can never have a complete model.

As another example, many classic AI systems like SHRDLU fared great when in small, limited domains where you could hand-craft rules for everything. They proved pretty much useless in larger, more complex domains where you ran into a combinatorial explosion of needed rules and variables.

## ↑ comment by thomblake · 2011-11-21T22:43:26.741Z · LW(p) · GW(p)

I had assumed that the term related to small-world network (math) though it doesn't seem to have quite the same application.

## comment by JenniferRM · 2011-11-22T19:50:49.065Z · LW(p) · GW(p)

At a meetup on Sunday I made an appeal to the value of information and it wasn't simply the end of that line of conversation because I ran into inferential distance problems... and then this article shows up the next day! Thanks for writing it :-)

You mentioned in passing in the comments that you were a grad student in OR like it wasn't a big deal, but from my perspective that's a gigantic and fascinating area in itself. You probably have the ability to write several articles popularizing relatively basic content in operations research and my guess is that they would all end up on the main page. One of them that would probably be good is an article on the educational paths and career prospects in operations research like which grad schools are the best and how to get into them and what you can do with it afterwards. I suspect that a lot of people on LW would enjoy such work, be good at it, and be at a stage of life where they could put the advice into practice.

## comment by NancyLebovitz · 2011-11-21T22:20:44.922Z · LW(p) · GW(p)

My prior assumption is that if someone offers me a weird bet about which I have very little information, the odds are stacked against me some way or other. The only reason I'd take the bet would be if the amount is small enough that I want to see what the trick is.

## ↑ comment by Vaniver · 2011-11-22T14:07:25.302Z · LW(p) · GW(p)

See footnote 2.

My favorite story along these lines (that I bought for $20):

I'm walking through Atlanta when I'm stopped by a man who seems likely to be homeless. He compliments my shoes, and tells me that, even beyond that, he recognizes the kind of shoe. He bets me $10 that he can tell me the city and state where I got those shoes.

I stop and think for a moment. $10 is small enough, and I'm curious to see what the trick is, so I agree, but ask for a few seconds to remember where I got the shoes. He nods and patiently waits, then I tell him to go ahead. He launches into a clearly prepared speech, which ends with "Never play a man at his own game. I didn't say I'd tell you where you bought, purchased, or acquired those shoes- I told you I'd tell you where you got them. And you got them on your feet in Atlanta, Georgia."

I smiled, pulled out my wallet, and didn't have a $10 bill, so I gave him a $20.

(The gum example is a great way to trick your friends, incidentally- most people's intuitions are wrong about which side is more likely.)

## ↑ comment by Vladimir_Nesov · 2011-11-22T14:05:37.457Z · LW(p) · GW(p)

This is discussed in footnote 2. (And focusing on this problem isn't in the spirit of LCPW.)

## comment by steven0461 · 2011-11-22T00:54:39.351Z · LW(p) · GW(p)

Here's a useful rule: if you're faced with a choice between two alternatives, and you have some probability distribution for the difference in utility between them, it's a mistake to pay more than one standard deviation's worth of utility for knowing which is better. (If the distribution is symmetric, it's a mistake to pay more than half a standard deviation.)

(Because the expected benefit from switching is less than (half) the expected absolute deviation, which is less than (half) the standard deviation. Right?)
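A quick numerical sanity check of the bound, as a sketch only: it treats the utility difference as a small set of equally likely values (an assumption made to keep the example short), and `voi_and_std` is a name invented here.

```python
import math


def voi_and_std(differences):
    """For equally likely utility differences D = U(b) - U(a), return (VoI, std of D)."""
    n = len(differences)
    mean = sum(differences) / n
    value_without = max(mean, 0.0)                          # pick the better option on average
    value_with = sum(max(d, 0.0) for d in differences) / n  # pick the better option each time
    std = math.sqrt(sum((d - mean) ** 2 for d in differences) / n)
    return value_with - value_without, std


for differences in ([-1, 1], [-3, 1, 2], [10, 15, 20]):
    voi, std = voi_and_std(differences)
    print(differences, voi, std, voi <= std)
```

For the symmetric case `[-1, 1]` the value of information is exactly half a standard deviation, so the symmetric version of the bound is tight there.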

I think it would be cool to have a list of other results like this that you could use to get bounds and estimates for the value of information in concrete situations.

## ↑ comment by twanvl · 2011-11-23T14:43:34.464Z · LW(p) · GW(p)

This is not quite correct.

Suppose you know that difference in utility has a uniform distribution between 10 and 20. Then you already know which of the alternatives is better. So you shouldn't pay the standard deviation's worth (which is 2.88675).

The mean of the difference matters much more than the standard deviation. Math will follow.

## ↑ comment by twanvl · 2011-11-23T15:16:43.895Z · LW(p) · GW(p)

Math, as promised.

Suppose that the difference in utility is uniformly distributed,

```
U(b) - U(a) ~ Uniform(u,v)
```

Assume for simplicity that U(a)=0 and that E[U(b)] > 0, so that b is the better choice if there is no more information.

```
E[U(optimal|noinfo)] = E[U(b)] = (u+v)/2
E[U(optimal|info)] = integral_u^v (if x<0 then 0 else x) / (v-u) dx
  = if 0 <= u <= v then (u+v)/2
    if u <= 0 <= v then v/(v-u) * (0+v)/2 = v^2/(2(v-u))
```

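So, if u < 0 < v, the information is worth v^2/(2(v-u)) - (u+v)/2 = u^2/(2(v-u)); if u >= 0, it is worth nothing, because you already know that U(b) > U(a).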
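The uniform case can be turned into a small function. This is a sketch under the same assumptions as the derivation above (U(a) = 0 and a non-negative mean, so b is the default choice); the name `voi_uniform` is made up for the example.

```python
from fractions import Fraction


def voi_uniform(u, v):
    """VoI when U(b) - U(a) ~ Uniform(u, v), assuming u < v and mean (u + v)/2 >= 0."""
    if u >= 0:
        return Fraction(0)  # b is certainly at least as good; the answer changes nothing
    value_with_info = Fraction(v * v, 2 * (v - u))  # v^2 / (2(v - u))
    value_without_info = Fraction(u + v, 2)         # always choose b
    return value_with_info - value_without_info     # simplifies to u^2 / (2(v - u))


print(voi_uniform(10, 20))  # 0
print(voi_uniform(-1, 3))   # 1/8
```

The first call is the Uniform(10, 20) counterexample: the standard deviation is about 2.89, but the information is worth nothing.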

Now suppose the difference is normally distributed with mean m and standard deviation s:

```
U(b) - U(a) = U(b) ~ Normal(m,s)
```

Then

```
E(U|no info) = E[U(b)] = m
E(U|info) =
-- thank you, mathematica
Assuming[s > 0, Integrate[x PDF[NormalDistribution[m, s], x], {x, 0, Infinity}]]
= 1/2 (m + Exp[-m^2/(2 s^2)] Sqrt[2/pi] s + m Erf[m/(Sqrt[2] s)])
= s*normpdf(m/s) + m*normcdf(m/s)
```

A reasonable approximation seems to be

```
E[U|info] ~= 0.4 s Exp[-2 (m/s)] + m
```

So, you should be willing to pay 0.4*s*Exp[-2 (m/s)]. That means that you should pay exponentially less for each standard deviation that the mean is greater than 0.
When the mean difference is 0, so when both are a priori equally likely, the information is worth s/sqrt(2pi) ~= 0.4 s.
When the mean difference is one standard deviation in favor of b, the information is only worth 0.0833155 s.

To summarize: the more sure you are of which choice is best, the less the information that tells you that for certain is worth.
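For anyone who wants to plug in their own numbers, here is a sketch of the normal case using only Python's standard library. It assumes m >= 0 as in the derivation; the function names are invented, and `voi_normal_approx` is just the rough exponential fit quoted above, not an exact formula.

```python
import math


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)


def normal_cdf(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def voi_normal(m, s):
    """Exact VoI when U(b) - U(a) ~ Normal(m, s), with m >= 0 so b is the default."""
    z = m / s
    return s * normal_pdf(z) + m * normal_cdf(z) - m


def voi_normal_approx(m, s):
    """The rough fit 0.4 s exp(-2 m/s)."""
    return 0.4 * s * math.exp(-2 * m / s)


print(round(voi_normal(0, 1), 4))         # 0.3989
print(round(voi_normal(1, 1), 4))         # 0.0833
print(round(voi_normal_approx(1, 1), 4))  # 0.0541
```

At one standard deviation the fit gives about 0.054 s against the exact 0.083 s, so it is better read as an order-of-magnitude guide than as a precise value.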

## ↑ comment by steven0461 · 2011-11-23T20:06:16.259Z · LW(p) · GW(p)

> To summarize: the more sure you are of which choice is best, the less the information that tells you that for certain is worth.

Yes, but that was clear without math.

> So, you should be willing to pay 0.4\*s\*Exp[-2 (m/s)]. That means that you should pay exponentially less for each standard deviation that the mean is greater than 0. When the mean difference is 0, so when both are a priori equally likely, the information is worth s/sqrt(2pi) ~= 0.4 s. When the mean difference is one standard deviation in favor of b, the information is only worth 0.0833155 s.

Thanks, I could see the 0.4 and 0.08 becoming useful rules of thumb. How much does it matter that you assumed symmetry and no fat tails?

## ↑ comment by steven0461 · 2011-11-23T20:02:08.634Z · LW(p) · GW(p)

I said "it's a mistake to pay more than one standard deviation's worth", not "one should pay exactly a standard deviation's worth".

## comment by **[deleted]** · 2011-11-23T01:49:09.350Z · LW(p) · GW(p)

This is highly related to Bayesian surprise, the KL-divergence between a posterior and prior distribution. I'm actually working on some research right now to try to see in the setting of PAC-learning whether correlated statistical queries regarding Bayesian surprise buy you something in terms of learning efficiency. If a theoretical result can be found, it would suggest that "salience" truly is a good way to focus attention and so proxies for salience w.r.t. Nature's distribution would be advisable in practice much the same way that proxies for Kolmogorov complexity are advisable in practice when simplicity is of importance (such as gzip which is actually used when studying genetic sequences).

A further result I would like to think about is what role Bayesian surprise queries (or VoI queries) might play in scientific research at large. To me, VoI-type reasoning leads me to believe our current journal / conference system is deeply flawed if the goal is to update some sort of community prior to a better posterior efficiently. It will be interesting to see where this leads.

## comment by Dr_Manhattan · 2011-11-22T13:09:34.907Z · LW(p) · GW(p)

Great post on a very important topic.

One suggestion: pictures would help a lot here. Norvig's AIMA has some very nice illustrations in ch 16 (I think)

P(p)=2x; in the second, P(p)=2-2x

Not sure how you worked this out. Not clear what X is

Replies from: Vaniver, darius

## ↑ comment by Vaniver · 2011-11-22T14:01:41.201Z · LW(p) · GW(p)

I like the idea of having pictures but I do not like the idea of procuring pictures. I'll make it a higher priority for future posts, though, and if someone wants to send me pictures (which I can legally use) for this post I'll be happy to edit them in.

I replaced the "x"s with "p"s; hopefully that'll make it a bit clearer.

We start off with a prior P(p)=1. That is, I think every p is equally likely, and when I integrate over the domain of p (from 0 to 1) I get 1, like I should.

Then I update on seeing heads. For each p value, the chance I saw heads was p- and so I expect my function to have the functional form P(p)=p. Notice that after seeing heads I think the mode is a coin that always lands on heads and that it's impossible that the coin always lands on tails- both are what I expect. When I integrate p from 0 to 1, though, I get 1/2. I need to multiply it by 2 to normalize it, and so we have P(p)=2p.

This might look odd at first because it sounds like the probability of the coin always landing on heads is 2, which suggests an ill-formed probability. That's the probability *density*, though- right now, my prior puts *0* probability on the coin *always* landing on heads, because that's an integral with 0 width.

The 2-2x comes from the same argument, but the form is now 1-x.
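The normalization argument above can be checked numerically. This is a minimal sketch, assuming a simple midpoint-rule integral over p; `normalized_posterior` is a made-up helper name.

```python
def normalized_posterior(likelihood, prior, steps=100_000):
    """Return the posterior density over p in [0, 1] for a given likelihood."""
    width = 1.0 / steps
    grid = [(i + 0.5) * width for i in range(steps)]
    # The normalizing constant is the integral of likelihood * prior over p.
    evidence = sum(likelihood(p) * prior(p) for p in grid) * width
    return lambda p: likelihood(p) * prior(p) / evidence

uniform = lambda p: 1.0
after_heads = normalized_posterior(lambda p: p, uniform)        # expect 2p
after_tails = normalized_posterior(lambda p: 1.0 - p, uniform)  # expect 2 - 2p

print(after_heads(1.0), after_heads(0.25))
print(after_tails(0.0), after_tails(0.25))
```

The density of 2 at p = 1 is not a probability of 2; only integrals of the density over an interval are probabilities.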

Replies from: Michael_Sullivan, Dr_Manhattan

## ↑ comment by Michael_Sullivan · 2011-11-24T02:26:03.795Z · LW(p) · GW(p)

I'm not sure it isn't clearer with 'x's, given that you have two different kinds of probabilities to confuse.

It may just be that there's a fair bit of inferential distance to clear, though, in presenting this notation at all.

I have a strong (if rusty) math background, but I had to reason through exactly what you could possibly mean down a couple different trees (one of which had a whole comment partially written asking you to explain certain things about your notation and meaning) before it finally clicked for me on a second reading of your comment here *after* trying to explain my confusion in formal mathematical terms.

I think a footnote about what probability distribution functions look like and what the values actually represent (densities, rather than probabilities), and a bit of work with them would be helpful. Or perhaps there's enough inferential work there to be worth a whole post.

Replies from: Vaniver

## ↑ comment by Dr_Manhattan · 2011-11-22T14:24:20.899Z · LW(p) · GW(p)

Just to be clear I was not suggesting ripping off their illustration, but it is a very good one worthy of legal emulation :).

Replies from: Vaniver

## comment by gwern · 2011-11-23T16:20:04.933Z · LW(p) · GW(p)

Wikipedia: http://en.wikipedia.org/wiki/Value_of_information

## comment by gwern · 2012-05-18T23:46:36.110Z · LW(p) · GW(p)

My attempt to apply this to my own experiments with nootropics etc: http://lesswrong.com/r/discussion/lw/cih/value_of_information_8_examples/

## comment by gwern · 2012-05-17T03:47:52.862Z · LW(p) · GW(p)

VoI example: running studies http://ije.oxfordjournals.org/content/30/4/771.full

## comment by chaosmosis · 2012-05-02T19:59:59.883Z · LW(p) · GW(p)

8 It doesn't appear that this is the case anymore. The supply of lawyers has dramatically increased, and so wages are declining; as well, law is a pretty soul-crushing field from a stress, work-life balance, and satisfaction perspective. If law looks like the best field for you and you're not in it for the money or status, the advice I hear is to specialize in a niche field that'll put food on the table but stay interesting and tolerably demanding.

I was wondering if this is also the case for medical doctors?

Replies from: Vaniver

## ↑ comment by Vaniver · 2012-05-07T03:05:36.759Z · LW(p) · GW(p)

Medical schools limit their seats much more effectively than law schools do, I believe, and so the supply of doctors hasn't increased as much as the supply of lawyers. I do hear that medicine is a less attractive field now than it was a decade or two ago, but that seems to be because of technological and legal changes more than economic changes. I am not a doctor or an expert in career choice, though.

## comment by ksvanhorn · 2011-12-24T19:19:34.950Z · LW(p) · GW(p)

"But this example also helps show the limits of VoI: VoI is best suited to situations where you've done the background research and are now considering further experiments."

Do you mean this in the sense that there is usually some low-hanging fruit (e.g. the background research itself) where the VOI is obviously so high that there's no need to calculate it -- you obviously should obtain the information?

I think Douglas Hubbard, author of How to Measure Anything, makes a good case for making VOI calculations the default in important decisions. When he acts as a consultant, he first spends a couple of days training the decision makers in calibrating their probability assessments, and then they do a VOI calculation for all the important unknowns, based on those subjective probabilities. It's often precisely those questions for which they can't get much from background research and haven't even considered measuring -- because they don't know how to -- that have the highest VOI.

Maybe these cases are atypical, as they are biased towards difficult decisions that warrant hiring a consultant. But difficult decisions are the *raison d'etre* for the field of decision analysis.

## ↑ comment by Heka · 2013-12-26T23:11:53.938Z · LW(p) · GW(p)

Hubbard talks about measurement inversion. "In a business case, the economic value of measuring a variable is usually inversely proportional to how much measurement attention it usually gets." This thread contains discussion about possible reasons for it. The easiness/familiarity aspect that you imply is probably one of them. Others include the declining marginal value of information on a given subject and the destabilizing effect new measurements might have on an organization.

It's easy to imagine that measurement inversion would also apply to common measurements in personal life.

## ↑ comment by beoShaffer · 2011-12-24T19:40:03.264Z · LW(p) · GW(p)

When he acts as a consultant, he first spends a couple of days training the decision makers in calibrating their probability assessments, and then they do a VOI calculation for all the important unknowns, based on those subjective probabilities.

As I'm not Vaniver I can't say for sure, but the way I read it, doing the background research refers to the step where you get the subjective probabilities and identify the important unknowns.

## comment by Luke_A_Somers · 2011-11-25T23:01:28.105Z · LW(p) · GW(p)

(I remain symptom-free; either the tick didn't have Lyme disease, didn't transfer it to me, or my immune system managed to destroy it.)

... or you're one of the many people who get Lyme disease without presenting the typical symptoms the first time round. I know someone whose entire family has it and not one of them ever got the typical rash (and if you get that rash, there's really no missing it).

Replies from: Vaniver

## comment by Anubhav · 2012-03-04T14:16:20.882Z · LW(p) · GW(p)

If it helps anyone, I found the explanation in the video here (under "Decision Models") much easier to follow than Vaniver's.

I mean, hey, it has pretty pictures!!

Don't get me wrong, this post is excellent. And it's much more detailed than the video. But I ran into an inferential chasm the first time I tried to read it, and I can't be the only one.

## comment by capybaralet · 2016-09-26T22:48:41.335Z · LW(p) · GW(p)

Does anyone have any insight into how VoI plays with Bayesian reasoning?

At a glance, it looks like the VoI is usually not considered from a Bayesian viewpoint, as it is here. For instance, wikipedia says:

""" A special case is when the decision-maker is risk neutral where VoC can be simply computed as; VoC = "value of decision situation with perfect information" - "value of current decision situation" """

From the perspective of avoiding wireheading, an agent should be incentivized to gain information even when this information decreases its (subjective) "value of decision situation". For example, consider a bernoulli 2-armed bandit:

If the agent's prior over the arms is uniform over [0,1], so its current value is .5 (playing arm1), but after many observations, it learns that (with high confidence) arm1 has reward of .1 and arm2 has reward of .2, it should be glad to know this (so it can change to the optimal policy, of playing arm2), BUT the subjective value of this decision situation is less than when it was ignorant, because .2 < .5.

Replies from: Vaniver

## ↑ comment by Vaniver · 2016-09-27T21:38:12.948Z · LW(p) · GW(p)

There shouldn't be any conflicts between VoI and Bayesian reasoning; I thought of all of my examples as Bayesian.

From the perspective of avoiding wireheading, an agent should be incentivized to gain information even when this information decreases its (subjective) "value of decision situation". For example, consider a bernoulli 2-armed bandit:

I don't think that example describes the situation you're talking about. Remember that VoI is computed in a forward-looking fashion; when one has a (1, 1) beta distribution over the arm, one thinks it is equally likely that the true propensity of the arm is above .5 and below .5.

The VoI comes into that framework by being the piece that agitates for exploration. If you've pulled arm1 seven times and gotten four heads and three tails, and haven't pulled arm2 yet, the expected value of pulling arm1 is higher than pulling arm2 but there's a fairly substantial chance that arm2 has a higher propensity than arm1. Heuristics that say to do something like pull the lever with the higher 95th percentile propensity bake in the VoI from pulling arms with lower means but higher variances.
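A rough Monte Carlo sketch of that situation, assuming uniform (Beta(1, 1)) priors on both arms so that arm1's posterior after four heads and three tails is Beta(5, 4). The helper names, sample count, and seed are arbitrary choices for illustration.

```python
import random

def posterior_samples(heads, tails, n, rng):
    # A uniform prior plus observed flips gives a Beta(heads + 1, tails + 1) posterior.
    return [rng.betavariate(heads + 1, tails + 1) for _ in range(n)]

def percentile(samples, q):
    """Empirical q-quantile of a list of samples (0 <= q <= 1)."""
    ordered = sorted(samples)
    return ordered[int(q * (len(ordered) - 1))]

rng = random.Random(0)
n = 100_000
arm1 = posterior_samples(4, 3, n, rng)  # pulled seven times
arm2 = posterior_samples(0, 0, n, rng)  # never pulled

mean1 = sum(arm1) / n
mean2 = sum(arm2) / n
p_arm2_better = sum(b > a for a, b in zip(arm1, arm2)) / n

print(mean1, mean2)      # arm1 has the higher posterior mean
print(p_arm2_better)     # yet arm2 is better a large fraction of the time
print(percentile(arm1, 0.95), percentile(arm2, 0.95))
```

Under these priors the exact values are 5/9 for arm1's mean, 1/2 for arm2's, and 4/9 for the chance that arm2 is actually better, while the 95th-percentile rule favors the unexplored arm2.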

If, from a forward-looking perspective, one does decrease their subjective value of decision situation by gaining information, then one shouldn't gain that information. That is, it's a bad idea to pay for a test if you don't expect the additional value to cover the cost of the test. (Maybe you'll continue to pull arm1, regardless of the results of pulling arm2, as in the case where arm1 has delivered heads 7 times in a row. Then switching means taking a hit for nothing.)

One thing that's important to remember here is conservation of expected evidence--if I believe now that running an experiment will lead me to believe that arm1 has a propensity of .1 and arm2 has a propensity of .2, then I should *already believe* those are the propensities of those arms, and so there's no subjective loss of well-being.
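To make the forward-looking view concrete for the two-armed bandit with independent uniform priors, here is a small grid sketch. It assumes, purely for illustration, that the agent could learn both propensities exactly and at no cost before committing to an arm.

```python
steps = 400
width = 1.0 / steps
grid = [(i + 0.5) * width for i in range(steps)]

# Conservation of expected evidence: averaged over the prior, the propensity
# you expect to end up believing for an arm equals its prior mean.
expected_learned_propensity = sum(grid) * width

# Value per pull if you must commit to an arm now.
value_now = expected_learned_propensity

# Value per pull if you first learn both propensities and then play the
# better arm, averaged over everything you might learn.
value_informed = sum(max(a, b) for a in grid for b in grid) * width * width

print(value_now, value_informed, value_informed - value_now)
```

The informed value comes out near 2/3 against 1/2 for committing blind. Learning (.1, .2) is one of the disappointing outcomes, but it is outweighed in the prior average by outcomes where at least one arm turns out well.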

## comment by **[deleted]** · 2016-09-23T11:41:10.609Z · LW(p) · GW(p)

Okay, VoI aside, how would you bet in the following setup:

There are three randomly chosen 5-copeck coins. Each one is dropped 20 times with no gum (A0, B0, C0). Then a piece of gum is attached to the heads of Coin A and it is dropped 20 times (AGH); the same is done with gum on the tails of Coin A (AGT), on the heads (BGH) and tails (BGT) of Coin B, and on the tails of Coin C (CGT). Finally, Coin C is dropped three times, the gum is attached to the side that came up in two of them, and the coin is dropped twenty more times (CGX). The numbers are as follows:

- Coin A: A0 heads 14/20, AGT heads 10/20, AGH heads 7/20.
- Coin B: B0 heads 8/20, BGT heads 8/20, BGH heads 8/20 (I guess I need to hoard this one).
- Coin C: C0 heads 10/20, CGT heads 11/20, CGX heads 14/20.

To which side of Coin C was the gum applied in CGX?

## comment by **[deleted]** · 2016-09-17T21:03:29.519Z · LW(p) · GW(p)

Where do you get the exact "half-chance of nothing because you don't play"? How do you decide to play or not, given a favorable outcome of the test run?

Replies from: Vaniver

## ↑ comment by Vaniver · 2016-09-18T22:27:06.383Z · LW(p) · GW(p)

You should play if the expected value is positive, and not if it's negative. If the test run results in heads, then the posterior probability is 2/3rds and 24*2/3-12=4, which is positive. If the test run results in tails, then the posterior probability is 1/3rd and 24*1/3-12=-4, which is negative.

(Why is the posterior probability 2/3 or 1/3? Check out footnote 3, or Laplace's Rule of Succession.)
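Putting those numbers together gives the value of the test flip. This sketch assumes the uniform prior from the post, so the test flip comes up heads half the time; the names are illustrative.

```python
from fractions import Fraction

PAYOUT, STAKE = 24, 12

def expected_profit(p_win):
    """Expected profit from playing when the winning side has probability p_win."""
    return PAYOUT * p_win - STAKE

def value_of_best_choice(p_win):
    # Declining to play is always available and is worth 0.
    return max(expected_profit(p_win), 0)

prior = Fraction(1, 2)          # uniform prior: winning side has probability 1/2
p_after_heads = Fraction(2, 3)  # posterior after a heads test flip
p_after_tails = Fraction(1, 3)  # posterior after a tails test flip

value_without_flip = value_of_best_choice(prior)
value_with_flip = (Fraction(1, 2) * value_of_best_choice(p_after_heads)
                   + Fraction(1, 2) * value_of_best_choice(p_after_tails))

print(value_with_flip - value_without_flip)  # 2
```

The "half-chance of nothing" is the tails branch: playing there would lose $4 on average, so the best choice is to decline and take 0.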

## comment by **[deleted]** · 2016-09-17T20:43:03.950Z · LW(p) · GW(p)

But what if your friend offers to stick the gum to any other coin and let you see which way it lands, to get a feel for how the gum "might" affect the result*, and then offers you this deal? How would you calculate VoI then?

\* I ask because I often run into the difference between "physiological" and "ecological" approaches. In the first instance, you might study (for example) "Plant X with/without Fungus Y0 and/or Bacteria Z0" microscope slides, where you carefully inoculate X. In the second, you make slides from X collected in the wild, with who-knows-what growing in it, and have to say if it has Y1 or Z1 or anything at all. I mean, having a previous "physiological" study at hand sure helps, but... are there any quantitative estimates on *how much*?

## ↑ comment by Vaniver · 2016-09-18T22:44:31.038Z · LW(p) · GW(p)

This tends to be very context dependent; I don't know enough about biology to estimate. The main caution here is that people tend to forget about regression to the mean (if you have a local measurement X that's only partly related to Y, you should not just port your estimate from X over to Y, but move it closer to what you would have expected from Y beforehand).

## comment by prase · 2011-11-25T13:05:40.987Z · LW(p) · GW(p)

Your expected profit is $24p-12, where p is the probability the coin lands gum down. This is a good deal if p>.5, but a bad deal if p<.5.

You surely wanted 0.3333, not 0.5.

Replies from: twanvl

## comment by Johnicholas · 2011-11-22T16:48:53.648Z · LW(p) · GW(p)

I'm a little confused about "value of information" as a precise concept.

Suppose that you have a tree with two kinds of interior nodes in it, and leaves labeled with utilities. One interior node is a choice node, and the other is a nature node, with an associated distribution over its subtrees. It's fairly obvious that you can work backwards up the tree, and find both an optimum strategy and an expected value of the entire tree.

However, I don't see where "value of information" shows up in this framework anywhere. Do I need to distinguish some choice nodes as "information gathering", and apply a default strategy to them as "don't gather any information", and then compute value of information as the difference between best I can do with my eyes closed and the best I can do flat out?

What if there is no natural partition of some actions as information gathering and not-information-gathering?

Is there some homomorphism from the first tree (which is normally a tree of evidence or belief states) to an "externally visible" tree, where two nodes are identified if the only difference is inside my head?

Replies from: Vaniver, TruePath

## ↑ comment by Vaniver · 2011-11-22T19:18:09.759Z · LW(p) · GW(p)

However, I don't see where "value of information" shows up in this framework anywhere. Do I need to distinguish some choice nodes as "information gathering", and apply a default strategy to them as "don't gather any information", and then compute value of information as the difference between best I can do with my eyes closed and the best I can do flat out?

Think of VoI as going in the reverse direction. That is, beforehand you would have modeled your test outcome as a nature node because you didn't consider the option of not running the test. Now you stick in a choice node of "run the test" that leads to the nature node of the test output on the one branch, and the tree where you don't know the test output on the other branch. Like you suggest, you then use the work-backwards algorithm to figure out the optimal decision at the "run the test" node, and the difference between the branch node values is the absolute value of the VoI minus the test cost.
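As an illustration of that construction, here is a minimal sketch of a tree with choice nodes and nature nodes and the work-backwards evaluation, applied to the post's coin game with a free test flip. The tuple encoding and the function names are assumptions made for this example.

```python
from fractions import Fraction

def choice(**options):
    """A node where the decision-maker picks the best option."""
    return ("choice", options)

def nature(*branches):
    """A node where chance picks a branch; branches are (probability, subtree) pairs."""
    return ("nature", branches)

def rollback(node):
    """Work backwards up the tree and return its expected value."""
    if not isinstance(node, tuple):
        return node  # leaf: a utility
    kind, body = node
    if kind == "choice":
        return max(rollback(sub) for sub in body.values())
    return sum(prob * rollback(sub) for prob, sub in body)

def game(p_win):
    # Net payoffs: win $24 after paying $12, or lose the $12 stake.
    return choice(play=nature((p_win, 12), (1 - p_win, -12)), decline=0)

half = Fraction(1, 2)
skip_test = game(half)  # decide on the prior alone
run_test = nature((half, game(Fraction(2, 3))),   # test flip comes up heads
                  (half, game(Fraction(1, 3))))   # test flip comes up tails
tree = choice(run_test=run_test, skip_test=skip_test)

print(rollback(run_test) - rollback(skip_test))  # 2
```

A test with a cost would subtract that cost from every leaf under the `run_test` branch, and the root choice node would then decide whether the test is worth running.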

What if there is no natural partition of some actions as information gathering and not-information-gathering?

Then VoI won't help you very much. VoI is a concept that helps in specifying decision problems- building the tree- not computing a tree to find an optimal policy. It suggests modeling information-gathering activities as choice nodes leading to nature nodes, rather than just nature nodes. If you've got a complete decision problem already, then you don't need VoI.

I should point out that most tests aren't modeled as just information-gathering. If a test is costless, then why not run it, even if you throw the results away? Typically the test will have some cost, in either utility or prospects, and so in some sense there's rarely actions that are purely information gathering.

Replies from: TruePath

## ↑ comment by TruePath · 2011-12-20T08:53:51.474Z · LW(p) · GW(p)

Think of VoI as going in the reverse direction. That is, beforehand you would have modeled your test outcome as a nature node because you didn't consider the option of not running the test. Now you stick in a choice node of "run the test" that leads to the nature node of the test output on the one branch, and the tree where you don't know the test output on the other branch. Like you suggest, you then use the work-backwards algorithm to figure out the optimal decision at the "run the test" node, and the difference between the branch node values is the absolute value of the VoI minus the test cost.

The problem with this model is that it doesn't necessarily give you the value of INFORMATION. Making the 'get info' node a choice point on the tree essentially allows arbitrary changes between the with info and without info branches of the tree. In other words it's not clear we are finding the value of information and not some other result of this choice.

That is why I choose to phrase my model in terms of getting to look at otherwise hidden results of nature nodes.

## ↑ comment by TruePath · 2011-12-20T08:48:18.955Z · LW(p) · GW(p)

Here is a nice formal model.

Let T be a tree of height w and T' a subset of T regarded as the set of choice nodes in T. Now given a function f on T' assume there is a function P giving a probability measure P_f on the set of paths through T' coherent with f for each f. Further assume there is a function u taking paths through T to utilities and define U_P(f) to be the integral of u with respect to P_f over all paths coherent with f and U(f|\sigma) to be the integral of u with respect to P_f over all paths extending \sigma coherent with f. Now either assume that each choice node has finitely many successors and U(f|\sigma) is always finite. By compactness we can always achieve an upper bound. We now define the value of information of some set S of non-choice nodes in T (i.e. the benefit of getting to look at those nodes in making future decisions).

Let E(T) (the informed maximum) be the max of U(f) for f a function on T' and let E(T/S) be the max of U(f) over those functions f satisfying f(\sigma)=f(\tau) if {x|\sigma(x) \neq \tau(x) } is a subset of S, i.e. those functions unaware of the result of nodes in S.

The value of getting to look at the results of nodes in S is thus E(T) - E(T/S). Note that to match the concept in the article S should be an initial segment of T (i.e. you get all that info immediately while our model includes the value of buying info that will only be revealed later).

Replies from: Johnicholas

## ↑ comment by Johnicholas · 2011-12-20T20:47:49.940Z · LW(p) · GW(p)

Interesting..

So the original tree T models a tree of "states of affairs", and the original partition or subset T' models each node being either under the decision-maker's control or not. Then the function f would go from elements of T' to successors of those elements - nodes of T, to be sure, but there's a side condition there. Then the probability measure P is a somewhat more powerful way of attaching probabilities to the non-choice nodes - that is, if you have a distribution over the successors of each non-choice node, then you can obtain a probability measure, but a probability measure over paths would allow some additional correlations. The function u can be understood (in a finite tree) as labelling leaves with utilities, because paths of maximum length in a finite tree are isomorphic with leaves - but by describing it the way you did, you leave the door open to applying this formalism to an infinite tree. U_P(f) would be the utility of a particular strategy (f), and U(f|\sigma) would be... the utility given a certain initial sequence of events? So \sigma is a finite path segment?

I don't understand the grammar of "Now either .... and ..." - should it be "Now either ... or ..."? Or is it really "Now assume ... and ..."?

When you use U(f) later, I am guessing that's either U_P(f) with the P elided, or U(f | the empty path segment) - regardless, we're going to have to fix a P in order to get a utility, right?

Then the phrase "f(sigma)=f(tau) if the set of x such that \sigma(x) is not equal to tau(x) is a subset of S" - If I understand correctly, sigma and tau are finite path segments, which are isomorphic to nodes in the tree - but are they functions from nodes? If they are functions, wouldn't they be from, say, initial sections of the integers 0...n TO nodes? If I understand correctly, they're going to diverge at at most one point - once diverged, since it's a tree, they're not going to be able to rejoin. Or were you saying tree and thinking 'dag'?

I worry about the types of these things; coding it up in something like Haskell or OCaml might make everything sharper and perhaps suggest simplifications. I'm sure that you can carry through the basic intuition.

Thanks for thinking about this.