A Proper Scoring Rule for Confidence Intervals

post by Scott Garrabrant · 2018-02-13T01:45:06.341Z · LW · GW · 47 comments

Contents

47 comments

You probably already know that you can incentivize honest reporting of probabilities using a proper scoring rule like log score, but did you know that you can also incentivize honest reporting of confidence intervals?

To incentivize reporting of a 90% confidence interval, take the score −S − 20D, where S is the size of your confidence interval, and D is the distance between the true value and the interval. D is 0 whenever the true value is in the interval.

This incentivizes not only giving an interval that has the true value 90% of the time, but also distributes the remaining 10% equally between overestimates and underestimates.

To keep the lower bound of the interval important, I recommend measuring S and D in log space. So if the true value is T and the interval is (A, B), then S is ln(B) − ln(A), and D is ln(T) − ln(B) for underestimates and ln(A) − ln(T) for overestimates. Of course, you need questions with positive answers to do this.

To do a 50% confidence interval, take the score −S − 4D.
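For concreteness, here is a minimal sketch of the rule in Python (my own illustration; the function names are arbitrary):

```python
import math

def interval_score(lo, hi, true_value, coefficient=20.0):
    """Score an interval: -S - 20*D for a 90% confidence interval."""
    size = hi - lo  # S: the size of the interval
    # D: distance from the true value to the interval; 0 if the value is inside
    dist = max(0.0, lo - true_value, true_value - hi)
    return -size - coefficient * dist

def log_interval_score(lo, hi, true_value, coefficient=20.0):
    """The same rule with S and D measured in log space (positive answers only)."""
    return interval_score(math.log(lo), math.log(hi),
                          math.log(true_value), coefficient)
```

The coefficient is 1 over the tail probability on each side: 20 for a 90% interval (5% in each tail), 4 for a 50% interval, and 40 for a 95% interval.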

This can be used to make calibration training, using something like Wits and Wagers cards, more fun. I also think it could be turned into an app, if one could get a large list of questions with numerical answers.

47 comments

Comments sorted by top scores.

comment by Scott Garrabrant · 2018-02-13T01:57:14.687Z · LW(p) · GW(p)

EDIT: I originally said you can do this for multiple choice questions, which is wrong. It only works for questions with two answers.

(In a comment, to keep top level post short.)

One cute way to do calibration for probabilities is to construct a spinner. If you have a true/false question, you can construct a spinner which is divided up according to your probability that each answer is the correct answer.

If you were to then spin the spinner once, and win if it comes up on the correct answer, this would not incentivize constructing the spinner to represent your true beliefs. The best strategy is to put all the mass on the most likely answer.

However, if you spin the spinner twice, and win if either spin lands on the correct answer, you are actually incentivized to make the spinner match your true probabilities!

One reason this game is nice is that it does not require having a correctly specified utility function that you are trying to maximize in expectation. There are only two states, win and lose, and as long as winning is preferred to losing, you should construct your spinner with your true probabilities.

Unfortunately this doesn't work for the confidence intervals, since they seem to require a score that is not bounded below.
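(A quick Monte Carlo sketch of the two-spin game; my own illustration rather than anything from the original comment:)

```python
import random

def win_prob(q, p_true, trials=200_000):
    """Estimate P(win): the spinner lands 'true' with probability q and is
    spun twice; the answer really is 'true' with probability p_true."""
    wins = 0
    for _ in range(trials):
        answer = random.random() < p_true
        spins = (random.random() < q, random.random() < q)
        wins += answer in spins  # win if either spin matches the answer
    return wins / trials

# With p_true = 0.7 the win probability peaks near q = 0.7, not at q = 1:
for q in (0.5, 0.7, 0.9, 1.0):
    print(q, round(win_prob(q, 0.7), 3))  # ~0.75, ~0.79, ~0.75, ~0.70
```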

Replies from: SilentCal, Benito
comment by SilentCal · 2018-02-15T16:53:30.663Z · LW(p) · GW(p)

Two spins only works for two possible answers. Do you need N spins for N answers?

Replies from: Scott Garrabrant
comment by Scott Garrabrant · 2018-02-15T18:16:38.140Z · LW(p) · GW(p)

You are correct. It doesn't work for more than two answers. I knew that when I thought about this before, but forgot. Corrected above.

I don't have a nice algorithm for N answers. I tried a bunch of the obvious simple things, and they don't work.

Replies from: neil-fitzgerald
comment by Neil Fitzgerald (neil-fitzgerald) · 2018-02-19T13:18:29.862Z · LW(p) · GW(p)

I think an algorithm for N outcomes is: spin twice, gain 1 every time you get the answer right but lose 1 if both guesses are the same.

One can "see intuitively" why it works: when we increase the spinner-probability of outcome i by a small delta (imagining that all other probabilities stay fixed, and not worrying about the fact that our sum of probabilities is now 1 + delta) then the spinner-probability of getting the same outcome twice goes up by 2 x delta x p[i]. However, on each spin we get the right answer delta x q[i] more of the time, where q[i] is the true probability of outcome i. Since we're spinning twice we get the right answer 2 x delta x q[i] more often. These cancel out if and only if p[i] = q[i]. [Obviously some work would need to be done to turn that into a proof...]
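In expectation Neil's rule gives 2 x (sum of q[i]p[i]) minus (sum of p[i]^2), which is maximized at p = q. A small sketch of this check (my own, under the rule as stated above):

```python
def expected_score(p, q):
    """Neil's rule in expectation: spin twice from spinner distribution p
    while the truth follows q; gain 1 per spin matching the truth, lose 1
    if the two spins coincide."""
    match_gain = 2 * sum(pi * qi for pi, qi in zip(p, q))
    repeat_loss = sum(pi * pi for pi in p)  # P(both spins land the same)
    return match_gain - repeat_loss

q = [0.5, 0.3, 0.2]
print(expected_score(q, q))          # honest spinner: 0.38
print(expected_score([1, 0, 0], q))  # all mass on the mode: 0.0, worse
```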

Replies from: gjm
comment by gjm · 2018-02-19T15:18:05.467Z · LW(p) · GW(p)

Just to be clear: if you spin twice and both come up right, you're gaining 2 and then losing 1? (I.e., this is equivalent to what you wrote in an earlier version of the comment?)

Replies from: neil-fitzgerald
comment by Ben Pace (Benito) · 2018-02-13T08:00:29.965Z · LW(p) · GW(p)

(Why does the two-spin version work?)

Replies from: Scott Garrabrant
comment by Scott Garrabrant · 2018-02-13T08:15:52.156Z · LW(p) · GW(p)

In a true/false question that is true with probability p, if you assign probability q, your probability of losing is p(1−q)^2 + (1−p)q^2. (The probability the answer is true and you spin false twice, plus the probability the answer is false and you spin true twice.)

This probability is minimized when its derivative with respect to q is 0, or at the boundary. This derivative is −2p(1−q) + 2(1−p)q, which is 0 when q = p. We now know the minimum is achieved when q is 0, 1, or p. The probability of losing when q = 0 is p. The probability of losing when q = 1 is 1−p. The probability of losing when q = p is p(1−p), which is the lowest of the three options.

Replies from: Scott Garrabrant, Scott Garrabrant
comment by Scott Garrabrant · 2018-02-13T08:45:34.104Z · LW(p) · GW(p)

Copied without LaTeX:

In a true/false question that is true with probability p, if you assign probability q, your probability of losing is p(1−q)^2+(1−p)q^2. (The probability the answer is true and you spin false twice plus the probability the answer is false and you spin true twice.)

This probability is minimized when its derivative with respect to q is 0, or at the boundary. This derivative is −2p(1−q)+2(1−p)q, which is 0 when q=p. We now know the minimum is achieved when q is 0, 1, or p. The probability of losing when q=0 is p. The probability of losing when q=1 is 1−p. The probability of losing when q=p is p(1−p), which is the lowest of the three options.
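(A quick symbolic check of the derivative above; my own addition, using sympy:)

```python
import sympy as sp

p, q = sp.symbols("p q")
losing = p * (1 - q)**2 + (1 - p) * q**2
print(sp.solve(sp.diff(losing, q), q))  # [p]: the interior minimum is at q = p
print(sp.expand(losing.subs(q, p)))     # -p**2 + p, i.e. p*(1 - p)
```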

Replies from: Qiaochu_Yuan
comment by Qiaochu_Yuan · 2018-02-13T17:09:53.426Z · LW(p) · GW(p)

This is called either Brier or quadratic scoring, not sure which.

Replies from: kevin-s-van-horn
comment by Kevin S. Van Horn (kevin-s-van-horn) · 2018-02-15T18:24:46.455Z · LW(p) · GW(p)

Not exactly. Its expected value is the same as the expected value of the Brier score, but the score itself is either 0 or 1.

comment by Scott Garrabrant · 2018-02-13T08:29:02.646Z · LW(p) · GW(p)

For some reason, the latex is not rendering for me. I can see it when I edit the comment, but not otherwise.

Replies from: Benito
comment by Ben Pace (Benito) · 2018-02-13T08:55:13.629Z · LW(p) · GW(p)

The comment has just started rendering for me.

Edit: Oh wait no, you just added another comment without LaTeX.

Replies from: habryka4
comment by habryka (habryka4) · 2018-02-13T08:58:02.174Z · LW(p) · GW(p)

Huh, that’s really weird. The server must somehow be choking on the specific LaTeX you posted. Will check it out.

Replies from: habryka4
comment by habryka (habryka4) · 2018-02-13T09:04:27.923Z · LW(p) · GW(p)

Ok, I found the bug. I will fix it in the morning.

Replies from: Kaj_Sotala
comment by Kaj_Sotala · 2018-02-14T16:57:24.951Z · LW(p) · GW(p)

And you did! Cheers for your hard work. :)

comment by Blake Riley (blake-riley) · 2018-02-14T11:59:12.864Z · LW(p) · GW(p)

This is an underappreciated fact! I like how simple the rule is when framed in terms of size and distance.

You mention both the linear and log rules. The log rule has the benefit of being scale-invariant, so your score isn't affected by the units the answer is measured in, but it can't deal with negatives and gets overly sensitive around zero. The linear rule doesn't blow up around zero, is shift-invariant, and can handle negative values fine. The best generic scoring rule would have all these properties.

Turns out (based on Lambert and Shoham, "Eliciting truthful answers to multiple choice questions") that all scoring rules for symmetric confidence intervals with coverage probability c can be represented (up to affine transformation) as

S(a, b; x) = −((1−c)/2)(g(b) − g(a)) − 1[x < a](g(a) − g(x)) − 1[x > b](g(x) − g(b)),

where x is the true value, (a, b) is the reported interval, 1[·] is the indicator function, and g is any increasing function. Unsurprisingly, the linear rule uses g(x) = x and the log rule uses g(x) = ln(x). If we want scale-invariance on the whole real line, the first thing I'd be tempted to do is use ln(x) for positive x and −ln(−x) for negative x, except for that pesky bit about going off to −∞ around zero. Let's paste in a linear portion around zero so the function is increasing everywhere:

g(x) = 1 + ln(x) for x ≥ 1,  g(x) = x for −1 < x < 1,  g(x) = −1 − ln(−x) for x ≤ −1.
Using this g, the score is sensitive to absolute values around zero and sensitive to relative values on both sides of it. Since the rule expects more accuracy around zero, the origin should vary depending on question domain. Like if the question is about dates, accuracy should be highest around the present year and get less accurate going into the past or future. That suggests we should set the origin at the present year. For temperatures, the origin should probably be room temperature. Are there any other standard domains that should have a non-zero origin? An alternate origin x0 can be added as a shift everywhere:

use g(x − x0) in place of g(x).
Not something you'd want to calculate by hand, but if someone implements a calibration app, this has more consistent scores. Going one step further, the scores could be made more interpretable by comparison to a perfectly calibrated reference score: report the gap between your score and the expected score for perfectly calibrated intervals (under some reference assumption), plus a fixed value chosen to keep plausible scores mostly positive.
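For instance, a sketch of what an app might compute (my own rendering; the piecewise g and the defaults are the assumptions described above):

```python
import math

def g(x):
    """Increasing everywhere: linear on (-1, 1), logarithmic beyond."""
    if x >= 1:
        return 1 + math.log(x)
    if x <= -1:
        return -1 - math.log(-x)
    return x

def general_score(lo, hi, x, coverage=0.9, origin=0.0):
    """Interval score measured in g-space, with an optional shifted origin."""
    glo, ghi, gx = g(lo - origin), g(hi - origin), g(x - origin)
    coeff = 2 / (1 - coverage)  # 20 for a 90% interval, 40 for 95%
    dist = max(0.0, glo - gx, gx - ghi)  # distance to the interval in g-space
    return -(ghi - glo) - coeff * dist
```

With g the identity this reduces to the linear rule, and away from the linear patch it reproduces the log rule.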

comment by lukeprog · 2018-08-29T17:48:22.643Z · LW(p) · GW(p)

This scoring rule has some downsides from a usability standpoint. See Greenberg 2018, a whitepaper prepared as background material for a (forthcoming) calibration training app.

comment by StefanDeYoung · 2018-02-13T15:47:53.002Z · LW(p) · GW(p)

I need help figuring out how to use this scoring rule. Please consider the following application.

How much does it cost to mail a letter under 30g in Canada?

I remember when I was a child buying 45c stamps, so it's likely to be larger than that. It's been over a decade or so, and assuming a 2% rise in cost per year, we should be around 60c per stamp. However, we also had big budget cuts to our postal service that even I learned about despite not reading the news. Let's say that Canada Post increased their prices by 25% to accommodate some shortfall. My estimate is that stamps cost 75c.

What should be my confidence interval? Would I be surprised if a stamp cost a dollar? Not really, but it feels like an upper bound. Would I be surprised if a stamp cost less than 50c? Yes. 60c? Yes. 70c? Hmmm.... Assume that I'm well calibrated, so I'm reporting 90% confidence for an interval of stamps costing 70c to 100c.

Answer: Stamps in booklets cost 85c each, individual stamps are 100c each. Because I would always buy stamps in booklets, I will use the 85c figure.

S is the size of my confidence interval, 100c − 70c = 30c. D is the distance between the true value and the interval, but is 0 in this case because the true value is in the interval. So my score is −S − 20D = −30.

I'm not really sure what to do with this number, so let's move to the next paragraph of the post.

The true value is 85c and the interval is (70c, 100c). Because the true value is contained in the interval, D = 0, and the score is −S = −(ln 100 − ln 70) ≈ −0.357.

How does this incentivise honest reporting of confidence intervals?

Let's say that, when I intuited my confidence interval above, I was perturbed that it wasn't symmetric about my estimate of 75c, so I set it to (50c, 100c) for aesthetic reasons. In this case, my score would be −(ln 100 − ln 50) = −ln 2 ≈ −0.693, which is worse than my previous score by a factor of 2.

Let's say that, when I remembered the price of stamps in my childhood, I was way off and remembered 14c stamps. Then I would believe that stamps should cost around 22c now. (Here I have the feeling of "nothing costs less than a quarter!", so I would probably reject this estimate.) That would likely anchor me, so that I would set a high confidence on the price being within (20c, 25c), giving a score of

−(ln 25 − ln 20) − 20(ln 25 − ln 85) ≈ −0.22 + 24.48 ≈ +24.3.

Am I trying to maximize this score?

I looked up the answer, and the lowest cost standard delivery is 85c for letters under 30g.

Replies from: Scott Garrabrant, Qiaochu_Yuan, kulya-botaniki, sil-ver
comment by Scott Garrabrant · 2018-02-14T00:31:25.099Z · LW(p) · GW(p)

I messed up, and swapped the words overestimate and underestimate in the 4th paragraph. I fixed it now. Score should always be negative.

This will change the value at the end to ln(85/25), or about 1.22, making the score −0.22 − 20(1.22) ≈ −24.7.

This score is a very negative number, so you get punished for having a bad interval, relative to the above.

comment by Qiaochu_Yuan · 2018-02-14T00:27:15.857Z · LW(p) · GW(p)

The idea is that the two terms in the score balance between two effects: trying to make S as small as possible means making your interval as small as possible, but if you make it too small you're more likely to use an interval which doesn't contain the truth. Trying to make D as small as possible means making your interval more likely to contain the truth. The coefficients balance the tradeoff between the two so that the interval you end up with is your 90% confidence interval. (According to Scott; I haven't verified this personally.)

Replies from: gjm, StefanDeYoung
comment by gjm · 2018-02-14T01:18:28.463Z · LW(p) · GW(p)

I have verified it. I was in the process of writing a (fairly lengthy) reply to Stefan's comment, including a proof that Scott's scoring rule does indeed have the property that your expected score (according to your actual beliefs about the quantity you're estimating) is maximized when the confidence interval you state has (again according to your actual beliefs) a 5% chance that the quantity lies below its lower bound and a 5% chance that the quantity lies above its upper bound ... but then something I did (I have no inkling what, though it coincided with some combination of keypresses as I was trying to enter some mathematics) made the page go entirely blank, and I didn't find any way to get my partially-written comment back again.

Anyway, here's one way (I don't guarantee it's best and it feels like there should be a slicker way) to prove it. Let's suppose the confidence interval you state is (l,r); consider the derivative w.r.t. either of those bounds -- let's say r, but l is similar -- of your expected score. The first term in the score is just l-r, and the derivative of that is always -1. The second term can be written as an integral; differentiating it w.r.t. r turns out to give you 20Pr(X>r). (The calculation is easy.) So the derivative is zero only when 1-20Pr(X>r)=0; that is, when Pr(X>r)=5%. So if the confidence interval you state doesn't have the property that you expect to be above it exactly 5% of the time, then this derivative is nonzero and therefore some small change in r increases your expected score.

Replies from: Dr_Manhattan
comment by Dr_Manhattan · 2018-02-14T18:42:56.766Z · LW(p) · GW(p)

would you mind spelling out the integral part?

Replies from: gjm
comment by gjm · 2018-02-15T14:44:20.347Z · LW(p) · GW(p)

Suppose f is your probability density function for the quantity X you're interested in.

Then the expectation of D is the integral of D(x)f(x), which equals the integral of [max(0,l-x)+max(0,x-r)]f(x). When we differentiate w.r.t. r, the first term obviously goes away because it's independent of r, so we get the integral of [d/dr max(0,x-r)] f(x). That derivative is 0 for x<r and -1 for x>r, so this is minus the integral of f(x) from r upwards; in other words it's -Pr(X>r). Since the score is (l-r)-20D, d(score)/dr = -1+20Pr(X>r), which vanishes exactly when Pr(X>r)=5%.

The calculation for l is exactly the same but with a change of sign; we end up with d(score)/dl = 1-20Pr(X<l), which vanishes when Pr(X<l)=5%.
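One can check this numerically as well (my own sketch): with a standard normal belief, the 5th-to-95th-percentile interval (about ±1.645) beats both wider and narrower symmetric intervals.

```python
import random

def expected_score(lo, hi, samples, coeff=20.0):
    """Monte Carlo estimate of E[-S - 20*D] under the belief samples."""
    total = 0.0
    for x in samples:
        total += -(hi - lo) - coeff * max(0.0, lo - x, x - hi)
    return total / len(samples)

random.seed(0)
beliefs = [random.gauss(0.0, 1.0) for _ in range(200_000)]
for a in (1.2, 1.645, 2.0):
    print(a, round(expected_score(-a, a, beliefs), 3))  # best near a = 1.645
```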

comment by StefanDeYoung · 2018-02-15T15:17:39.379Z · LW(p) · GW(p)

Thanks for this reply. The technique of asking what each term of your equation represents is one I have not practiced in some time.

This answer very much helped me to understand the model.

comment by Куля Ботаніки (kulya-botaniki) · 2018-02-14T15:17:18.676Z · LW(p) · GW(p)

Thank you for providing an example!

Replies from: StefanDeYoung
comment by StefanDeYoung · 2018-02-15T15:13:11.307Z · LW(p) · GW(p)

You're welcome. Something that I'm trying to improve about how I engage with LessWrong is writing out either a summary of the article (without re-referring to the article) or an explicit example of the concept in the article. My hope is that this will help me to actually grok what we're discussing.

comment by Rafael Harth (sil-ver) · 2018-02-13T22:30:10.460Z · LW(p) · GW(p)

I get a dozen 'refresh to render LaTeX's here (but refreshing doesn't fix it).

Replies from: habryka4
comment by habryka (habryka4) · 2018-02-13T22:53:28.068Z · LW(p) · GW(p)

Just wrapped up the fix for this. Pushing the fix in the next few minutes.

Replies from: habryka4
comment by habryka (habryka4) · 2018-02-13T23:35:00.388Z · LW(p) · GW(p)

Fixed! Sorry for the inconvenience!

comment by habryka (habryka4) · 2019-11-24T18:22:06.115Z · LW(p) · GW(p)

This is one of those simple insights that I think are actually important for a large variety of forecasting-related tasks, and that turn out to be a large pain to rederive for yourself.

comment by philip-bl · 2018-02-15T23:54:08.793Z · LW(p) · GW(p)

Is there a way to incentivize reporting a true probability distribution? Say Bob wants Alice to provide her probability distribution over the IQ score she'll get on a test. He is willing to give her a real number as a reward, and he wants to hear her probability distribution over her result. What should he do?

It would be nice if it worked for both discrete and non-discrete probability spaces.

Replies from: Qiaochu_Yuan
comment by Qiaochu_Yuan · 2018-02-16T00:24:18.761Z · LW(p) · GW(p)

In the discrete case log scoring still works; it generalizes past the binary case.

That is, if Ω is the set of possible outcomes of the test, Bob elicits from Alice a probability distribution P on Ω, then Alice takes the test and gets some outcome ω, and then Bob rewards Alice with log P(ω). (This number is unfortunately always negative; you can add a positive constant to it if you want.)

Alice's expected payoff according to her true probability distribution Q is

Σ_{ω ∈ Ω} Q(ω) log P(ω),

also known as the (negative of the) cross entropy between Q and P. And you can do a computation, e.g. with Lagrange multipliers, which will verify that for fixed Q, the optimal value of P is P = Q. I do this calculation in this blog post.
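(A small numerical illustration of this, my own; the outcome labels are arbitrary:)

```python
import math

def expected_payoff(report, true_dist):
    """Negative cross entropy: sum of Q(w) * log P(w) over outcomes w."""
    return sum(q * math.log(report[w]) for w, q in true_dist.items())

true_dist = {"A": 0.6, "B": 0.3, "C": 0.1}
print(expected_payoff(true_dist, true_dist))                       # ~ -0.898
print(expected_payoff({"A": 0.8, "B": 0.1, "C": 0.1}, true_dist))  # ~ -1.055
```

Any report other than the true distribution lowers the expected payoff, which is the strict propriety Qiaochu describes.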

A test isn't a good example to use because the outcome of the test is under Alice's control, so she can e.g. throw the test and predict this fact. This procedure is best used to elicit Alice's prediction of something which she cannot influence in any way.

Replies from: habryka4
comment by habryka (habryka4) · 2018-02-16T00:59:50.136Z · LW(p) · GW(p)

How did using LaTeX fail?

Replies from: Qiaochu_Yuan
comment by Qiaochu_Yuan · 2018-02-16T01:42:56.446Z · LW(p) · GW(p)

I tried starting with a dollar sign, which brought up a yellow prompt that I couldn't figure out how to easily exit; hitting Enter just started a new line in the prompt. The only way I've found to exit it so far is Ctrl + Enter, which submits the comment with the LaTeX displaying as "refresh to display LaTeX," and continuing to display that after I refresh.

Replies from: habryka4
comment by habryka (habryka4) · 2018-02-16T02:18:34.693Z · LW(p) · GW(p)

Ah, you exit the yellow prompt with Esc and in the yellow prompt you can type any LaTeX, with a live-preview beneath it.

Somewhat surprised that it continued to show "refresh to display LaTeX' even after you refreshed. I never had that happen to me. That might have been a result of you submitting from the inside of the prompt, which I can imagine causing errors.

Replies from: Qiaochu_Yuan
comment by Qiaochu_Yuan · 2018-02-16T02:31:00.858Z · LW(p) · GW(p)

Awesome, everything's fine now.

comment by Ben Pace (Benito) · 2018-02-15T11:42:42.273Z · LW(p) · GW(p)

Incentivising accurate probabilistic predictions is central to any art of rationality, and this post gives a significant part of that in a way that's super readable, so I've curated it.

(Also, nice move adding extra points in the comments.)

comment by ChristianKl · 2018-02-21T21:18:07.062Z · LW(p) · GW(p)

Do you have some argument that your proposed formulas are optimal?

Replies from: Scott Garrabrant
comment by Scott Garrabrant · 2018-02-21T22:14:39.742Z · LW(p) · GW(p)

What do you mean by optimal?

If you mean they are proper (i.e. incentivize honest reporting), gjm's comment gives a quick sketch of a proof.

comment by tunesmith · 2021-08-28T23:58:53.316Z · LW(p) · GW(p)

Is there a way to adjust this to support better scores for tighter confidence intervals?

For instance, using natural log, with a range of 8-10 and a true value of 10, I get -0.2231 whether I pick a 90% confidence interval (coefficient of 20) or a 95% confidence interval (coefficient of 40). It'd be nice if the latter scored better.

comment by Kevin S. Van Horn (kevin-s-van-horn) · 2018-02-15T00:04:11.976Z · LW(p) · GW(p)

[Edit: I'm retracting this comment, as I made some incorrect assumptions about Scott's claim.] This is wrong. It is well known that the only strictly proper scoring rule that depends only on the probability at the actually occurring value is the logarithmic scoring rule (if there are more than two alternatives), or translations and/or positive scaling of the same. In this case, that would be log(Normal(x | mu, sigma)), where x is the value that occurs, and mu and sigma^2 are the mean and variance of the normal distribution that fits the interval you defined at the given confidence level. This may be simplified to

-log(sigma^2) - (x - mu)^2 / sigma^2.

Your scoring rule is not a translation and/or positive scaling of the logarithmic scoring rule.

Replies from: Scott Garrabrant, Scott Garrabrant
comment by Scott Garrabrant · 2018-02-15T00:45:45.527Z · LW(p) · GW(p)

Throwing out an attempt to resolve the disagreement; sorry if this is actually what we are disagreeing about:

Am I unknowingly using words that imply that I care about normal distributions? I am imagining getting honest reporting out of an agent trying to maximize expected score, but with arbitrary beliefs. I am only trying to get an honest report of the subjective 5th and 95th percentiles, and am not trying to get any other information.

Replies from: kevin-s-van-horn
comment by Kevin S. Van Horn (kevin-s-van-horn) · 2018-02-15T23:07:47.858Z · LW(p) · GW(p)

I'm used to seeing normal (or log-normal) distributions fit to subjective confidence intervals -- because the confidence intervals are being used to do some subjective probabilistic analysis. I assumed that was what you were doing, given that you were using the actual attained value x, and not just which of the three possibilities A:(x < left), B:(left < x < right), and C:(right < x) occurred.

Hmmm... you seem to have evaded the theorem about the only strictly proper local scoring rule being the logarithmic score, by only seeking to find the confidence interval, but using more information than just the region (A, B, or C) the outcome belongs to.

It would help to see a proof of the claim; do you have a reference or a link to a URL giving the proof?

Replies from: Scott Garrabrant
comment by Scott Garrabrant · 2018-02-15T23:15:46.488Z · LW(p) · GW(p)

I don't have a reference. gjm's comment gives a quick sketch.

Replies from: Scott Garrabrant
comment by Scott Garrabrant · 2018-02-15T23:22:27.190Z · LW(p) · GW(p)

Oh, a quick thing that's not a proof but may convince you it is true:

It works exactly the same way as saying that measuring the distance between the reported value and the true value incentivizes honest reporting of your median (the point you think the true value is above with probability 50%).

comment by Scott Garrabrant · 2018-02-15T00:38:34.432Z · LW(p) · GW(p)

This scoring rule does not depend only on the probability at the actually occurring value. You don't even report the probability at any value. I am not trying to incentivize reporting of probabilities of specific values; I am trying to incentivize reporting an interval such that the person reporting the belief believes the true value will lie in it with probability 90%.

Your rule seems to be trying to do something else, but it will not incentivize me to give my 90% confidence interval in cases where my beliefs are not normally distributed.