# Calibration of a thousand predictions

post by KatjaGrace · 2022-10-12T08:50:16.768Z · LW · GW · 7 comments


I’ve been making predictions in a spreadsheet for the last four years, and I recently got to a thousand resolved predictions. Some observations:

1. I’m surprisingly well calibrated for things that mostly aren’t my own behavior[1]. Here’s the calibration curve for 630 resolved predictions in that class:

I don’t know what’s up with the 80% category, but the average miscalibration of the eleven categories is <3%.
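For anyone who wants to build the same kind of curve from their own spreadsheet, here is a minimal sketch. It assumes each prediction is a `(probability, outcome)` pair, that the eleven buckets are centred on 0%, 10%, ..., 100%, and that "average miscalibration" means the unweighted mean of the absolute gaps between stated and observed frequency. Those choices, the function names, and the sample data are all illustrative assumptions, not necessarily how the figures in this post were produced.

```python
from collections import defaultdict


def calibration_table(predictions, n_buckets=11):
    """Group (probability, outcome) pairs into evenly spaced buckets.

    With n_buckets=11 the bucket centres are 0%, 10%, ..., 100%.
    Returns {bucket_index: (count, mean_stated, observed_frequency)}.
    """
    groups = defaultdict(list)
    for p, happened in predictions:
        # Round to the nearest bucket centre (assumed bucketing rule).
        index = min(n_buckets - 1, int(p * (n_buckets - 1) + 0.5))
        groups[index].append((p, happened))

    table = {}
    for index, items in sorted(groups.items()):
        count = len(items)
        mean_stated = sum(p for p, _ in items) / count
        observed = sum(1 for _, happened in items if happened) / count
        table[index] = (count, mean_stated, observed)
    return table


def mean_abs_miscalibration(table):
    """Unweighted mean of |observed - stated| over the non-empty buckets."""
    gaps = [abs(observed - stated) for _, stated, observed in table.values()]
    return sum(gaps) / len(gaps)


# Hypothetical sample data, not taken from the real spreadsheet.
sample = [
    (0.1, False), (0.1, False),
    (0.4, True), (0.4, False),
    (0.7, True), (0.7, True),
    (0.9, True),
]
table = calibration_table(sample)
average_gap = mean_abs_miscalibration(table)
```

Weighting each bucket by its count would be an equally defensible summary; the unweighted version is used here only because the text talks about averaging over the buckets themselves.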

At risk of bragging, this seems wild to me. My experience of making these predictions is fairly well described as ‘pulling a number out of thin air’[2]. But apparently if you take all these conjured numbers, and look at the 45 of them that fell in the vicinity of 40%, then I implicitly guessed that 17.28 of those events would happen. And in fact 18 of them happened. WTF? Why wasn’t it eight of them or thirty-five of them? And that was only the fifth most accurate of the eleven buckets shown above! For predictions in the vicinity of 70%, I was off by 0.15%: I said 54.88 of those 80 things would happen, and in fact 55 of them happened.
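The arithmetic for those two buckets can be checked in a few lines. This is only a sketch: the helper name `bucket_gap` is made up for illustration, and the inputs are the counts quoted in the paragraph above.

```python
def bucket_gap(expected_hits, actual_hits, n):
    """Return (mean stated probability, observed frequency, gap in percentage points)."""
    stated = expected_hits / n      # what the predictions implied on average
    observed = actual_hits / n      # what actually happened
    return stated, observed, 100 * (observed - stated)


# Bucket near 40%: 45 predictions, 17.28 expected, 18 happened.
stated_40, observed_40, gap_40 = bucket_gap(17.28, 18, 45)

# Bucket near 70%: 80 predictions, 54.88 expected, 55 happened.
stated_70, observed_70, gap_70 = bucket_gap(54.88, 55, 80)

print(f"~40% bucket: stated {stated_40:.3f}, observed {observed_40:.3f}, gap {gap_40:.2f} points")
print(f"~70% bucket: stated {stated_70:.3f}, observed {observed_70:.3f}, gap {gap_70:.2f} points")
```

The gap for the 70% bucket comes out at about 0.15 percentage points, matching the figure in the text, and the 40% bucket is off by about 1.6 points.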

Possibly people overall are just better calibrated than I thought. I had some remembered view that people’s asserted 90% confidence intervals were really 50% confidence intervals or something, but I can’t immediately find such evidence now, and I can find various graphs of groups of people being on average fairly calibrated. And the handful of PredictionBook users I could find with more than a thousand predictions are not hugely worse.

If you are curious about what I predicted, I put examples at the end of this post.

2. For the entire thousand predictions (the above plus 370 about my own behavior), I’m off by 6.25% on average (up from 2.95%) over the same eleven buckets.

3. As you may infer, I’m pretty bad overall at predicting my own behavior!

This is more what I expected of a calibration curve—broadly overconfident. And perhaps its general worseness is explained by the appeal of optimism in predicting oneself. But it’s a pretty weird shape, which seems less explicable. If I think there’s a 40% chance that I’ll do something, apparently it’s not happening. If you want it to happen, you should hope I change my mind and put 5% on it!

I’m not sure what is up with this particular strange shape. But note that making predictions about one’s own behavior has a particular complication, if one likes to be right. If you put a number below 50% on taking an action, then you have a disincentive to doing it. So you should then put a lower probability on it than you would have, which would make you even more wrong if you took the action, so you have a further disincentive to doing it, etc. I do generally look for a fixed point where, given that I put probability p on something (and the incentive consequences of that), I do think it will happen with probability p. But this is a different process than the usual predicting process, and I could imagine it going wrong in strange ways. For instance, if I’m more motivated by being right than I thought, then 40% predictions which might have been 50% predictions originally should really be 5% predictions. This theory doesn’t really work though, because then shouldn’t the lower categories also be especially overconfident? Whereas in fact they are okay.
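To make the fixed-point idea concrete, here is a toy sketch. It assumes, purely for illustration, that the real chance of acting is `base + pull * (p - 0.5)`, where `base` is the chance of acting with no prediction on record, `p` is the stated probability, and `pull` measures how much wanting to be right shifts behavior. The linear form and both parameter names are hypothetical, not a model taken from the spreadsheet.

```python
def fixed_point_probability(base, pull, steps=200):
    """Iterate p -> base + pull * (p - 0.5) until it settles.

    base: chance of acting if no prediction had been stated (assumed).
    pull: strength of the incentive to match the stated number (assumed, 0 <= pull < 1).
    """
    p = base
    for _ in range(steps):
        # Stating p below 50% lowers the real chance, which lowers p again, and so on.
        p = min(1.0, max(0.0, base + pull * (p - 0.5)))
    return p


mild = fixed_point_probability(base=0.4, pull=0.5)     # settles near 0.30
strong = fixed_point_probability(base=0.4, pull=0.75)  # settles near 0.10
```

In this toy model an underestimated `pull` means the stated number sits well above the true fixed point, which is the kind of error described above. It does not explain why the lowest buckets look fine.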

(Maybe I just have free will? The kind of free will that manifests as being about 15% less likely to do anything than one might have expected seems disappointing, but the menu of possible free will options was never that inspiring.)

## Example predictions

Here are some typical predictions, arbitrarily pulled from my spreadsheet and lightly edited:

• I will be invited to play ONUW today: 0.45 (true)
• The trial bank transfers come through to my IBKR account by the end of Monday: 0.34 (false)
• [Friend] will want to leave here for the day before I do: 0.05 (false)
• [Friend] will seem notably sad in demeanor when I talk to [them]: 0.6 (false)
• I will be paid by end Jan 27: 0.85 (true)
• If I go inside shortly I see [friend]: 0.08 (false)
• We have the [organization] party here: 0.55 (true)
• I go to math party today: 0.88 (true)
• I will get my period on Tuesday: 0.10 (true)
• We will be invited to work at [office] for at least a week: 0.75 (false)
• On Feb 5 we (including [friend], me, [friend]) are renting a new place: 0.73 (true)
• We will run the arts and crafts room auction today (i.e. by midnight we will have as much info from the auction as we will get re which room is whos, ignoring processing of info we have): 0.40 (true)
• [Person] is leading a new EAish org or CEO or COO of an existing EAish org by May 30 2023, where EAish includes orgs not culturally EA but doing things that are considered so by a decent fraction of EAs: 0.62 (TBD)
• I will get to the office in time for lunch: 0.95 (true)
• I see [housemate] tonight before midnight: 0.88 (true)
• If I ask [attendee] ‘what did you think of [event I ran]?’, [they] will be strongly positive: 0.8 (false)
• I see [friend]’s dad tonight before midnight: 0.95 (forgot to notice)
• If I offer [person] a trial [they] will take it: 0.65 (true)
• If I look at [friend]’s most recent Tweet, it is linking to [their] blog: 0.8 (false)
• My weight is under [number] again before it is over [number] again: 0.75 (false)
• I do all of my Complice goals tomorrow: 0.3 (false)
• I will go to the office on Friday: 0.6 (true)
• I will read the relevant chapter of Yuval book by the end of 2nd: 0.1 (false)
• I weigh less than [weight] the first time I weigh myself tomorrow: 0.65 (true)
• Lunch includes no kind of fish meat: 0.43 (true)

And some examples of own-behavior marked predictions:

• Work goal as stated will be completed by end of day: start a document of feedback policies: 0.90 (true)
• I ok [person]’s post before the meeting: 0.68 (false)
• Work goal as stated, will be completed by Sunday 28th October: Respond to [person]: 0.80 (true)
• Work goal as stated will be completed by midnight Sunday 30 September 2018: read [person]’s research: 0.4 (false)
• Work goal as stated, will be completed by Sunday 4th November: Arrange to talk to [other researcher] about [employee] project […]: 0.3 (false)
• Work goal as stated will be completed by midnight Sunday 30 September 2018: Think for 1h about [colleague] thing: 0.5 (true)
• I have fewer than 1k emails in inbox at some point on Feb 10th: 0.87 (true)
• I have written to [brother] by Feb 10th: 0.82 (true)
• I will be home by 9pm: 0.97 (true)

Categories missing from these randomishly selected lists but notable in being particularly fun:

1. Predictions of history that I don’t know or remember, followed by looking it up on Wikipedia. A pretty fun combination of predicting things and reading Wikipedia.
2. Predictions of relationship persistence in successive episodes of Married at First Sight.

## Notes

1. I have a column where I write context on some predictions, which is usually that they are my own work goal, or otherwise a prediction about how I will behave. This graph excludes those, but keeps in some own-behavior predictions which I didn’t flag for whatever reason.

2. Except maybe more like art—do you know that feeling where you look at the sketch, and tilt your head from side to side, and say ‘no, a little bit more… just there….mmm…yes…’? It’s like that: ‘27%…no, a little more, 29%? No, 33%. 33%, yes.’ Except honestly it’s more ridiculous than that, because my brain often seems to have views about which particular digits should be involved. So it’s like, ‘23%…no, but mmm 3…33%, yes.’ I am generally in favor of question decomposition and outside views and all that, but to be clear, that’s not what I’m doing here. I might have been sometimes, but these are usually fast intuitive judgments.

comment by benjamincosman · 2022-10-12T13:25:25.115Z · LW(p) · GW(p)

> Possibly people overall are just better calibrated than I thought.

Note that the relevant reference class is not "people overall"; at the risk of overfitting, I'd say it should be something closer to "people who are mathematically literate, habitually make tons of predictions, and are at least aware of the concept of calibration". It is far less surprising (though still surprising, I think) that a member of this group is this well calibrated.

Replies from: GWS
comment by Stephen Bennett (GWS) · 2022-10-13T05:28:49.447Z · LW(p) · GW(p)

It's nice to see that Katja is pretty well calibrated. Congratulations to her!

I remember listening to a podcast that had Daniel Kahneman on as a guest. The host asked Daniel (paraphrasing) 'Hey, so people have all these biases that keep them from reasoning correctly. What could I do to correct them?', and Daniel responded 'Oh, there's no hope there. You're just along for the ride, system 1 is going to do whatever it wants' and I just felt so defeated. There's really no hope? There's not a way that we might think more clearly? I take this as a pretty big success, and a nice counterexample to Danny's claim that people are irredeemably irrational.

Replies from: AllAmericanBreakfast
comment by DirectedEvolution (AllAmericanBreakfast) · 2022-10-13T15:41:27.895Z · LW(p) · GW(p)

"Speak for yourself, Danny!"

comment by Nathan Helm-Burger (nathan-helm-burger) · 2022-10-12T13:35:36.104Z · LW(p) · GW(p)

I have tried calibration testing myself on stuff not about me or people I know. I've noticed that after a bit of practice, checking my curve every ten questions or so, I get quite accurate. When I take a break of a few months and try again, I notice that my curve has wandered and looks more like your curve for personal predictions. A bit of practice gets me back to accurately calibrated. I've repeated this process a few times and feel like the amount of wander-off-calibrated seems similar each time, even with varying length intervals (few months vs years). I wonder if the calibration would be stickier if I practiced longer/harder at it? Why am I consistently worse around 40/60 than around 70/30, 80/20, 90/10? What would a typical calibration curve look like for different age elementary school kids? Would their calibrations throughout life be better if they were taught this and rehearsed it every few months through 3rd to 5th grade?

comment by ztzuliios · 2022-10-13T21:03:01.334Z · LW(p) · GW(p)

> I don’t know what’s up with the 80% category

Interestingly, I've had the same issue: though I'm not as well calibrated at the lower levels as you are, I also have a noticeable calibration dip at around 80%.

comment by Ben (ben-lang) · 2022-10-13T11:03:32.656Z · LW(p) · GW(p)

Very interesting and well done. You didn't detail the methodology much. One issue I am interested in is subconscious cheating in the following way. I make a bunch of predictions saying "there is an 80% chance of X", and I find that only 60% of these X are actually occurring. So I recalibrate (which is fair). However, let's say I start putting a load of things that really should be in my 95% bucket into the 80% bucket, so calibrate too far the other way. I went from being wrong in one direction to wrong in the other, but the average will still look good. A lot of mistakes in one direction can be (consciously or not) made up for by intentionally leaning too far the other way later.
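A small sketch of that cancellation effect, using made-up outcomes rather than any real prediction data: two periods that are each off from 80% in opposite directions pool to a bucket that looks perfect. The variable names and counts are hypothetical.

```python
def hit_rate(outcomes):
    """Fraction of predictions in a bucket that came true."""
    return sum(outcomes) / len(outcomes)


# Hypothetical "80%" bucket, split by when the predictions were made.
early = [True] * 6 + [False] * 4   # overconfident period: 60% came true
late = [True] * 10                 # overcorrected period: 100% came true
pooled = early + late              # what a single calibration curve would show

early_rate = hit_rate(early)
late_rate = hit_rate(late)
pooled_rate = hit_rate(pooled)

print(f"early {early_rate:.0%}, late {late_rate:.0%}, pooled {pooled_rate:.0%}")
```

Splitting each bucket by time period, as here, is one way to see whether a good-looking average is hiding drift in both directions.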

comment by JacobW38 (JacobW) · 2022-10-13T06:25:58.913Z · LW(p) · GW(p)

It appears what you have is free won’t!

For the own-behavior predictions, could you put together a chart with calibration accuracy on the Y axis, and time elapsed between the prediction and the final decision (in buckets) on the X axis? I wonder whether the predictions became less-calibrated the farther into the future you tried to predict, since a broader time gap would result in more opportunity for your intentions to change.