Is there an intuitive way to explain how much better superforecasters are than regular forecasters?

post by William_S · 2020-02-19T01:07:52.394Z · LW · GW · 5 comments

This is a question post.

Is there an intuitive way to explain how much better superforecasters are than regular forecasters? I can look at the tables in https://www.researchgate.net/publication/277087515_Identifying_and_Cultivating_Superforecasters_as_a_Method_of_Improving_Probabilistic_Predictions, but I don't have an intuitive understanding of what Brier scores mean, so I'm not sure what to make of them.

Answers

answer by Daniel Kokotajlo · 2020-02-19T16:08:21.849Z · LW(p) · GW(p)

As ignoranceprior said, my AI Impacts post has three intuitive ways of thinking about the results:


Way One: Let’s calculate some examples of prediction patterns that would give you Brier scores like those mentioned above. Suppose you make a bunch of predictions with 80% confidence and you are correct 80% of the time. Then your Brier score would be 0.32, roughly middle of the pack in this tournament. If instead it was 93% confidence correct 93% of the time, your Brier score would be 0.132, very close to the best superforecasters and to GJP’s aggregated forecasts. In these examples, you are perfectly calibrated, which helps your score—more realistically you would be imperfectly calibrated and thus would need to be right even more often to get those scores.
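
The arithmetic in Way One can be checked with a minimal sketch. It assumes the two-outcome form of the Brier score (squared error summed over both "happens" and "doesn't happen", so scores run from 0 to 2), which is the form that reproduces the 0.32 figure; the function name `calibrated_brier` is made up for illustration. For 93% it gives roughly 0.13, close to the 0.132 quoted.

```python
def calibrated_brier(confidence: float) -> float:
    """Expected two-outcome Brier score for a perfectly calibrated forecaster.

    Assumes every forecast is made at the same confidence and that the
    forecaster turns out to be right exactly `confidence` of the time.
    """
    miss = 1.0 - confidence
    # When right, each of the two outcome terms is off by `miss`.
    score_when_right = 2 * miss ** 2
    # When wrong, each of the two outcome terms is off by `confidence`.
    score_when_wrong = 2 * confidence ** 2
    return confidence * score_when_right + miss * score_when_wrong


for c in (0.5, 0.8, 0.93):
    print(f"{c:.0%} confidence -> Brier {calibrated_brier(c):.3f}")
```

The expression simplifies to 2p(1 - p), so always answering 50% scores 0.5 on this scale, which is roughly the "no-skill" level for binary questions.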

Way Two: “An alternative measure of forecast accuracy is the proportion of days on which forecasters’ estimates were on the correct side of 50%. … For all questions in the sample, a chance score was 47%. The mean proportion of days with correct estimates was 75%…” According to the chart in that post, the superforecasters were on the right side of 50% almost all the time.

Way Three: “Across all four years of the tournament, superforecasters looking out three hundred days were more accurate than regular forecasters looking out one hundred days.” (Bear in mind, this wouldn’t necessarily hold for a different genre of questions. For example, information about the weather decays in days, while information about the climate lasts for decades or more.)

answer by Jsevillamol · 2020-02-19T13:55:37.264Z · LW(p) · GW(p)

Brier scores are scoring three things:

  • How uncertain the forecasting domain is (because of this, Brier scores are not comparable between domains: if I have a high Brier score in short-term weather predictions and you have a low Brier score on geopolitical forecasting, that does not imply I am a better forecaster than you)
  • How well-calibrated the forecaster is (e.g. we would say that a forecaster is well-calibrated if 80% of the predictions they assigned 80% confidence to actually come true)
  • How much information the forecaster conveys in their predictions (e.g. if I am predicting coin flips and say 50% all the time, my calibration will be perfect but I will not be conveying extra information)
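
These three components can be made concrete with a standard identity (often called the Murphy decomposition): Brier = reliability - resolution + uncertainty. The sketch below is an illustration, not something taken from Tetlock's papers. It assumes the single-outcome Brier score (0 to 1) and groups forecasts by their exact stated probability; the function names are hypothetical.

```python
from collections import defaultdict


def brier(forecasts, outcomes):
    """Mean squared error between stated probabilities and 0/1 outcomes."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)


def brier_decomposition(forecasts, outcomes):
    """Return (reliability, resolution, uncertainty) for binary forecasts.

    reliability: calibration error (0 means perfectly calibrated)
    resolution:  how far group hit rates move from the base rate (information)
    uncertainty: base_rate * (1 - base_rate), a property of the domain
    """
    n = len(forecasts)
    base_rate = sum(outcomes) / n
    groups = defaultdict(list)  # stated probability -> outcomes observed
    for f, o in zip(forecasts, outcomes):
        groups[f].append(o)
    reliability = 0.0
    resolution = 0.0
    for f, seen in groups.items():
        hit_rate = sum(seen) / len(seen)
        reliability += len(seen) * (f - hit_rate) ** 2
        resolution += len(seen) * (hit_rate - base_rate) ** 2
    return reliability / n, resolution / n, base_rate * (1 - base_rate)


# Coin flips: always saying 50% is calibrated but carries no information.
coin_outcomes = [1, 0, 1, 0]
print(brier_decomposition([0.5] * 4, coin_outcomes))

# A calibrated forecaster who also separates likely from unlikely events.
outcomes = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
forecasts = [0.8] * 5 + [0.2] * 5
print(brier_decomposition(forecasts, outcomes), brier(forecasts, outcomes))
```

In the coin-flip case both reliability and resolution are zero, so the score equals the domain's uncertainty term; in the second case the score drops below the uncertainty term because resolution is positive.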

Note that in Tetlock's research there is no hard cutoff from regular forecasters to superforecasters - he arbitrarily declared that the top 2% were superforecasters, and showed that 1) the top 2% of forecasters tended to remain in the top 2% between years and 2) that some of the techniques they used for thinking about forecasts could be shown in an RCT to improve the forecasting accuracy of most people.

answer by Davidmanheim · 2020-02-19T17:27:01.541Z · LW(p) · GW(p)

This is a bit complicated, but to start, we can answer this only for the types of questions on which we have empirical data about superforecasters. That's because the fact that superforecasters do better is an empirical observation, not a clear predictive/quantitative theory about what makes people better or worse. I'm going to use data from the AI Impacts blog post - https://aiimpacts.org/evidence-on-good-forecasting-practices-from-the-good-judgment-project-an-accompanying-blog-post/ - because I don't have the book or the datasets handy right now.

The original tournament was about short- and medium-term geopolitical and similar questions. The scoring used time-weighted Brier scores, and note that Brier scores themselves are question-set specific. For these questions, an aggregate of superforecaster predictions had Brier scores about 60-70% lower than the control group of "regular" forecasters. The best superforecaster had a score of 0.14, while the no-skill Brier score on these questions, which is what someone gets by just assigning equal probability to everything, is 0.53. But that's not the right comparison if we are comparing superforecasting to regular forecasting. The average of forecasters (including superforecasters, it seems) was close to 0.35. If we adjust to 0.4 to roughly remove superforecasters, 65% lower than that is 0.14 - the same score as the best superforecaster.
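
The comparison above is just a couple of multiplications; here is a small sketch using only the rough figures quoted in this answer. The 0.40 baseline is the hand adjustment described above, not a measured value, and the variable names are made up.

```python
no_skill_score = 0.53    # equal probability on every option (quoted above)
regular_score = 0.40     # rough guess for forecasters minus superforecasters
reduction = 0.65         # midpoint of the quoted 60-70% range

# Score implied for the superforecaster aggregate.
aggregate_score = regular_score * (1 - reduction)

# The same calculation at the ends of the 60-70% range.
low_end = regular_score * (1 - 0.70)
high_end = regular_score * (1 - 0.60)

# Improvement of that aggregate relative to the no-skill baseline instead.
improvement_vs_no_skill = 1 - aggregate_score / no_skill_score

print(f"aggregate: {aggregate_score:.2f} (range {low_end:.2f} to {high_end:.2f})")
print(f"improvement over no-skill: {improvement_vs_no_skill:.0%}")
```

Which baseline you divide by changes the headline percentage, which is why the answer insists on comparing against regular forecasters and not against the no-skill score.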

How is that possible? Aggregation. And the benefits of aggregation aren't due to the skill of superforecasters; they are due to the law of large numbers. So maybe we don't want to give the superforecasters credit for being better, but superforecasting does include it.
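
A toy simulation illustrates the aggregation effect. Everything in it is an assumption made for illustration: forecasters see the true probability plus independent Gaussian noise, the aggregate is a plain unweighted mean (not GJP's actual aggregation method), and scores use the single-outcome Brier form.

```python
import random


def squared_error(p: float, outcome: int) -> float:
    """Single-outcome Brier contribution of one forecast."""
    return (p - outcome) ** 2


def simulate(n_questions=2000, n_forecasters=20, noise=0.2, seed=0):
    """Return (mean individual Brier, Brier of the averaged forecast)."""
    rng = random.Random(seed)
    individual_total = 0.0
    aggregate_total = 0.0
    for _ in range(n_questions):
        true_p = rng.random()
        outcome = 1 if rng.random() < true_p else 0
        # Each forecaster sees the truth through independent noise,
        # clipped so the forecast stays a valid probability.
        forecasts = [
            min(1.0, max(0.0, true_p + rng.gauss(0, noise)))
            for _ in range(n_forecasters)
        ]
        individual_total += sum(
            squared_error(f, outcome) for f in forecasts
        ) / n_forecasters
        mean_forecast = sum(forecasts) / n_forecasters
        aggregate_total += squared_error(mean_forecast, outcome)
    return individual_total / n_questions, aggregate_total / n_questions


individual, aggregate = simulate()
print(f"average individual Brier: {individual:.3f}")
print(f"Brier of averaged forecast: {aggregate:.3f}")
```

Because squared error is convex, the averaged forecast can never score worse than the average of the individual scores on any question, whatever the forecasters' skill. The size of the gain in a real tournament depends on how independent the forecasters' errors are.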

So how do we understand a Brier score? It's the average squared distance from being correct, i.e. 1 or 0. That means that a Brier score of 0.14 means that on average, you predicted that things that did / did not happen were roughly 65% / 35% likely. But we had a time-weighted average score - if someone predicted 50% on day 1, and went steadily down to 20% at the close of the question, and it resolves negatively, their average prediction is 35%, and their Brier score is roughly 0.14.
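
The worked example can be checked numerically. This sketch assumes the single-outcome Brier score (squared distance from 0 or 1, so scores run from 0 to 1), a forecast that moves linearly from 50% to 20%, and equal weight on every day; the function names and the 100-day horizon are made up. Under those assumptions the result lands around 0.13, in the neighborhood of the 0.14 quoted. On the two-outcome scale, where always saying 50% scores 0.5, the same forecasts would score twice as much.

```python
import math


def time_averaged_brier(start, end, outcome, days=100):
    """Average daily squared error for a forecast moving linearly over time."""
    forecasts = [start + (end - start) * d / (days - 1) for d in range(days)]
    return sum((f - outcome) ** 2 for f in forecasts) / days


def typical_miss(score):
    """Distance from the truth that a constant forecast needs to get `score`."""
    return math.sqrt(score)


drifting = time_averaged_brier(0.5, 0.2, outcome=0)
constant = time_averaged_brier(0.35, 0.35, outcome=0)
print(f"drifting 50% -> 20%: {drifting:.3f}")
print(f"constant 35%:        {constant:.3f}")
print(f"a score of 0.14 matches a typical miss of {typical_miss(0.14):.2f}")
```

The drifting forecast scores slightly worse than a constant 35% because squaring penalizes the early days near 50% more than it rewards the late days near 20%.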

answer by romeostevensit · 2020-02-19T02:40:00.736Z · LW(p) · GW(p)

One way is how far out people can predict before their predictions get as noisy as chance. One of the surprising findings of the GJP was that even the best forecasters decayed to chance around the one-year mark (IIRC?). Normal people can't even predict what has already happened (they object to or can't coherently update on basic facts about the present).

5 comments


comment by ignoranceprior · 2020-02-19T02:57:45.670Z · LW(p) · GW(p)

This AI Impacts article includes three intuitive ways to think about the findings.