# Range and Forecasting Accuracy

post by niplav · 2020-11-16T13:06:45.184Z · LW · GW · 17 comments

*cross-posted from niplav.github.io*

This text looks at the accuracy of forecasts in relation to the time between forecast and resolution, and asks three questions: First, is accuracy higher for shorter ranges between forecasts? Second, is it higher between questions? Third, is it higher within questions? These questions are analyzed using data from PredictionBook and Metaculus. The answers turn out to be yes, no and yes for Metaculus data, and no, no and no for PredictionBook data. Possible reasons are discussed.


Above all, don’t ask what to believe — ask what to anticipate. Every question of belief should flow from a question of anticipation, and that question of anticipation should be the center of the inquiry. Every guess of belief should begin by flowing to a specific guess of anticipation, and should continue to pay rent in future anticipations. If a belief turns deadbeat, evict it.

*– Eliezer Yudkowsky, “Making Beliefs Pay Rent (in Anticipated Experiences)”, 2007*

Probabilistic forecasting that aggregates both qualitative and quantitative methods is a comparatively simple idea. Basically, one needs only very few tools at one's disposal to be ready to start forecasting:

- View of belief as probabilistic (perhaps with some Bayesian epistemology)
- Track records (grading results of forecasts using for example Brier scores or log scores)
- Probability theory (a concept of probabilities, and maybe some simple probability distributions)

Since the 1980s, forecasting has slowly but surely matured from "X is going to happen because divine revelation told me so" to "my probability distribution on the outcome of this random variable is an X distribution with the following parameters", or alternatively "I assign a probability of X% to this event".

However, since this kind of forecasting is relatively recent, information about the accuracy of long-range forecasting is basically non-existent:

- Long-range forecasts are often stated too imprecisely to be judged for accuracy.
- Even if a forecast is stated precisely, it might be difficult to find the information needed to check the forecast for accuracy.
- Degrees of confidence for long-range forecasts are rarely quantified.
- In most cases, no comparison to a “baseline method” or “null model” is possible, which makes it difficult to assess how easy or difficult the original forecasts were.
- Incentives for forecaster accuracy are usually unclear or weak.
- Very few studies have been designed so as to allow confident inference about which factors contributed to forecasting accuracy.
- It’s difficult to know how comparable past forecasting exercises are to the forecasting we do for grantmaking purposes, e.g. because the forecasts we make are of a different type, and because the forecasting training and methods we use are different.

*– Luke Muehlhauser, “How Feasible Is Long-range Forecasting?”, 2019*

In this text, I will try to look at the accuracy of short-term and mid-term forecasting, which may shine some light on the relation between the range of forecasts and their accuracy in general.

The *range* of a forecast is defined as the length of the timespan between
the forecast and the resolution of the forecast. Keeping with Muehlhauser
2019, I will define short-term
forecasts as forecasts with a range of less than a year, mid-range
forecasts as forecasts with a range between 1 and 10 years, and
long-term forecasts as forecasts with a range of more than 10 years
(this distinction is not central to the following analysis, though).

Fortunately, for short- and mid-range forecasts, two easily accessible sources of forecasts and their resolutions are available online: The two forecasting websites PredictionBook and Metaculus, frequented mostly by hobbyists.

To find out about the range of forecasts, I download and parse forecasting data from these sites using Python, and then analyze the data using Klong.

## Metaculus and PredictionBook

PredictionBook and Metaculus are both forecasting websites for hobbyists. They are not prediction markets, but rather function on the basis of merit and track records: although you don't win money by being right, you can still boast about it (it is an open question whether other people will be impressed). Besides that, these sites make it easier to train one's calibration on real-world questions and become less wrong in the process.

However, the two sites differ in their approach to writing questions and judging and scoring forecasts. PredictionBook is much older than Metaculus: the former was first released in 2008, the latter started in 2015. It is also much less formal than Metaculus: it doesn't require stringent resolution criteria, making it possible for anybody to judge a question (regardless of whether the person has even made a prediction on the question themselves!), while Metaculus requires a short text explaining the context and resolution criteria for a question, with the questions being resolved by moderators or admins. This leads to Metaculus having fewer questions than PredictionBook, but each question having more predictions on it. Of the two, Metaculus is much more featureful: It supports not only binary questions, but also range questions with probability distributions, comment threads, closed questions (questions that haven't yet been resolved, but that can't be predicted on), three different kinds of scores (the Brier score, and a logarithmic scoring rule for discrete and continuous forecasts each), as well as the Metaculus prediction, a weighted aggregation of the forecasts of the best forecasters on the site.

Another significant difference between these two websites is the amount of data they publish: PredictionBook shows every single forecast made, while on Metaculus one can only see the community forecast (a time-weighted median of the forecasts made on the question). This is relevant for this analysis: The two approaches must be analysed separately.

## Getting the Data

First of all, the data for both platforms needs to be made available in a reasonable format. This works nicer for Metaculus, and is a bit more difficult to achieve for PredictionBook.

The resulting data from Metaculus is here, for PredictionBook it's available here.

### For Metaculus

The Metaculus data is relatively easy to obtain: The forecasts are available on a JSON API at `https://www.metaculus.com/api2/questions/?page=`. Fortunately, gimpf has already published a collection of scripts for fetching & analysing Metaculus data. I reused their script `fetch` to download the raw JSON. I then converted the distinct page objects in the generated file to a list of questions:

```
$ cd /usr/local/src
$ git clone https://github.com/gimpf/metaculus-question-stats
$ cd metaculus-question-stats
$ ./fetch
$ z site
$ jq -s '[.]|flatten' </usr/local/src/metaculus/data-questions-raw.json >data/metaculus.json
```

The resulting data is available here.

I then wrote a python script to convert the JSON data to CSV in the form `id,questionrange,result,probability,range`, while also filtering out yet unresolved questions and range questions. Here, `id` is a unique numerical ID per question, which will come in handy later, `questionrange` is the duration between the time for creating and resolving the question, `result` is the result of the question (either 0 or 1), `probability` is the probability given by the predictor, and `range` is the duration between the forecast and the resolution.

The script is not terribly interesting: It just reads in the JSON data, parses and traverses it, printing the CSV in the process.

Code:

```
#!/usr/bin/env python3
import json
import time
from time import mktime

with open("../../data/metaculus.json") as f:
    jsondata=json.load(f)

for page in jsondata:
    for question in page["results"]:
        # keep only binary questions that have resolved to 0 or 1
        if question["possibilities"]["type"]=="binary" and (question["resolution"]==1 or question["resolution"]==0):
            # timestamps come with or without fractional seconds
            try:
                restime=time.strptime(question["resolve_time"],"%Y-%m-%dT%H:%M:%S.%fZ")
            except ValueError:
                restime=time.strptime(question["resolve_time"],"%Y-%m-%dT%H:%M:%SZ")
            try:
                createtime=time.strptime(question["created_time"],"%Y-%m-%dT%H:%M:%S.%fZ")
            except ValueError:
                createtime=time.strptime(question["created_time"],"%Y-%m-%dT%H:%M:%SZ")
            for pred in question["prediction_timeseries"]:
                # note: mktime() reads the parsed time as local time, while
                # pred["t"] is assumed to be a Unix timestamp; outside of
                # TZ=UTC, timediff is shifted by the local UTC offset
                timediff=mktime(restime)-pred["t"]
                qtimediff=mktime(restime)-mktime(createtime)
                print("{},{},{},{},{}".format(question["id"], qtimediff, question["resolution"], pred["community_prediction"], timediff))
```

The resulting CSV file contains nearly 50k predictions.

### For PredictionBook

As far as I know, PredictionBook doesn't publish its data over an API. However, all individual predictions are visible on the web, which means I had to parse the HTML itself using BeautifulSoup.

This time the code is more complex, but just slightly so: It starts at the first page of predictions, and loops down to the last one, every time iterating through the questions on that page.

It then loops through the predictions on each question and parses out the date for the prediction and the credence.

Every question on PredictionBook has two dates related to its resolution: the 'known on' date, for which the resolution was originally planned, and by which the result should be known, and the 'judged on' date, on which the resolution was actually made. I take the second date to avoid predictions with negative differences between prediction and resolution time.

The output of this script is in the same format as the one for Metaculus data: `id,questionrange,result,probability,range` (although here `probability` can also be 0 and 1, which Metaculus doesn't allow).

Code:

```
#!/usr/bin/env python2
import urllib2
import time
from bs4 import BeautifulSoup
from time import mktime

def showforecasts(linkp, res):
    # fetch the page of a single question
    urlp="https://predictionbook.com{}".format(linkp)
    reqp=urllib2.Request(urlp, headers={"User-Agent" : "Firefox"})
    try:
        conp=urllib2.urlopen(reqp, timeout=10)
    except (urllib2.HTTPError, urllib2.URLError) as e:
        return
    datap=conp.read()
    soupp=BeautifulSoup(datap, "html.parser")
    # the "Created by" paragraph holds the creation and the judgement date
    timedata=soupp.find(lambda tag:tag.name=="p" and "Created by" in tag.text)
    resolved=timedata.find("span", class_="judgement").find("span", class_="date created_at").get("title")
    restime=time.strptime(resolved,"%Y-%m-%d %H:%M:%S UTC")
    created=timedata.find("span", class_="date").get("title")
    createtime=time.strptime(created,"%Y-%m-%d %H:%M:%S UTC")
    responses=soupp.find_all("li", class_="response")
    for r in responses:
        # skip responses that are only comments without a credence
        forecasts=r.find_all("span", class_="confidence")
        if forecasts!=[]:
            est=float(r.find_all("span", class_="confidence")[0].text.strip("%"))/100
        else:
            continue
        estimated=r.find("span", class_="date").get("title")
        esttime=time.strptime(estimated,"%Y-%m-%d %H:%M:%S UTC")
        print("{},{},{},{},{}".format(linkp.replace("/predictions/", ""), mktime(restime)-mktime(createtime), res, est, mktime(restime)-mktime(esttime)))

# walk through the overview pages and visit every judged question
for page in range(1,400):
    url="https://predictionbook.com/predictions/page/{}".format(page)
    req=urllib2.Request(url, headers={"User-Agent" : "Firefox"})
    try:
        con=urllib2.urlopen(req)
    except (urllib2.HTTPError, urllib2.URLError) as e:
        continue
    data=con.read()
    soup=BeautifulSoup(data, "html.parser")
    predright=soup.find_all("li", {"class": "prediction right"})
    predwrong=soup.find_all("li", {"class": "prediction wrong"})
    for pred in predright:
        linkp=pred.span.a.get("href")
        showforecasts(linkp, "1.0")
    for pred in predwrong:
        linkp=pred.span.a.get("href")
        showforecasts(linkp, "0.0")
```

Surprisingly, both platforms had almost the same number of individual predictions on binary resolved questions: ~48k for Metaculus, and ~44k for PredictionBook.

## Three Different Analyses: An Illustrative Example

In this text, I analyze the relation between accuracy and range in forecasting, considering three different aspects:

- Between forecasts
- Between questions
- Within questions

What exactly does this mean?

Let's say there are two people: Bessie and Heloïse. They are trying to make predictions about the weather over different time horizons (it is currently the end of August):

1. Will it rain tomorrow? (resolution: no/0)
2. Will the average temperature in August in 1 year be higher than 20°C? (resolution: no/0)

Let's say that they make the following predictions:

- Bessie: 0.3 for question 1, 0.85 for question 2
- Heloïse: 0.1 for question 1, 0.6 for question 2

Let's also say that they make their predictions in alphabetical order of their names, one hour after another (Bessie at 00:00 and Heloïse at 01:00).

### Judging Between Forecasts

Evaluating the relation between forecasts works as follows: Each forecast, its resolution and its timespan are independently analyzed.

We have four predictions:

- One with a range of 23 hours, a probability of 0.1 (Heloïse's prediction on 1), and a resolution of 0
- One with a range of 24 hours, a probability of 0.3 (Bessie's prediction on 1), and a resolution of 0
- One with a range of 8759 hours (it's not a leap year), a probability of 0.6 (Heloïse's prediction on 2), and a resolution of 0
- One with a range of 8760 hours, a probability of 0.85 (Bessie's prediction on 2), and a resolution of 0

The Brier scores for ranges are then 0.01 for 23h, 0.09 for 24h, 0.36 for 8759h, and 0.7225 for 8760h. Here, higher range between forecasts is correlated with worse performance.

### Judging Between Questions

Judging the performance between questions now means looking at the forecasts made on each question and evaluating the performance of forecasts on that question.

Question 1 has a range of 24h, and question 2 has a range of 8760h. The Brier score for predictions on question 1 is 0.05, and the Brier score for predictions on question 2 is 0.54125. In this case, a higher range seems to be worse for performance on questions (Brier scores are lower/better for question 1).

### Judging Within Questions

Within questions one examines each question separately.

On question 1, the forecast with the higher range has a Brier score of 0.09, and the forecast with the lower range has a Brier score of 0.01. So for question 1, higher range is correlated with worse performance.

For question 2, it is similar: the forecast with the higher range (8760h) has a score of 0.7225, while the forecast with the lower range (8759h) has a score of 0.36. Here, too, higher range is correlated with worse performance.

One can now try to aggregate the findings from the two questions and could tentatively conclude that generally range within questions is correlated negatively with accuracy of forecasts.

These were of course only illustrative examples, but I hope that now the different approaches in this text are clearer than before.
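The toy example can be reproduced in a few lines of Python. This is only a minimal sketch: the tuple layout and all names are chosen here for illustration, ranges are given in hours, and the range of a question is assumed to be the longest forecast range on it (which matches the 24h and 8760h used above).

```python
from collections import defaultdict

# (question, forecaster, range in hours, probability, outcome)
forecasts = [
    (1, "Bessie", 24, 0.3, 0),
    (1, "Heloïse", 23, 0.1, 0),
    (2, "Bessie", 8760, 0.85, 0),
    (2, "Heloïse", 8759, 0.6, 0),
]

def brier(prob, outcome):
    """Brier score of a single forecast (its squared error)."""
    return (prob - outcome) ** 2

# Between forecasts: one (range, score) pair per forecast, question ignored.
between_forecasts = sorted((rng, brier(p, o)) for _, _, rng, p, o in forecasts)

# Collect the (range, score) pairs of every question.
by_question = defaultdict(list)
for q, _, rng, p, o in forecasts:
    by_question[q].append((rng, brier(p, o)))

# Between questions: question range paired with the mean score on the question.
between_questions = {
    q: (max(r for r, _ in pairs), sum(s for _, s in pairs) / len(pairs))
    for q, pairs in by_question.items()
}

# Within questions: is the longest-range forecast worse than the shortest one?
within_questions = {
    q: max(pairs)[1] > min(pairs)[1] for q, pairs in by_question.items()
}

print(between_forecasts)
print(between_questions)
print(within_questions)
```

With only two forecasts per question, the within-question comparison reduces to a single inequality; with real data one would compute a correlation or regression per question instead.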

## Accuracy Between Forecasts

The first approach I took was to simply take the probability and result for each forecast, and calculate the Brier score for that one probability. I then calculated the correlation and the linear regression between that Brier score and the range of the forecast.

### Analysis

Now that the two datasets are available, they can be properly analyzed.

First, the raw data is loaded from the two CSV files, removing the first line (the names of the variables, for other languages such as R). Then the ID is converted to integer, and the rest of the fields are converted to floats (the range is a float for some Metaculus questions, and while the result can only take on 0 or 1, using float there makes it easier to calculate the brier score using `mse.set`). After that, negative ranges are removed from the dataset.

```
.fc(.ic("../../data/pb.csv"));pbraw::csv.load()
.fc(.ic("../../data/met.csv"));metraw::csv.load()
pbdata::+flr({0<*|x};{(1:$*x),1.0:$'1_x}'1_pbraw)
metdata::+flr({0<*|x};{(1:$*x),1.0:$'1_x}'1_metraw)
```
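For readers who don't know Klong, the loading step can be approximated in Python. The following is a sketch, not a translation of the code above: `load_forecasts` is a hypothetical helper, and it assumes a header line followed by rows with exactly the five columns `id,questionrange,result,probability,range`. Like the Klong filter, it keeps only rows with a range greater than 0.

```python
import csv
import io

def load_forecasts(handle):
    """Read forecast rows from a CSV file object, skip the header line
    and drop rows whose range is not positive."""
    rows = []
    reader = csv.reader(handle)
    next(reader, None)  # skip the header line with the variable names
    for rec in reader:
        if len(rec) != 5:
            continue  # ignore empty or malformed lines
        qid = int(rec[0])
        qrange, result, prob, rng = (float(v) for v in rec[1:])
        if rng > 0:
            rows.append((qid, qrange, result, prob, rng))
    return rows

# Tiny in-memory example standing in for the real CSV files
sample = io.StringIO(
    "id,questionrange,result,probability,range\n"
    "1,1000.0,0,0.3,500.0\n"
    "1,1000.0,0,0.2,-20.0\n"
)
print(load_forecasts(sample))
```

For the real data one would pass an opened file, e.g. `open("../../data/met.csv")`, instead of the `StringIO` object.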

#### Why Some Negative Ranges?

This code filters out forecast ranges smaller than 0, which is necessary because the data contains some forecasts with negative ranges. These stem from two different sources:

In the Metaculus data, these are forecasts on questions that have resolved retroactively. These occur in the scenario where forecasters predict on a question where the resolution time is not clear, and the resolution occurs before the question closes. To prevent an unfair advantage of people who predicted while the resolution was unfolding (and therefore predicting on an event that had happened in the past), the resolution date is set some timespan before the resolving event (e.g. a day). However, the predictions after the retroactive resolution are still included in the data.

Examples:

- Will Iran execute or be targeted in a national military attack between 6 June 2019 and 5 October 2019?
- Will EA Global San Francisco be cancelled or rescheduled due to COVID-19?

For PredictionBook, users can still predict after any resolution. The script fetches the first resolution, making some predictions retroactive. I could instead retrieve the result of the last resolution, but I'm not sure it would be worth the effort, or improve the quality of the data very much.

Examples:

- Total deaths due to coronavirus in the Netherlands will go over >5000 by the end of April.
- Matt will be happy he will no longer be able to be instantly Rick rolled

In the next step, I extracted the individual variables from the data and gave them names (handling the various indices was tiresome after a while). `ress` stands for results, `fcs` for forecasts, and `rngs` for ranges:

```
metress::metdata@2
metfcs::metdata@3
metrngs::metdata@4
pbress::pbdata@2
pbfcs::pbdata@3
pbrngs::pbdata@4
```

The Brier Score is a scoring rule for binary forecasts. It takes into account both calibration and resolution by basically being the mean squared error of forecast ($f_{t}$) and outcome ($o_{t}$):

```
$$BS=\frac{1}{N}\sum_{t=1}^{N}(f_{t}-o_{t})^{2}$$
```

In Klong, it's easy to implement (and also available through
the function `mse.set`

):

```
brier::{mu((x-y)^2)}
```

Now, one can calculate the brier score for each of the forecasts and outcomes, with the mean being unnecessary, because there is only one datapoint for each application:

```
metbriers::(metress-metfcs)^2
pbbriers::(pbress-pbfcs)^2
```
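The same two computations can be written in Python. This is a sketch with hypothetical function names (`brier` for the mean over several forecasts, `brier_each` for the per-forecast squared errors); the probabilities in the example are made up.

```python
def brier(forecasts, outcomes):
    """Mean squared error between probabilities and 0/1 outcomes."""
    if len(forecasts) != len(outcomes):
        raise ValueError("need one outcome per forecast")
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

def brier_each(forecasts, outcomes):
    """Per-forecast Brier scores: the squared error of every single forecast."""
    return [(f - o) ** 2 for f, o in zip(forecasts, outcomes)]

probs = [0.3, 0.1, 0.85]
results = [0.0, 0.0, 1.0]
print(brier_each(probs, results))
print(brier(probs, results))
```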

### Results

First, one can check how high the range of these two datasets really is. The PredictionBook forecasts with the highest range span 3730 days (more than 10 years), for Metaculus it's 1387 days (nearly 4 years):

```
(|/metrngs)%(24*60*60)
1387.01877932435104
(|/pbrngs)%(24*60*60)
3730.00945601851852
```

One can now look at the correlation between range and Brier score first for Metaculus, and then for PredictionBook:

```
cor(metbriers;metrngs)
0.0216592389375953837
cor(pbbriers;pbrngs)
-0.0202455558749736788
```

For Metaculus, the results are not very surprising: The positive correlation tells us that the higher the range of a forecast, the lower the accuracy (or, poetically, at Metaculus the fogs of time grow thicker the farther you want to look into the future).

However, for PredictionBook, the opposite is true (on this dataset): Forecasts with higher ranges give more accurate predictions, at least on average.

However, these correlations are quite weak; 0.02 could just be random noise. I would have to use a significance test to discern whether they are statistically significant.
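One standard check would be the t-statistic for a Pearson correlation, $t = r\sqrt{(n-2)/(1-r^2)}$. The sketch below computes it together with a two-sided p-value from a normal approximation. Two caveats apply: the test assumes independent datapoints, which is doubtful here because many forecasts are made on the same question, so it would overstate significance; and `n = 48000` is only the approximate number of Metaculus forecasts mentioned earlier, not an exact count. The function names are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def correlation_p_value(r, n):
    """t-statistic and two-sided p-value for H0 'the true correlation is 0'.

    Uses t = r*sqrt((n-2)/(1-r^2)) and approximates the t-distribution
    by a normal distribution, which is only reasonable for large n."""
    t = r * math.sqrt((n - 2) / (1 - r * r))
    return t, math.erfc(abs(t) / math.sqrt(2))

# r as reported above for Metaculus; n is an approximation (see lead-in)
t, p = correlation_p_value(0.0216592389375953837, 48000)
print(t, p)
```

A permutation test that shuffles whole questions rather than single forecasts would be one way to relax the independence assumption.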

Now, one can also perform a linear regression to gauge what the relation of range and accuracy of a forecast is. For this, I first create an x/y set with the brier score of a forecast on the y axis and the range on the x axis:

```
mettab::+(,metrngs),,metbriers
pbtab::+(,pbrngs),,pbbriers
```

Now, a linear regression is easy:

```
lreg(mettab)
[0.000000000172708060226394773 0.167538673280194372]
lreg(pbtab)
[-0.000000000102929939681891687 0.16348453886964792]
```

These are not particularly surprising. The inferred brier score at range 0 (the forecast directly before resolution) is ~0.16, which seems a bit pessimistic, but other than that, the signs match the correlations: the brier score grows with higher ranges for the Metaculus data (lower accuracy) and shrinks with higher ranges for the PredictionBook data (higher accuracy). The slopes are so small because the ranges are in seconds.
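Since the ranges are in seconds, the slopes are easier to read when converted to a change in Brier score per year of range. The following sketch shows an ordinary least squares fit (`linreg` is a hypothetical stand-in for Klong's `lreg`, returning slope and intercept) and the unit conversion for the two slopes reported above, assuming a year of 365 days.

```python
def linreg(xs, ys):
    """Ordinary least squares fit y = slope*x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

SECONDS_PER_YEAR = 60 * 60 * 24 * 365

# Slopes reported above, in Brier score per second of range
met_slope = 0.000000000172708060226394773
pb_slope = -0.000000000102929939681891687

# The same slopes expressed as change in Brier score per year of range
print(met_slope * SECONDS_PER_YEAR)  # roughly 0.0054
print(pb_slope * SECONDS_PER_YEAR)   # roughly -0.0032
```

Read this way, one additional year of range shifts the fitted Brier score by well under 0.01 in either dataset.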

Visualizing the forecasts with scatterplots and linear regressions shows a very similar picture (red dots are for Metaculus forecasts, blue dots are for PredictionBook forecasts):

*Scatterplot with linear regression for Metaculus & PredictionBook forecasts by range (in seconds)*

The high amounts of noise are probably due to the low number of predictions for single days (or, in the case of weeks and months, for years/months with a high range, as not enough questions with this range have resolved yet).

### Why Assume Accuracy will Increase?

I believe that the finding for the PredictionBook data is quite surprising.

A priori, one would believe that beliefs about the near future are generally more accurate than beliefs about the far future: We can predict the weather in 2 minutes far better than the weather in 6 months, we can say much more about the position of a rock in an hour than in 100 years, more about the popularity of a political party in 2 months as opposed to 10 years. Even in reasonably chaotic systems, one should expect to become more and more accurate the closer one comes to the expected time.

Take, for example, a double pendulum: I am totally able to predict its position & velocity 100ms before resolution time, but 1s before and it's already getting more difficult. Information, like nearly everything else, has diminishing value; posteriors converge continuously towards truth.

### Possible Explanations

So, what is the reason for this rather weird finding? Several possible reasons come to mind.

#### Range and Biased Questions

The most obvious solution is that the analysis above is absolutely bogus and completely meaningless: It compares questions about global catastrophic risks to questions about the extinction of banana brands, very different kinds of questions with very different kinds of forecasts.

Here, one would assume that the longer-term questions asked are generally easier to predict, and that the effect goes away when one compares predictions among very similar questions (or, better, within questions).

Generally, the long-term questions we prefer asking seem to be more amenable to forecasting than short-term questions: the development of population sizes, the climate, and especially the movement of interstellar bodies are much more thoroughly modelled than the development of markets, elections and the weather. This is of course only a weak trend, but one that could influence the questions (as will be investigated in this section).

##### Simpson's Paradox

Simpson's Paradox is the phenomenon that two features can be correlated in one direction in a dataset as a whole while being correlated in the opposite direction within distinct subsets of that dataset.

It is best explained visually:

It might be the case that this analysis has come up against an instance of Simpson's paradox: The accuracy of forecasts is negatively correlated with range within the same question, but the accuracy of forecasts is positively correlated with range across questions (because the kinds of questions with longer time horizons generally allow more accurate forecasts). Unfortunately, whether Simpson's paradox applies or not can not always be easily judged from the scatterplot of datapoints.
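A small numerical illustration of this scenario, using entirely synthetic numbers (two made-up questions, not taken from either dataset): within each question the Brier score gets worse with range, but the long-range question is easier overall, so pooling the forecasts flips the sign of the correlation.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def cor(points):
    """Correlation of range and Brier score for a list of (range, score)."""
    xs, ys = zip(*points)
    return pearson(xs, ys)

# Synthetic (range, Brier score) points for a short-range question:
# a longer range goes with a worse (higher) score.
short_q = [(1, 0.30), (2, 0.35), (3, 0.40)]
# Same pattern for a long-range question, but it is easier overall.
long_q = [(10, 0.05), (11, 0.10), (12, 0.15)]

print(cor(short_q), cor(long_q))  # positive within each question
print(cor(short_q + long_q))      # negative for the pooled forecasts
```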

#### Low Sample Sizes With High Ranges

Another question one might ask is: How big are the sample sizes at the tails when the range is high?

This is important: low sample sizes increase noise dramatically, and make findings much less reliable.

To get a rough overview of the sample sizes, one can look at the number of samples for each bucket. The sample sizes were calculated such that at position i in the array `{pb,met}ss` was the sample size for week i:

```
metss::_metrngs%60*60*24*7
metss::metss@<metss
pbss::_pbrngs%60*60*24*7
pbss::pbss@<pbss
maxval::|/pbss,metss
maxlen::|/(#'=metss),#'=pbss
pbss::{#pbss?x}'1+!maxval
metss::{#metss?x}'1+!maxval
```
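The bucketing idea can be sketched in Python as follows. This is not a line-by-line translation of the Klong code: `weekly_sample_sizes` is a hypothetical helper, and in this sketch position 0 holds the forecasts with a range of less than one week.

```python
from collections import Counter

SECONDS_PER_WEEK = 60 * 60 * 24 * 7

def weekly_sample_sizes(ranges):
    """Number of forecasts per whole week of range.

    Position i of the result holds the number of forecasts whose range,
    given in seconds, lies in week i (week 0 = less than seven days)."""
    weeks = Counter(int(r // SECONDS_PER_WEEK) for r in ranges)
    return [weeks.get(w, 0) for w in range(max(weeks, default=-1) + 1)]

day = 60 * 60 * 24
# Made-up ranges of 1, 2, 8 and 15 days
print(weekly_sample_sizes([1 * day, 2 * day, 8 * day, 15 * day]))
```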

I generated charts for the sample sizes in weeks:

```
ssplot::.oc("ss_plot.eps")
.tc(ssplot)
setrgb(0;0;0)
grid([0],maxlen,(maxlen:%20);[0],maxval,(maxval:%20))
xtitle("Range (in weeks)")
ytitle("Number of predictions")
setrgb(0;0;1)
segplot(metss)
setrgb(1;0;0)
segplot(pbss)
draw()
.fl()
.cc(ssplot)
```

*Sample sizes for predictions with a range (in weeks), sorted and graphed.*

The red graph stands for Metaculus sample sizes, the blue graph stands for PredictionBook sample sizes.

As one can see, the sample sizes have a drastic skew towards recent predictions, not surprising for relatively young platforms (although 10 years for PredictionBook is sizable by internet standards, it's not that much compared to the expected range of some predictions on the platform, which might go into the thousands of years).

This can be seen in the data as well: The median range of Metaculus and PredictionBook predictions is only a few months, and less than 25% of predictions have a range of more than one year:

```
Q(0.25;metrngs%60*60*24*365)
0.0937919114852302448
Q(0.5;metrngs%60*60*24*365)
0.34114799258678412
Q(0.75;metrngs%60*60*24*365)
0.917333030834854515
Q(0.25;pbrngs%60*60*24*365)
0.0435826674277016743
Q(0.5;pbrngs%60*60*24*365)
0.308100377346524606
Q(0.75;pbrngs%60*60*24*365)
0.977603754439370878
```
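The quantiles above are ranges in years (seconds divided by the number of seconds in a 365-day year). The share of forecasts beyond one year can also be computed directly, as in this sketch with a hypothetical helper and made-up ranges:

```python
SECONDS_PER_YEAR = 60 * 60 * 24 * 365

def share_longer_than(ranges, years=1.0):
    """Fraction of forecasts whose range (in seconds) exceeds `years`."""
    if not ranges:
        raise ValueError("need at least one range")
    limit = years * SECONDS_PER_YEAR
    return sum(1 for r in ranges if r > limit) / len(ranges)

day = 60 * 60 * 24
# Made-up ranges: one month, four months, two years
example = [30 * day, 120 * day, 730 * day]
print(share_longer_than(example))
```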

I hope that the dataset becomes richer the older these platforms become.

Because in the linear regression all datapoints are weighted equally, it could very well be that a tiny bit of noise at the tails dominates the entire regression.

## Accuracy Between Questions

Another way to look at the relation between forecasting accuracy and range is to look at the range of questions and not of individual forecasts.

In this case, this means taking the forecasts on all questions with a given range and calculating the brier score on these forecasts.

### Determining the Range of a Question

The range of a question is determined by taking the time difference between the opening time (the time when the first prediction on the question could have been made) and the resolution time. One could imagine other metrics to determine the range of a question: the mean range for forecasts of that question, the median range for forecasts on that question, time differences between writing/opening and closing/resolution times of the question, and probably many more.

I chose the time difference between opening time and resolution time for three reasons:

First, I had no clear idea about the time when people were making forecasts on questions. Are most of the forecasts made just after opening, or just before closing? Or is the distribution uniform on the time between opening and closing? And are these distributions different on long-range as opposed to short-range questions? Also, I was unsure whether taking the mean time for forecasts would just be the same as comparing forecasts directly. So taking the median or the mean of the forecasts made was less preferable.

Second, what I cared about here was the uncertainty of questions at time of writing, not at time of prediction. This is much better tracked by opening time than by proxy on the forecasts.

Third, there was the question of data availability. Both Metaculus and PredictionBook publish opening/resolution times, but PredictionBook has no clear distinction between closing and resolution time (there is, however, a distinction between effective resolution time and planned resolution time ("When was the question resolved?" vs. "When should the question have been resolved?")).

### Analysis

First, the dataset of individual forecasts had to be grouped by the question ID, in both cases a positive integer. The resulting data structure should have the structure

```
[[id open-resolve-timediff [outcomes] [forecasts] [forecast-resolve-timediffs]]*]
```

where the splat just indicates the inner list can be repeated. This is achieved by first finding the grouping of forecasts by question ID, then concatenating the ID, the question range, the list of outcomes, the list of forecasts and the list of forecast ranges:

```
metquestions::{(*x@0),(*x@1),2_x}'+'(+metdata)@=*metdata
pbquestions::{(*x@0),(*x@1),2_x}'+'(+pbdata)@=*pbdata
```
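In Python, the same grouping could look like the following sketch. `group_by_question` is a hypothetical helper that expects rows of the form `(id, questionrange, result, probability, range)` and relies on dictionaries keeping insertion order (Python 3.7 and later).

```python
def group_by_question(rows):
    """Group forecast rows into one entry per question of the form
    [id, questionrange, [outcomes], [forecasts], [ranges]]."""
    grouped = {}
    for qid, qrange, result, prob, rng in rows:
        entry = grouped.setdefault(qid, [qid, qrange, [], [], []])
        entry[2].append(result)
        entry[3].append(prob)
        entry[4].append(rng)
    return list(grouped.values())

# Made-up rows: two forecasts on question 1, one on question 2
rows = [
    (1, 1000.0, 0.0, 0.3, 900.0),
    (2, 5000.0, 1.0, 0.7, 4000.0),
    (1, 1000.0, 0.0, 0.2, 100.0),
]
print(group_by_question(rows))
```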

Strictly speaking, the outcomes could be a single element, since for every question there is only one well-defined outcome, but this makes it easier to later compute the brier score. Showcase:

```
metquestions@10
[474 497590.0 [0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0] [0.79 0.8 0.99 0.8 0.8 0.65 0.65 0.8 0.8 0.81 0.81 0.7] [249575.65223908424 249548.86438822746 245775.7940876484 242420.23024630547 230434.71577501297 230276.97260832787 230111.41609930992 229967.06126213074 216594.73318576813 207687.5192539692 177898.677213192 151590.6441845894]]
brier@(metquestions@10)@[2 3]
0.62095
```

One can now also see how many questions there are in the two datasets (with the relatively unsurprising result that PredictionBook has many more resolved questions):

```
#metquestions
557
#pbquestions
13356
```

The next step involves computing the Brier score for the forecasts on each question:

```
metqbrier::{(x@1),brier(x@2;x@3)}'metquestions
pbqbrier::{(x@1),brier(x@2;x@3)}'pbquestions
```

`metqbrier` is a list that contains sublists, one for each question, the sublist containing the range for the question and the brier score for all predictions on the question (`pbqbrier` has the same structure).
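As a cross-check, here is a Python sketch of the same step. `question_briers` is a hypothetical helper that expects entries shaped `[id, questionrange, [outcomes], [forecasts], [ranges]]`; the example reuses the outcomes and forecasts of question 474 from the showcase above, with the forecast ranges left out because they are not needed for the score.

```python
def question_briers(questions):
    """Pair each question's range with the Brier score of all its forecasts."""
    pairs = []
    for _, qrange, outcomes, forecasts, _ in questions:
        score = sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)
        pairs.append((qrange, score))
    return pairs

# Question 474 from the showcase above (forecast ranges omitted)
q474 = [474, 497590.0, [0.0] * 12,
        [0.79, 0.8, 0.99, 0.8, 0.8, 0.65, 0.65, 0.8, 0.8, 0.81, 0.81, 0.7],
        []]
print(question_briers([q474]))  # score is about 0.62095, as in the showcase
```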

### Results

Again I use linear regressions, correlation coefficients and scatter plots to inadequately analyze the data.

For accuracy between questions, the results were pretty surprising:

```
cor@+metqbrier
-0.00994020489696282446
cor@+pbqbrier
-0.051808239905807497
lreg(metqbrier)
[-0.0000000000601753889846147087 0.175130112661923862]
lreg(pbqbrier)
[-0.000000000249291592263056412 0.195254764708843302]
```

For Metaculus, the slope of the linear regression is approximately $-6 \cdot 10^{-11}$, compared with $1.7 \cdot 10^{-10}$ for the slope of the linear regression between forecasts: the slope is less steep, and negative rather than positive. For PredictionBook, the slope of the linear regression is approximately $-2.5 \cdot 10^{-10}$, compared with $-1 \cdot 10^{-10}$ for the data between forecasts, so it is slightly steeper.

In both cases, there was a negative correlation between the brier score and the range (to be precise, the higher the range, the lower the brier score/the higher the accuracy). For the Metaculus data, this effect was not as pronounced as for the PredictionBook data, though both correlations were quite weak. The two linear regressions also showed the same effect (lower accuracy at shorter ranges/higher accuracy at higher ranges), but again the slope of the linear regression was not very steep.

And now: linear regressions and scatterplots!

The following are scatterplots with range on the X-axis and accuracy (calculated using the Brier score) on the Y-axis. Again, red dots/lines are for Metaculus data, and blue dots/lines are for PredictionBook data.

*Scatterplot with linear regression for Metaculus & PredictionBook question accuracy by range*

The general trend seems to be: questions with a higher range tend to receive forecasts that have a higher accuracy than questions with a lower range. In itself, this is already a fascinating finding, and might explain some of the effect seen with accuracy between forecasts in the previous section). On the other hand, the data is still very noisy, the correlations found are quite weak, and the slopes of the linear regressions are are very near 0.

All in all, it's plausible that the relation of range and accuracy between questions explains a large part of the weird relation between accuracy and range between forecasts, but I don't know enough statistics to tease these apart exactly. My intuition tells me that the effect on accuracy between questions is too small to explain the whole anomaly between forecasts.

## Accuracy Within Questions

If there exists any bias in regard to what kinds of questions get asked in relation to their range, how can we correct for this bias?

One approach could be to compare very similar questions, such as only questions about artificial intelligence, the cost & speed of gene sequencing or autonomous cars, and examine the relation of range and accuracy within these categories. This might eliminate bias resulting from questions in different kinds of domains being easier or harder to forecast.

Here, I take a simpler approach. I examine the relation of range and accuracy within questions; are forecasts made on the same question later generally more accurate than forecasts made on a question earlier?

### Analysis

In order to do this, one can compute the Brier score for each prediction, and then perform one linear regression/compute the correlation per question to discern whether the relation is positive or not.

With `metquestions` and `pbquestions`, we already have the necessary data available to perform the analysis.

We can create a list of the form `[[[ranges][brier_scores]]*]`:

```
wmetqbrier::{(,x@4),,((x@2)-x@3)^2}'metquestions
wpbqbrier::{(,x@4),,((x@2)-x@3)^2}'pbquestions
```

Since `lreg` can't deal with datasets of size 1, we have to filter those out of `wpbqbrier` (they don't make much sense in our analysis either):

```
wpbqbrier::flr({1<#*x};wpbqbrier)
```

One can play around and calculate the correlation between range and accuracy for every question:

```
4#{cor@x}'wmetqbrier
[0.763628932400678817 0.46136759691608953 -0.139435096904356686 -0.882370278576558711]
4#{cor@x}'wpbqbrier
[-1.0 -1.0 -1.0 -1.0]
```

The perfect negative correlation comes from the fact that the first questions in the dataset have only two predictions, which all by chance anti-correlate with the range. This is not the case for all questions:

```
#'*'4#wpbqbrier
[2 2 2 2]
4#|{cor@x}'wpbqbrier
[0.89254316971805467 1.0 0.950010315421882544 0.801892491489916431]
```

However, we won't be using that information here.

For the linear regression, one needs to convert the data for each question from two lists of values into x/y pairs:

```
wmetqbrier::+'wmetqbrier
wpbqbrier::+'wpbqbrier
```

One can then compute the linear regression on the datasets for each question:

```
4#lreg'wmetqbrier
[[0.0000000011767800408891019 0.00710923979466614] [0.000000000617896058360111251 -0.00048849375750407371] [-0.00000000341699728537160739 0.174909387397966508] [-0.0000000862529408862358451 2.97369568231620124]]
4#lreg'wpbqbrier
kg: error: plus: type error: [:undefined :undefined]
```

The result for `wpbqbrier` is unexpected. The culprit turns out to be a question on which the same prediction has been made twice, at the exact same second, which confuses the linear regression algorithm:

```
wpbqbrier@[1381]
[[[35029.0 0.09] [35029.0 0.09]]]
```

One can find that there are several such datapoints:

```
flr({((*x)~x@1)&2=#x};wpbqbrier)
[[[35029.0 0.09] [35029.0 0.09]] [[21824.0 0.0625] [21824.0 0.0625]] [[21804.0 0.0025] [21804.0 0.0025]] [[31684.0 0.04] [31684.0 0.04]] [[31793.0 0.25] [31793.0 0.25]] [[31717109.0 0.2025] [31717109.0 0.2025]] [[31717385.0 0.16] [31717385.0 0.16]] [[31717456.0 0.0225] [31717456.0 0.0225]] [[31717127.0 0.2025] [31717127.0 0.2025]] [[31717353.0 0.0225] [31717353.0 0.0225]] [[31717361.0 0.25] [31717361.0 0.25]]]
```

However, they can be filtered out pretty easily:

```
wpbqbrier::flr({(~(*x)~x@1)|2<#x};wpbqbrier)
#wpbqbrier
7596
```

### Result

We can now visualise the linear regression for each question by setting it to zero outside the range of the oldest and newest chunks:

```
sketch::{q::x;
setrgb(.rn();.rn();.rn());
pltr::{:[(x>**q)|x<**|q;0;lr(x;lreg(q))]};
plot(pltr)}
sketch'pchsmetq
```

*Linear regressions for the accuracy of questions by range (only Metaculus data).*

The vertical bars are artifacts stemming from the fact that Klong attempts to make the discontinuous function continuous, connecting 0 and the linear regression.

Although the plot is kind of cool to look at, I'm not really sure what
it can tell us. My *guess* would be that it somewhat shows a trend
with higher ranges corresponding to higher Brier scores (and therefore
lower accuracy).

We can test whether this suspicion is actually correct by calculating the average offset and the average ascension (the slope) over all per-question regressions – if the average ascension is positive, our suspicion is confirmed.

```
mu'+lreg'wmetqbrier
[0.0000000343952235958370255 0.0375813222857496718]
mu'+lreg'wpbqbrier
[0.0000359749784304593045 -238.963047593209695]
```

So it is true that accuracy within questions *generally* is higher with
lower range for Metaculus data. Everything else would have been surprising.

*Mean of linear regressions on accuracy within questions (red is Metaculus data, blue is PredictionBook data).*
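The averaging step can be sketched in Python as follows. This is an illustration with made-up data, not the code that produced the numbers above; `linreg` is a hand-written least-squares helper and the points are (range, Brier score) pairs.

```python
def linreg(points):
    """Least-squares fit over (x, y) pairs, returned as (slope, intercept)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    slope = sxy / sxx
    return slope, my - slope * mx

# Made-up (range, Brier score) pairs, one list per question
per_question_points = [
    [(100.0, 0.25), (50.0, 0.16), (10.0, 0.04)],
    [(200.0, 0.09), (20.0, 0.01)],
]

fits = [linreg(points) for points in per_question_points]
mean_slope = sum(slope for slope, _ in fits) / len(fits)
mean_intercept = sum(intercept for _, intercept in fits) / len(fits)

# A positive mean slope means: the longer the range, the higher the Brier
# score, i.e. forecasts made closer to resolution are more accurate.
print(mean_slope, mean_intercept)
```

An unweighted mean like this gives a question with two forecasts the same influence as one with a hundred, so a few questions with extreme fits can dominate the average.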

### Sample Sizes

One might, again, be interested in the sample sizes. How many predictions do questions receive?

As we know, the Metaculus dataset contains predictions on 557 questions, the PredictionBook dataset 13356, but there are way fewer questions with more than 1 unique prediction in the PredictionBook dataset:

```
#metquestions
557
#pbquestions
13356
#wmetqbrier
557
#wpbqbrier
7596
```

Let's first create sorted lists containing the numbers of forecasts on each question:

```
metlens::#'wmetqbrier
metlens::metlens@<metlens
pblens::#'wpbqbrier
pblens::pblens@<pblens
```

One can now look at some central values for those datasets: the minimum, maximum, mean, median, and mode:

```
&/metlens
2
|/metlens
101
mu(metlens)
86.8348294434470377
M(metlens)
101
mo(metlens)
[101]
&/pblens
2
|/pblens
99
mu(pblens)
5.07253817798841496
M(pblens)
3.0
mo(pblens)
[2]
```
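The same summary statistics can be computed with Python's standard `statistics` module. The list `lens` below is made-up toy data standing in for `metlens` or `pblens`.

```python
import statistics

# Made-up numbers of forecasts per question
lens = sorted([2, 101, 3, 101, 2, 5, 101])

summary = {
    "min": min(lens),
    "max": max(lens),
    "mean": statistics.mean(lens),
    "median": statistics.median(lens),
    # multimode returns every most-common value, like Klong's mo
    "mode": statistics.multimode(lens),
}
print(summary)
```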

This is – surprising, to say the least. Metaculus makes creating
new questions much harder, and more strongly encourages users to
predict on existing questions, with an elaborate tagging system for
questions. PredictionBook on the other hand simplifies the question
creation process, leaving out moderation, complex resolution criteria
etc. Still, I'm surprised – there must be at least *one* PredictionBook
question popular enough for 100 forecasts! But apparently not.

#### Interlude: It's Under 102

One result here is, to say the least, confusing – where did
all those 101s come from in the Metaculus data‽ Surely,
there are questions with more than 101 forecasts (I **know**
there are: this question about 2016 being the warmest year on
record has 765 forecasts)!

I initially suspected a bug in my code, but to my surprise, after further investigation, it turns out that the Metaculus API returns timeseries with elements removed so that the length is always at most 101.

I can think of two reasons to do this:

- Metaculus wants to prevent other entities from using the predictions to create stronger forecasting algorithms that could rival the Metaculus algorithm
- It was programmed in as a hard limit when Metaculus wasn't as big as it is now, and never changed

I mailed the support address on the site, asking for a full timeseries on resolved binary questions.

After the support address had not responded to my inquiry, I contacted one of the admins of the site on the Discord, but was informed that updating the API would be too difficult to do (which is understandable, the Metaculus developers do not exist to cater to my whims, and are doing a phenomenal job). So, unfortunately, I'll have to postpone a more complete analysis to later.

## Conclusion

Using two datasets with ~45k predictions each, having ranges between 1 day and 10 years (thereby containing forecasts with short and medium range), I have investigated the relation between the accuracy of predictions and their range (that is, the time between the prediction being made and the result of the prediction being known).

I have found that the data indicates three facts:

1. For predictions made on any question, the predictions made a long time before their resolution are generally more accurate than predictions made a shorter time before their resolution. This can be partially, but not completely explained by fact 2.
2. Questions with a longer range (that is, time between the question being written and the question being resolved) generally receive predictions with a higher accuracy than questions with a shorter range.
3. Predictions made on the same question earlier are generally less accurate than predictions that are made later.

These results vary strongly between Metaculus and PredictionBook, with observations 1. and 2. much weaker or non-existent in PredictionBook data (observation 3. only holds for Metaculus, because there are no questions on PredictionBook with enough forecasts to run the analysis).

These results suggest what to expect with questions with even greater range: that later predictions on them will generally be more accurate, and that the kinds of questions asked with a very high range might engender predictions with an even higher accuracy than questions with short and medium ranges.

However, there are plausible reasons to expect the trend from 1. and 2. to reverse: The questions asked with very high range are not very different from questions with medium range, and have a lot less information available to make useful predictions on them; butterfly effects start kicking in in systems that are relatively slow moving on human timescales (thus easier to predict on medium timescales), but nearly completely random at the scale of decades and/or centuries; the questions asked about longer timescales are of a different kind and much less predictable.

I hope to update this analysis in the future, when data from predictions with higher ranges has become available, and to check whether the findings in this analysis continue to be correct.

## Acknowledgements

I am grateful to Nuño Sempere for pointing out a fatal flaw in my previous version of this analysis, which caused me to rewrite it nearly completely.

## Miscellaneous

The code for image generation can be found here, the complete code for analyzing the data can be found here.

## 17 comments

Comments sorted by top scores.

## comment by CharlesD · 2021-05-04T14:46:35.127Z · LW(p) · GW(p)

Thanks for a great post! I have a concern about your sample sizes however.

I am looking into similar questions myself, and while reading your post I was surprised to see your Metaculus sample claimed as 45k predictions. These are not actually individual predictions, but rather the time series of community predictions, which are much less information dense, as this is just the median of the recent community predictions at that time and typically a new prediction will have a small effect on this value. I think claiming the sample size is 45k is therefore a bit misleading.

It also has the effect of linearly weighting Metaculus questions by community interest, which is not obviously a desirable method (this is mitigated by the cap on the time series length to 101, which means the effect will ultimately be small as you have 557 Metaculus questions, implying your average question must have more than 80 predictions)

Replies from: niplav

## ↑ comment by niplav · 2021-05-04T22:53:35.764Z · LW(p) · GW(p)

Yep, I share your concerns! I wanted to include them in the post, but then I got busy. Perhaps I'll update it in the foreseeable future (no promises however, I'm pretty busy with other things). Maybe I'll just put a warning at the top of the article.

And, in case you publish your stuff, I'd love to read it.

## comment by NunoSempere (Radamantis) · 2020-11-18T17:38:19.521Z · LW(p) · GW(p)

Why do some forecasts have negative ranges?

Replies from: niplav, Radamantis

## ↑ comment by niplav · 2020-11-22T10:19:23.108Z · LW(p) · GW(p)

On Metaculus: I assume that these are forecasts on questions that resolved retroactively. Examples:

- Will Iran execute or be targeted in a national military attack between 6 June 2019 and 5 October 2019?
- https://www.metaculus.com/questions/3756/will-ea-global-san-francisco-be-cancelled-or-rescheduled-due-to-covid-19/

For PredictionBook: The datetime of resolution seems to be the datetime of the first attempted resolution, not the last. Example: Total deaths due to coronavirus in the Netherlands will go over >5000 by the end of April.

I think I might change the PredictionBook data fetching script to output the datetime of the last resolution.

## ↑ comment by NunoSempere (Radamantis) · 2020-11-18T17:52:25.592Z · LW(p) · GW(p)

So I was trying to adjust for longer term questions being easier by doing the following:

- For each question, calculate the average Brier score for available predictions
- For each prediction, calculate the accuracy score as Brier score - average Brier scores of the question.

Correlate accuracy score with range. So I was trying to do that, and I thought, well, I might as well run the correlation between accuracy score and log range. But then some of the ranges are negative, which shouldn't be the case.

Replies from: Radamantis

## ↑ comment by NunoSempere (Radamantis) · 2020-11-18T17:55:11.898Z · LW(p) · GW(p)

Anyways, if I adjust for question difficulty, results are as you would expect; accuracy is worse the further removed the forecast is from the resolution.

## comment by NunoSempere (Radamantis) · 2020-11-18T17:17:35.163Z · LW(p) · GW(p)

Nice post! I agree that the conclusion is counterintuitive.

For Metaculus, the results are pretty astonishing: the correlation is negative for all four options, meaning that the higher the range of the question, the lower the Brier score (and therefore, the higher the accuracy)! And the correlation is extremly low either: -0.2 is quite formidable.

I tried to replicate some of your analysis, but I got different results for Metaculus (I still got the negative correlation for PredictionBook, though). I think this might be to an extent an artifact of the way you group your forecasts:

In bash, add headers, so that I can open the files and see how they look

```
$ echo "id,questionrange,result,probability,range" > met2.csv
$ cat met.csv >> met2.csv
$ echo "id,questionrange,result,probability,range" > pb2.csv
$ cat pb.csv >> pb2.csv
```

In R:

```
library(ggplot2)
## Metaculus
data <- read.csv("met2.csv")
data$brier = (data$result-data$probability)^2
summary(lm(data$brier ~ data$range)) ## Positive correlation.
ggplot(data=data, aes(x=range, y=brier))+
geom_point(size=0.1)
### Normalize the range and the brier to get better units
data$briernorm = (data$brier - mean(data$brier))/sd(data$brier)
data$rangenorm = (data$range - mean(data$range))/sd(data$range)
summary(lm(data$briernorm ~ data$rangenorm))
### I get a correlation of ~0.02, on a standard deviation of 1, i.e., a correlation of 2%.
## Same thing for PredictionBook
data <- read.csv("pb2.csv")
data$brier = (data$result-data$probability)^2
summary(lm(data$brier ~ data$range)) ## Negative correlation.
ggplot(data=data, aes(x=range, y=brier))+
geom_point(size=0.2)
### Normalize the range and the brier to get better units
data$briernorm = (data$brier - mean(data$brier))/sd(data$brier)
data$rangenorm = (data$range - mean(data$range))/sd(data$range)
summary(lm(data$briernorm ~ data$rangenorm))
### I get a correlation of ~-0.02, on a standard deviation of 1, i.e., a correlation of -2%.
```

Essentially, when you say

To compare the accuracy between forecasts, one can't deal with individual forecasts, only with sets of forecasts and outcomes. Here, I organise the predictions into buckets according to range.

This doesn't necessarily follow, i.e., you can still calculate a regression between Brier score and range (time until resolution).

Replies from: niplav, Radamantis, Radamantis

## ↑ comment by niplav · 2020-12-19T13:17:13.558Z · LW(p) · GW(p)

Okay, I finally had some time to look at your feedback.

The problem is, as you said, my attempt to bucket predictions together by range. This removes data, and makes my analysis much more complicated than it needs to be.

I thought that bucketing was a good idea because I was not sure how meaningful a brier score on only one forecast & outcome variable is (I didn't have a very clear idea of why that should be the case, and didn't question that intuition).

Let's say I have my datasets of predictions, outcomes and ranges.

Then your analysis is calculating the correlation between the Brier score of each individual forecast and its range. I introduced a partition of the forecasts into buckets and calculated the correlation between the Brier score of each bucket and the mean range of that bucket.

This throws away information: if one puts all forecasts into a single bucket, then one gets one Brier score (of all forecasts & outcomes), and the average of all ranges, which results in a correlation of 1 (I haven't proven that partitioning more roughly loses data monotonically, but it seems intuitively true to me).
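To make the information loss concrete, here is a small Python sketch of bucketing by range. It is an illustration under my own assumptions: fixed-width buckets, the function name `bucket`, and the toy numbers are all made up and are not taken from the original analysis.

```python
from collections import defaultdict

def bucket(ranges, briers, width):
    """Group forecasts into range buckets of the given width.

    Returns one (mean range, mean Brier score) pair per non-empty bucket,
    ordered by range.
    """
    groups = defaultdict(list)
    for r, b in zip(ranges, briers):
        groups[int(r // width)].append((r, b))
    result = []
    for key in sorted(groups):
        members = groups[key]
        mean_range = sum(r for r, _ in members) / len(members)
        mean_brier = sum(b for _, b in members) / len(members)
        result.append((mean_range, mean_brier))
    return result

# Made-up per-forecast data
ranges = [1, 2, 3, 11, 12, 13, 21, 22, 23]
briers = [0.5, 0.0, 0.4, 0.1, 0.6, 0.2, 0.3, 0.0, 0.3]

print(len(ranges))                   # 9 data points without bucketing
print(bucket(ranges, briers, 10))    # 3 data points with buckets of width 10
print(bucket(ranges, briers, 1000))  # 1 data point: everything in one bucket
```

Any correlation computed on the bucketed output sees only the bucket means, so the variation of Brier scores inside each bucket no longer contributes.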

If I repeat your analysis, I get the results you got.

Basically, I believe my text lacks internal validity, but still has construct validity.

Starting from here, I will probably rewrite large parts of the text (and the code, maybe even in a more understandable language) and apply your analysis by removing the bucketing of data.

Replies from: Radamantis

## ↑ comment by NunoSempere (Radamantis) · 2020-12-26T18:06:43.522Z · LW(p) · GW(p)

Cool. Once you rewrite that, and if you do so before the end of the year, I'd encourage you to resubmit it to this contest [LW · GW].

In particular, the reason I'm excited about this kind of work is because it allows us to have at least some information about how accurate long-term predictions can be. Some previous work on this has been done, e.g., rating Kurzweil's predictions from the 90s [LW · GW] but overall we have very little information about this kind of thing. And yet we are interested in seeing how good we can be at making predictions n years out, and potentially making decisions based on that.

## ↑ comment by NunoSempere (Radamantis) · 2020-11-18T17:58:25.057Z · LW(p) · GW(p)

Another interesting thing you can do is to calculate the accuracy score (Brier score - average of the Brier scores for the question), which adjusts for question difficulty. You gesture at this in your "Accuracy between questions" section.

If you do this, forecasts made further from the resolution time do worse, both in PredictionBook and in Metaculus (correlation is p<0.001, but very small). Code in R:

```
datapre <- read.csv("pb2.csv") ## or met2.csv
data <- datapre[datapre$range>0,]
data$brier = (data$result-data$probability)^2
accuracyscores = c() ## Lower is better, much like the Brier score.
ranges = c()
for(id in unique(data$id)){
predictions4question = (data$id == id)
briers4question = data$brier[predictions4question]
accuracyscores4question = briers4question - mean(briers4question)
ranges4question = data$range[predictions4question]
accuracyscores=c(accuracyscores,accuracyscores4question)
ranges=c(ranges, ranges4question)
}
summary(lm(accuracyscores ~ ranges))
```

## ↑ comment by NunoSempere (Radamantis) · 2020-11-18T17:21:09.951Z · LW(p) · GW(p)

Another interesting thing you can do with the data is to calculate the prior probability that a Metaculus or PB question will resolve positively:

```
data <- read.csv("met2.csv") ## or pb2.csv
data$brier = (data$result-data$probability)^2
results = c()
for(id in unique(data$id)){
predictions = ( data$id == id )
result = data$result[predictions[1]]
results = c(results, result)
}
mean(results)
```

For Metaculus, this is 0.3160874, for PB this is 0.3770311

## comment by NunoSempere (Radamantis) · 2020-11-18T17:13:30.343Z · LW(p) · GW(p)

Nitpicks:

- Some typos: ones => one's; closed questions (questions that haven't yet been resolved, but that can still be predicted on) => closed questions (questions that haven't yet been resolved, but that can't be predicted on); PredictionPook => PredictionBook
- You don't clearly say when you start using Klong. Klong also sounds like it might be really fun to learn, but it's maybe a little suboptimal for replication purposes, because it isn't as well-known.

## ↑ comment by niplav · 2020-11-22T10:22:13.757Z · LW(p) · GW(p)

I'll correct the typos.

As for Klong, I agree that it's not an optimal choice. I started this as a hobby project, and for odd reasons, it seemed to be the best tool at my disposal. I'll add a sentence that explains when I start using it, and will maybe try to replicate the analysis in a language that can be understood by more than a couple hundred people on the planet.

## comment by adamShimi · 2020-11-17T12:27:27.762Z · LW(p) · GW(p)

This is an amazing post. I know next to nothing about forecasting, but you recapped the big ideas and concepts, explained everything in a reproducible way, and even made interesting observations about the data. Thanks a lot!

Replies from: niplav