Throw a prediction party with your EA/rationality group

eukaryote

Throw a prediction party with your EA/rationality group

post by eukaryote · 2016-12-31T23:02:11.284Z · LW · GW · Legacy · 14 comments

14 comments

TL;DR: Prediction & calibration parties are an exciting way for your EA/rationality/LessWrong group to practice rationality skills and celebrate the new year.

On December 30th, Seattle Rationality had a prediction party. Around 15 people showed up, brought snacks, brewed coffee, and spent several hours making predictions for 2017, and generating confidence levels for those predictions.

This was heavily inspired by Scott Alexander’s yearly predictions. (2014 results, 2015 results, 2016 predictions.) Our move was to turn this into a communal activity, with a few alterations to meet our needs and make it work better in a group.

Procedure:

Each person individually writes a bunch of predictions for the upcoming year. They can be about global events, people’s personal lives, etc.
- If you use Scott Alexander’s system, create 5+ predictions each for fixed confidence levels (50%, 60%, 70%, 80%, 90%, 95%, etc.)
- If you want to generate Brier scores or logarithmic scores, just do 30+ predictions at whatever confidence levels you believe.
Write down confidence levels for each prediction.
Save your predictions and put it aside for 12 months.
Open up your predictions and see how everyone did.

To make this work in a group, we recommend the following:

Don’t share your confidence intervals. Avoid anchoring by just not naming how likely or unlikely you think any prediction is.
Do share predictions. Generating 30+ predictions is difficult, and sharing ideas (without confidence levels) makes it way easier to come up with a bunch. We made a shared google doc, and everyone pasted some of their predictions into it.
Make predictions that, in a year, will verifiably have happened or not. (IE, not “the academic year will go well”, which is debatable, but “I will finish the year with a 3.5 GPA or above”.)
It’s convenient to assume that unless stated otherwise predictions that end by the next year (IE, "I will go to the Bay Area" means "I will go to the Bay Area at least once in 2017.") It’s also fine to make predictions that have other end dates (“I will go to EA Global this summer.”)
Make a bunch of predictions first without thinking too hard about how likely they are, then assign confidence levels. This post details why. You could also generate a group list of predictions, and everyone individually lists their own confidence levels.

This makes a good activity for rationality/EA groups for the following reasons:

Practicing rationality skills:
- Making accurate predictions
- Using confidence intervals
Accessibility
- It’s open to many different knowledge levels. Even if you don’t know a thing about geopolitics, you can still give predictions and confidence intervals about media, sports, or your own life.
- More free-form and less intimidating than using a prediction market. You do not have to know about the details of forecasting to try this.
Natural time and recurring activity
- You could do this at any point during the year, but doing it at the start of the year seems appropriate for ringing in the new year.
- In twelve months, you have an automatic new activity, which is coming back together and checking everybody’s predictions from last year. Then you make a new set of predictions for next year. (If this falls through for some reason, everyone can, of course, still check their predictions on their own.)
Fostering a friendly sense of competitiveness
- Everyone wants to have the best calibration, or the lowest Brier score. Everyone wants to have the most accurate predictions!

Some examples of the predictions people used:

Any open challenges from the Good Judgment Project.
I will switch jobs.
I will make more than $1000 money in a way that is different from my primary job or stock.
I will exercise 3 or more times per week in October, November, December.
I’ll get another tattoo.
Gay marriage will continue to be legal in Washington state.
Gay marriage will continue to be legal in all 50 states.
I will try Focusing at least once.
I will go to another continent.
CRISPR clinical trials will happen on humans in the US.
A country that didn’t previously have nuclear weapons will acquire them.
I will read Thinking Fast and Slow.
I will go on at least 3 dates.

Also relevant:

16 types of useful predictions
Brier values and graphs of ‘perfect’ vs. actual scores will give you different information. Yvain writes about the differences between these. Several of us did predictions last year using the Scott Alexander method (bins at fixed probabilities), although this year, everybody seems to have used continuous probabilities. The exact method by which we’ll determine how well-calibrated we were will be left to Seattle Rationality of 2018, but will probably include Brier values AND something to determine calibration.

14 comments

Comments sorted by top scores.

comment by ChristianKl · 2017-01-02T14:36:26.669Z · LW(p) · GW(p)

If you use Scott Alexander’s system, create 5+ predictions each for fixed confidence levels (50%, 60%, 70%, 80%, 90%, 95%, etc.)

Why? It seems to me more productive to try to find confidence levels for predictions than to find predictions for confidence levels.

Additionally the book Superforecasting suggets that Superforcasters are people who make more granular predictions then 50%, 60%, 70%, 80%, 90%, 95%, etc.

comment by Douglas_Knight · 2017-01-01T18:27:04.149Z · LW(p) · GW(p)

You can do calibration and accuracy. You can start with predictions of arbitrary granularity and then force them into whatever boxes you want.

For calibration, it isn't very useful to score events at 50%. Instead of making boxes of 50, 60, 70, 80, 90, 95, 99%, you should instead do something like 55, 70, 80, 90, 95, 99%. Taking an event that you "really" think is 50/50 and forcing yourself to choose a side to make it 45/55 is no worse than taking an event that you think is 45/55 and forcing it to be either 50 or 60%.

Also, the jump from 95 to 99 is pretty big. Better to add an intermediate category of 97 or 98. Or just replace 99 with 98.

I think 60, 80, 90, 95, 98 would be a good set of bins for beginners.

Replies from: Unnamed, ChristianKl

↑ comment by Unnamed · 2017-01-01T19:45:24.965Z · LW(p) · GW(p)

50% predictions can be useful if you are systematic about which option you count as "yes". e.g., "I estimate a 50% chance that I will finish writing my book this year" is a meaningful prediction. If I am subject to standard biases, then we would expect this to have less than a 50% chance of happening, so the outcomes of predictions like this provide a meaningful test of my prediction ability.

2 conventions you could use for 50% predictions: 1) pose the question such that "yes" means an event happened and "no" is the default, or 2) pose the question such that "yes" is your preferred outcome and "no" is the less desirable outcome.

Actually, it is probably better to pick one of these conventions and use it for all predictions (so you'd use the whole range from 0-100, rather than just the top half of 50-100). "70% chance I will finish my book" is meaningfully different than "70% chance I will not finish my book"; we are throwing away information about possible miscalibrated by treating them both merely as 70% predictions.

Even better, you could pose the question however you like and also note when you make your prediction 1) which outcome (if either) is an event rather than the default and 2) which outcome (if either) you prefer. Then at the end of the year you could look at 3 graphs, one which looks at whether the outcome that you considered more likely occurred, one that looks at whether the (non-default) event occurred, and one which looks at whether your preferred outcome occurred.

Replies from: Douglas_Knight, Douglas_Knight, eukaryote

↑ comment by Douglas_Knight · 2017-01-02T18:40:45.319Z · LW(p) · GW(p)

Sorry, I misread your comment originally. You were careful to say that you were talking about 3 different biases, while most people say that there is a right way to orient each question.

But you weren't careful to say that calibration — the measure of over- and under-confidence — is different from bias. There are four questions here. Introducing new questions that make sense at 50% is irrelevant to the fact that calibration doesn't make sense at 50%. If we are just doing calibration, some of our tests are wasted. If we add a test of a bias, that part of the calibration test is still wasted. If we force the bin away from 50%, then that improves the calibration test. Moreover, I don't think that it harms the test of bias.

Ideally, we would look at everything, but is it worth the effort? If we start with one thing, what is most important? I think that overconfidence is the biggest problem and one should start there. In some sense the annotations you suggest are not much more work, but in making the difference between doing and not doing, I think small increments matter.

(While most people are overconfident and calibration exercises are mainly about reducing overconfidence, the problem of 50% is actually a problem of underconfidence.)

↑ comment by Douglas_Knight · 2017-01-01T23:42:19.342Z · LW(p) · GW(p)

Most questions don't have a preferred direction. Look at Scott's predictions. Which direction should you point each one?

Most people don't make enough predictions to get a statistically significant difference between the two sides of the scale. And even if they do, their bias to the extremes ("overconfidence") swamps the effect.

Replies from: Unnamed

↑ comment by Unnamed · 2017-01-02T00:57:47.357Z · LW(p) · GW(p)

Just looking at the 50% questions, here is how I would score 1) if either direction is an event rather than the default and 2) if either direction is probably preferred by Scott:

US unemployment to be lower at end of year than beginning: 50%

Neither direction is an event, Yes is preferred.

SpaceX successfully launches a reused rocket: 50%