Test Your Calibration!
post by alyssavance · 2009-11-11T22:03:38.439Z
In my journeys across the land, I have, to date, encountered four sets of probability calibration tests. (If you just want to make bets on your predictions, you can use Intrade or another prediction market, but these generally don't record calibration data, only which of your bets paid out.) If anyone knows of other tests, please do mention them in the comments, and I'll add them to this post. To avoid spoilers, please do not post what you guessed for the calibration questions, or what the answers are.
The first, to boast shamelessly, is my own, at http://www.acceleratingfuture.com/tom/?p=129. My tests use fairly standard trivia questions (samples: "George Washington actually fathered how many children?", "Who was Woody Allen's first wife?", "What was Paul Revere's occupation?"), with an emphasis on history and pop culture. The quizzes are scored automatically, and you choose whether to assign a probability of 96%, 90%, 75%, 50%, or 25% to your answer. There are five quizzes with fifty questions each: Quiz #1, Quiz #2, Quiz #3, Quiz #4 and Quiz #5.
The second is a project by John Salvatier (LW account) of the University of Washington, at http://calibratedprobabilityassessment.org/. There are three sets of fifty questions each: two sets of general trivia, and one set of questions about relative distances between American cities (a fourth set, unfortunately, does not appear to be working at this time). The questions do not rotate, but are re-ordered upon refreshing. The probabilities are again multiple choice, with ranges of 51-60%, 61-70%, 71-80%, 81-90%, and 91-100%, for whichever answer you think is more probable. These quizzes are also scored by computer, but instead of spitting back numbers, the computer generates a graph showing the discrepancy between your actual accuracy rate and your claimed accuracy rate. Links: US cities, trivia #1, trivia #2.
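As a rough illustration of what this kind of scoring involves, here is a minimal sketch (my own, not Salvatier's actual code) that bins recorded answers by claimed probability and compares claimed accuracy against actual accuracy:

```python
from collections import defaultdict

def calibration_table(answers):
    """answers: list of (claimed_probability, was_correct) pairs,
    e.g. [(0.75, True), (0.55, False), ...]"""
    bins = defaultdict(lambda: [0, 0])  # claimed -> [n_correct, n_total]
    for claimed, correct in answers:
        bins[claimed][0] += int(correct)
        bins[claimed][1] += 1
    for claimed in sorted(bins):
        n_correct, n_total = bins[claimed]
        print(f"claimed {claimed:.0%}: actual {n_correct / n_total:.0%} "
              f"({n_correct}/{n_total})")

# Example: a slightly overconfident quiz-taker.
calibration_table([(0.95, True), (0.95, False), (0.75, True),
                   (0.75, False), (0.55, True), (0.55, True)])
```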
The third is a quiz by Steven Smithee of Black Belt Bayesian (LW account here) at http://www.acceleratingfuture.com/steven/?p=96. There are three sets, of five questions each, about history, demographics, and Google rankings, and two sets of (non-testable) questions about the future and historical counterfactuals. (EDIT: Steven has built three more tests in addition to this one, at http://www.acceleratingfuture.com/steven/?p=102, http://www.acceleratingfuture.com/steven/?p=106, and http://www.acceleratingfuture.com/steven/?p=136). This test must be graded manually, and the answers are in one of the comments below the test (don't look at the comments if you don't want spoilers!).
The fourth is a website by Tricycle Developments, the web developers who built Less Wrong, at http://predictionbook.com/. You can make your own predictions about real-world events, or bet on other people's predictions, at whatever probability you want, and the website records how often you were right relative to the probabilities you assigned. However, since all predictions are made in advance of real-world events, it may take quite a while (on the order of months to years) before you can find out how accurate you were.
Comments sorted by top scores.
comment by Rune · 2009-11-12T19:55:52.597Z
Advice for future creators of tests: There are people who live outside the US. No one outside the US cares about the 3rd person to be the second dead uncle of the fourth president of the US.
For instance, a majority of tommccabe's quiz questions are highly US-specific.
The point here is that non-Americans will end up guessing on almost all of the questions, which makes the whole exercise painful and useless.
comment by SoerenMind · 2015-05-25T08:15:24.698Z
The best calibration exercises I was able to find, IMO (which also work for non-Americans), can be downloaded from the website of How to Measure Anything.
comment by alyssavance · 2009-11-12T20:36:38.755Z
Noted, but I didn't write those questions; they were taken from the open-source MisterHouse project. If you know of any sources of free trivia questions that aren't US-specific, please do PM me.
comment by Alicorn · 2009-11-12T20:38:50.791Z
It seems like it'd be pretty easy to write your own trivia questions by permitting yourself to surf Wikipedia for a while and extract facts from the articles. What's the advantage to trivia questions you don't write yourself - just speed, or something else too?
comment by alyssavance · 2009-11-12T22:38:59.705Z
Just speed. At two minutes per trivia question, it would take a full working day (250 × 2 = 500 minutes) to make another set of 250.
comment by bentarm · 2009-11-12T00:27:35.541Z
Why isn't there a 33% option for your test? What if I'm pretty certain that 1 of the answers is wrong, but have no clue which of the others is most likely to be right? Then my confidence is exactly 33%, and I have to either overestimate or underestimate it. The 50% and 25% options seem to cover the other two versions of this scenario (I can eliminate either 2 or 0 of the options almost certainly) but this appears to be a gap.
(incidentally, this only occurred to me because it happened to be the case for the first question on the first of your quizzes...)
comment by alyssavance · 2009-11-12T01:22:14.072Z
There probably should be, mea culpa.
comment by gerg · 2009-11-13T03:06:20.503Z
Part of the output of your quizzes is a line of the form "Your chance of being well calibrated, relative to the null hypothesis, is 50.445538580926 percent." How is this number computed?
I chose "25% confident" for 25 questions and got 6 of them (24%) right. That seems like a pretty good calibration ... but 50.44% chance of being well calibrated relative to null doesn't seem that good. Does that sentence mean that an observer, given my test results, would assign a 50.44% probability to my being well calibrated and a 49.56% probability to my not being well calibrated? (or to my randomly choosing answers?) Or something else?
comment by skepsci · 2012-02-28T05:48:53.831Z
It's also completely ridiculous, with a sample size of ~10 questions, to give the success rate and probability of being well calibrated as percentages with 12 decimal places. Since the uncertainty in such a small sample is on the order of several percent, just round to the nearest percent.
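For concreteness (my own back-of-the-envelope check): the binomial standard error on an observed success rate is sqrt(p(1-p)/n), which for a handful of questions dwarfs anything past the first decimal:

```python
import math

def standard_error(p_hat, n):
    """Standard error of an observed success rate p_hat over n questions."""
    return math.sqrt(p_hat * (1 - p_hat) / n)

print(f"{standard_error(0.24, 25):.1%}")  # ~8.5% for 6/25 correct
print(f"{standard_error(0.50, 10):.1%}")  # ~15.8% for 5/10 correct
```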
comment by MTGandP · 2015-01-11T05:57:27.630Z
It probably just computes it as a float and then prints the whole float.
(I do recognize the silliness of replying to a three-year-old comment that is itself replying to a six-year-old comment.)
comment by Soothsilver · 2016-01-18T14:29:48.410Z
It's not silly. I still find these newer comments useful.
comment by MTGandP · 2016-01-29T01:06:55.973Z
And here we are one year later!
comment by Sunny from QAD (Evan Rysdam) · 2019-07-28T13:36:24.489Z
Yes, do it for posterity!
comment by Matteo De Stefano (matteo-de-stefano) · 2022-01-10T09:21:27.671Z
I would like to chime in and point out that, as of today, the domain "acceleratingfuture (dot) com" is owned by a Russian bookmaker.
comment by elazdins · 2021-11-03T06:35:55.844Z
Just launched my own version of a calibration test here: https://calibration.lazdini.lv/. It is pretty much identical to http://confidence.success-equation.com/, except that the questions should be different each time you visit the site, allowing for regular calibration/recalibration. Questions are retrieved from the free API provided by https://opentdb.com/.
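For anyone building something similar: the Open Trivia DB API is a single HTTP GET returning JSON. A rough sketch (check https://opentdb.com/api_config.php for the current parameters):

```python
import html
import json
import urllib.request

# Fetch ten multiple-choice questions from the Open Trivia Database.
url = "https://opentdb.com/api.php?amount=10&type=multiple"
with urllib.request.urlopen(url) as response:
    data = json.load(response)

for item in data["results"]:
    # Questions and answers arrive HTML-entity-encoded by default.
    question = html.unescape(item["question"])
    answer = html.unescape(item["correct_answer"])
    print(question, "->", answer)
```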
comment by jimrandomh · 2009-11-12T02:38:13.469Z
I would like to see a calibration test with open-ended questions rather than multiple choice. Multiple choice makes it easier to judge confidence, but I'm afraid the calibrations won't transfer well to other domains.
(The test-taker would have to grade their own test, since open-ended questions may have multiple answers, and typos and minor variations shouldn't count as errors. But other than that, the test would be pretty much the same.)
comment by Isaac King (KingSupernova) · 2021-10-05T05:00:53.320Z
An open-ended probability calibration test is something I've been planning to build. I'd be curious to hear your thoughts on how the specifics should be implemented. How should they grade their own test in a way that avoids bias and still gives useful results?
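One minimal design I've been sketching (just a sketch, and it doesn't fully solve the bias problem): force the answer and confidence to be committed before the reference answer is revealed, then record the self-grade:

```python
def run_quiz(questions):
    """questions: list of (prompt, reference_answer) pairs."""
    log = []
    for prompt, reference in questions:
        answer = input(f"{prompt}\nYour answer: ")
        confidence = float(input("Confidence (0.0-1.0): "))
        # The reference is shown only after answer and confidence are
        # locked in, so the self-grade can't leak back into either.
        verdict = input(f"Reference answer: {reference}\nWere you right? (y/n): ")
        log.append((answer, confidence, verdict.strip().lower() == "y"))
    return log  # feed (confidence, correct) pairs into a calibration plot
```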
comment by SK2 (lunchbox) · 2010-01-09T10:00:23.770Z
I have seen a problem with selection bias in calibration tests, where trick questions are overrepresented. For example, in this PDF article, the authors ask subjects to provide a 90% confidence interval estimating the number of employees IBM has. They find that fewer than 90% of subjects select a suitable range, which they conclude results from overconfidence. However, IBM has almost 400,000 employees, which is atypically high (more than 4x Microsoft). The results of this study have just as much to do with the question asked as with the overconfidence of the subjects.
Similarly, trivia questions are frequently (though not always) designed to have interesting/unintuitive answers, making them problematic for a calibration quiz where people are expecting straightforward questions. I don't know that to be the case for the AcceleratingFuture quizzes, but it is an issue in general.
comment by Blueberry · 2010-01-09T10:03:41.661Z
That really shouldn't matter. Your calibration should include the chances of the question being a "trick question". If fewer than 90% of subjects give confidence intervals containing the actual number of employees, they're being overconfident by underestimating the probability that the question has an unexpected answer.
comment by SK2 (lunchbox) · 2010-01-09T19:11:33.409Z
Imagine an experiment where we randomize subjects into two groups. All subjects are given a 20-question quiz that asks them to provide a confidence interval on the temperatures in various cities around the world on various dates in the past year. However, the cities and dates for group 1 are chosen at random, whereas the cities and dates for group 2 are chosen because they were record highs or lows.
This will result in two radically different estimates of overconfidence. The fact that the result of a calibration test depends heavily on the questions being asked should suggest that the methodology is problematic.
What this comes down to is: how do you estimate the probability that a question has an unexpected answer? See this quiz: maybe the quizzer is trying to trick you, maybe he's trying to reverse-trick you, or maybe he just chose his questions at random. It's a meaningless exercise because you're being asked to estimate values from an unknown distribution. The only rational thing to do is guess at random.
People taking a calibration test should first see the answers to a sample of the data set they will be tested on.
comment by pengvado · 2010-01-10T01:08:24.869Z
I think the two of you are looking at different parts of the process.
"Amount of trickiness" is a random variable that is rolled once per quiz. Averaging over a sufficiently large number of quizzes will eliminate any error it causes, which makes it a contribution to variance, not systematic bias.
On the other hand, "estimate of the average trickiness of quizzes" is a single question that people can be wrong about. No amount of averaging will reduce the influence of that question on the results, so unless your reason for caring about calibration is to get that particular question right, it does cause a systematic bias when applying the results to every other situation.
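A toy simulation (illustrative numbers only, my own construction) makes the distinction concrete: with a correct prior over trickiness the measured gap shrinks toward zero as quizzes accumulate, while a wrong prior leaves a gap that no amount of averaging removes:

```python
import random

random.seed(0)

def mean_overconfidence(n_quizzes, estimated_trickiness, true_mean=0.10):
    """Subject is right w.p. 0.8 - t on a quiz of trickiness t, but
    reports confidence 0.8 - estimated_trickiness on every question."""
    gaps = []
    for _ in range(n_quizzes):
        t = random.uniform(0.0, 2 * true_mean)  # rolled once per quiz
        hits = sum(random.random() < 0.8 - t for _ in range(50))
        gaps.append((0.8 - estimated_trickiness) - hits / 50)
    return sum(gaps) / len(gaps)

print(round(mean_overconfidence(10_000, 0.10), 3))  # correct prior: ~0.0
print(round(mean_overconfidence(10_000, 0.02), 3))  # wrong prior: ~0.08
```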
comment by JamesAndrix · 2009-11-13T03:17:35.409Z
Wow. I've taken quiz 1 so far, and all my high-confidence answer groups scored much lower than claimed. For now I blame too much experience with easy multiple-choice tests. I only got 19 out of 50 overall.
Another problem, I think:
You marked your answers to 17 questions as '25% accurate'. Out of these, 1 answers were correct, for a success rate of 5.8823529411765 percent.
Now I thought I was choosing 25% when I didn't know the answer, but this seems to indicate that I had some information, and was biased against playing my (sometimes correct) hunches when marking 25%.
comment by steven0461 · 2009-11-12T00:59:38.242Z
> (I believe this is his LW account, but feel free to correct me)
This is my current LW account.
There were sequels to the Aumann game here, here, and here; these have better questions, but the lack of auto-scoring probably makes it not worth the effort.
comment by alyssavance · 2009-11-12T01:27:35.918Z
Added, thanks!
comment by jimmy · 2009-11-12T19:06:13.847Z
If anyone is thinking about creating their own, I would suggest questions with numerical answers, so you can give upper and lower bounds at varying confidence levels, rather than trying to pick your confidence on a binary question and forcing binning or some sort of filtering.
Also, this lets you give several probability estimates for each question.
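A sketch of the scoring this suggests (my own illustration, not a spec): elicit a low/high bound at each confidence level and check how often the intervals cover the true value:

```python
def interval_calibration(responses):
    """responses: list of (confidence, low, high, true_value) tuples,
    e.g. (0.90, 100, 200, 150) for a 90% interval."""
    by_level = {}
    for confidence, low, high, true_value in responses:
        hit = low <= true_value <= high
        n_hit, n_total = by_level.get(confidence, (0, 0))
        by_level[confidence] = (n_hit + hit, n_total + 1)
    for confidence in sorted(by_level):
        n_hit, n_total = by_level[confidence]
        print(f"{confidence:.0%} intervals: {n_hit}/{n_total} "
              f"contained the true value")

interval_calibration([(0.90, 10, 50, 30), (0.90, 100, 200, 250),
                      (0.50, 0, 5, 3)])
```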
comment by aretae · 2009-11-13T17:36:32.667Z
Douglas Hubbard writes on the topic of calibration as well. He focuses on real-world application of this stuff, and calibration is clearly a part of that.
His 1st book: http://www.amazon.com/How-Measure-Anything-Intangibles-ebook/dp/B001BPE8ZQ/ref=sr_1_3?ie=UTF8&s=books&qid=1258133710&sr=8-3
His site: http://www.hubbardresearch.com/dotnetnuke/
comment by gwern · 2011-07-04T00:53:59.069Z
I found How to Measure Anything pretty interesting in its thorough application of calibration and Fermi calculation to all sorts of problems, although I didn't find the digressions into Excel very useful. Definitely recommended if you don't already have the mental knack for Fermi stuff.
comment by A1987dM (army1987) · 2012-02-28T14:17:36.910Z
Question 40 of Salvatier's test:
#40: Which city is closer to Los Angeles, Calif.?
- Phoenix, Ariz.
- Phoenix, Ariz.
Wait... What?
comment by Viliam_Bur · 2012-02-28T14:26:50.128Z
The first one; with probability epsilon.
(alternatively: The second one; with probability epsilon.)
Next question! :D