Calibration Test with database of 150,000+ questions

post by Nanashi · 2015-03-14T11:22:58.943Z · LW · GW · Legacy · 32 comments


Hi all, 

I put this calibration test together this morning. It pulls from a trivia API of over 150,000 questions so you should be able to take this many, many times before you start seeing repeats.

http://www.2pih.com/caltest.php

A few notes:

1. The questions are "Jeopardy"-style questions, so the wording may be strange, and some of them might be impossible to answer without further context. On these, just assign 0% confidence.

2. As the questions are open-ended, there is no answer-checking mechanism. You have to be honest with yourself as to whether or not you got the right answer. Because what would be the point of cheating at a calibration test?

I can't think of anything else. Please let me know if there are any features you would want to see added, or if there are any bugs, issues, etc. 

 

**EDIT**

As per suggestion, I have moved this to the main section.

32 comments

Comments sorted by top scores.

comment by RowanE · 2015-03-13T17:37:15.339Z · LW(p) · GW(p)

I think the problem here is with many trivia questions you either know the answer or you don't; the dominant factor in my results so far is that either I have no answer in mind, assign 0 probability to being right, and am correctly calibrated there, or I do have an answer, and all of my answers at other levels of certainty have turned out right so far, so my calibration curve looks almost rectangular.

I might just be getting accurate information that I'm drastically underconfident, but I think this might be one of the worse types of questions to calibrate on. I mean, even if the problem is just that I'm drastically underconfident on trivia questions and shouldn't be assigning less than 50% probability to any of my answers when I do have an answer, that sounds sufficiently unrepresentative of most areas where you need calibration, and of how most people perform on other calibration tests, for this to be a pretty bad measure of calibration.

Perhaps it would be better as a multiple choice test, so one can have possible answers raised to attention that may or may not be right, and assign probabilities to those?

Replies from: falenas108, lukeprog, Nanashi
comment by falenas108 · 2015-03-14T17:47:17.788Z · LW(p) · GW(p)

My favorite calibration tools have been ones where there was a numerical answer and you had to express a 50% confidence interval and a 90% confidence interval.

Like, a question would be: how many stairs are there in the Statue of Liberty? My 50% interval would be 400-1000, and my 90% interval would be 200-5000.

Looking up the answer, it was 354, so I would mark my 50% as wrong and my 90% as right.
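
A minimal sketch of how this interval-style scoring might work (the function name is made up for illustration; the numbers are the ones from the comment above):

```python
def check_intervals(true_answer, intervals):
    """For each confidence level, record whether the true answer fell inside
    the stated interval. A well-calibrated guesser's 50% intervals should
    contain the answer about half the time, 90% intervals about 90% of the time."""
    return {conf: low <= true_answer <= high
            for conf, (low, high) in intervals.items()}

# The Statue of Liberty example: 354 stairs, 50% interval 400-1000, 90% interval 200-5000.
print(check_intervals(354, {0.5: (400, 1000), 0.9: (200, 5000)}))
# -> {0.5: False, 0.9: True}
```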

comment by lukeprog · 2015-03-13T18:03:23.961Z · LW(p) · GW(p)

0% probability is my most common answer as well, but I'm using it less often than I was choosing 50% on the CFAR calibration app (which forces a binary answer choice rather than an open-ended answer choice). The CFAR app has lots of questions like "Which of these two teams won the Superbowl in 1978" where I just have no idea. The trivia database Nanashi is using has, for me, a greater proportion of questions on which my credence is something more interesting than an ignorance prior.

comment by Nanashi · 2015-03-13T18:40:00.505Z · LW(p) · GW(p)

That's a fair criticism, but if we're going down this road we've also gotta recognize the limitations of a multiple choice calibration test. Both styles suffer from the "You know it or you don't" dichotomy. If these questions were all multiple choice, you'd still have the same rectangular shaped graph, it would just start at 50% (assuming a binary choice) instead of 0%.

The big difference is the solution sets that the different styles represent. There are plenty of situations in life where there are a few specific courses of action to choose from. But there are also plenty of situations where that's not the case.

But, I will say that a multiple choice test definitely yields a "pretty" calibration curve much faster than an open-ended test. You've got a smaller range of values, and the nature of it lets you more confidently rule out one answer or the other. So the curve will be smoother faster. Whereas this will be pretty bottom heavy for a while.

Replies from: Sarunas
comment by Sarunas · 2015-03-14T00:03:02.491Z · LW(p) · GW(p)

> I think the problem here is with many trivia questions you either know the answer or you don't

That means that for those questions most probabilities are either close to 0 or close to 1. This suggests that given this set of questions it would probably be a good idea to increase "resolution" near those two points. For that purpose, perhaps instead of asking for confidence levels expressed as percentages you could ask for confidence levels expressed as odds or log odds. For example, users could express their confidence levels using odds expressed as ratios 2^n:1, for n=k,...,0,...,-k.
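
A minimal sketch of such an odds ladder, taking k = 10 as an example (the value of k and the function name are illustrative, not part of the suggestion):

```python
def odds_to_probability(n):
    """Convert odds of 2^n : 1 into a probability. n = 0 gives 50%;
    positive n pushes toward 1, negative n toward 0."""
    odds = 2.0 ** n
    return odds / (odds + 1.0)

# With k = 10 the ladder spans roughly 0.1% to 99.9%, with much finer
# steps near the extremes than a percentage dropdown provides:
for n in range(10, -11, -1):
    print(f"2^{n}:1 -> {odds_to_probability(n):.4f}")
```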

Replies from: Nanashi
comment by Nanashi · 2015-03-14T11:28:51.553Z · LW(p) · GW(p)

That's an interesting thought, but I do suspect that you'd have to answer a lot of questions to see any difference whatsoever. If you're perfectly calibrated and answer 100 questions on which you are either 99.99% or 99.9% confident, there's a very good chance that you'll get all 100 questions right, regardless of which confidence level you pick.
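
A quick check of that arithmetic:

```python
# Chance of getting all 100 questions right, assuming perfect calibration:
print(0.999 ** 100)    # ~0.905 if every answer is held at 99.9% confidence
print(0.9999 ** 100)   # ~0.990 if every answer is held at 99.99% confidence
# A clean sweep is the most likely outcome either way, so a single batch of
# 100 questions can't really tell the two confidence levels apart.
```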

comment by lukeprog · 2015-03-13T17:39:19.178Z · LW(p) · GW(p)

Awesome!

I've been dying for something like this after I zoomed through all the questions in the CFAR calibration app.

Notes so far:

  • The highest-available confidence is 99%, so the lowest-available confidence should be 1% rather than 0%. Or even better, you could add 99.9% and 0.1% as additional options.
  • So far I've come across one question that was blank. It just said Category: jewelry and then had no other text. Somehow the answer was Ernest Hemingway.
  • Would be great to be able to sign up for an account so I could track my calibration across multiple sessions.
Replies from: Nanashi, RowanE
comment by Nanashi · 2015-03-13T18:09:41.342Z · LW(p) · GW(p)

Re: 0%, that's fair. Originally I included 0% because certain questions are unanswerable (due to being blank, contextless, or whatnot), but even then there's still a non-zero possibility of guessing the right answer out of a near-infinite number of choices.

Re: Calibration across multiple sessions. Good idea. I'll start with a local-based solution because that would be easiest and then eventually do an account-based thing.

Re: Blank questions. Yeah, I should probably include some kind of check to see if the question is blank and skip it if so.

Replies from: lukeprog
comment by lukeprog · 2015-03-13T21:57:50.093Z · LW(p) · GW(p)

Thanks! BTW, I'd prefer to have 1% and 0.1% and 99% and 99.9% as options, rather than skipping over the 1% and 99% options as you have it now.

Replies from: Nanashi
comment by Nanashi · 2015-03-14T11:38:13.024Z · LW(p) · GW(p)

I considered that, but I think at least for now it may just overcomplicate things for not a ton of benefit. Subjectively it seems that out of 100 questions, there are maybe 10 that I would assign the highest possible confidence. Of those, I'd say only 1 would be a question where I'd pick 99% confidence if it were available instead of, say, 99.9%.

So, assuming (incorrectly) that I'm perfectly calibrated, it would take about 7,000 questions to stand a >50% chance of seeing a meaningful difference between the two confidence levels.
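
One way the ~7,000 figure could be reconstructed; this back-of-the-envelope calculation is my own reading of the comment, not something spelled out in it:

```python
import math

# To distinguish 99.9% from a higher confidence level you need to observe at
# least one miss at 99.9%. With a miss probability of 0.001 per answer, the
# number of 99.9%-confidence answers needed for a >50% chance of a miss is:
n_top = math.log(0.5) / math.log(0.999)   # ~693 answers

# If roughly 1 in 10 questions earns that top confidence level, that means:
print(n_top / 0.1)                        # ~6,930 questions overall
```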

comment by RowanE · 2015-03-13T17:44:46.564Z · LW(p) · GW(p)

It's possible to be, to some extent, certain that you haven't thought of a correct answer (if not certain you don't know the answer), because you don't have any answer in mind and yet are not considering the answer "this is a trick question" or "there is no correct answer". Is this something that should be represented, making "0%" correct to include, or am I confused?

I got one blank question, which I think was an error with loading since the answer came up the same as the previous question, and the one after it took a couple seconds to appear on-screen.

Replies from: lukeprog
comment by lukeprog · 2015-03-13T17:59:17.922Z · LW(p) · GW(p)

I'd prefer not to allow 0 and 1 as available credences. But if 0 remained as an option I would just interpret it as "very close to 0" and then keep using the app, though if a future version of the app showed me my Bayes score then the difference between what the app allows me to choose (0%) and what I'm interpreting 0 to mean ("very close to 0") could matter.
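
A minimal sketch of a logarithmic ("Bayes") scoring rule, showing why a literal 0% becomes a problem once scores are computed (the function name is mine, not from any existing app):

```python
import math

def bayes_score(credence, correct):
    """Log score for a single answer: the log of the probability assigned
    to what actually happened. Higher (closer to 0) is better."""
    p = credence if correct else 1.0 - credence
    return math.log(p)

print(bayes_score(0.001, False))  # ~ -0.001: "very close to 0" and wrong costs almost nothing
print(bayes_score(0.001, True))   # ~ -6.9:   "very close to 0" but right costs a lot
# bayes_score(0.0, True) would be log(0), i.e. negative infinity (a math
# domain error in Python), which is why a literal 0% option is awkward.
```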

Replies from: owencb
comment by owencb · 2015-03-13T19:56:56.668Z · LW(p) · GW(p)

I think it's misleading to just drop in the statement that 0 and 1 are not probabilities.

There is a reasonable and arguably better definition of probabilities which excludes them, but it's not the standard one, and it also has costs -- for example probabilities are a useful tool in building models, and it is sometimes useful to use probabilities 0 and 1 in models.

(aside: it works as a kind of 'clickbait' in the original article title, and Eliezer doesn't actually make such a controversial statement in the post, so I'm not complaining about that)

Replies from: lukeprog
comment by lukeprog · 2015-03-13T21:55:55.311Z · LW(p) · GW(p)

Fair enough. I've edited my original comment.

(For posterity: the text for my original comment's first hyperlink originally read "0 and 1 are not probabilities".)

Replies from: owencb
comment by owencb · 2015-03-13T22:16:04.030Z · LW(p) · GW(p)

Perfect, thanks!

comment by owencb · 2015-03-13T20:21:32.181Z · LW(p) · GW(p)

Thanks for providing this!

I have a worry about using trivia questions for calibration: there's a substantial selection effect in the construction of trivia questions, so you're much more likely to get an obscure question pointing to a well-known answer than you would by chance. The effect may be to calibrate people on trivia questions in a way that transfers poorly to other questions.

comment by Shmi (shminux) · 2015-03-13T23:54:27.663Z · LW(p) · GW(p)

I think this should be copied/moved to Main. A calibration tool certainly deserves wider circulation.

comment by Val · 2015-03-19T15:11:38.056Z · LW(p) · GW(p)

A nice feature would be to mark a question as unanswerable, and if it gets enough flags you could review and delete it. I just recently came across a question which asked what can be seen in "this image". Without any image attached, of course.

Replies from: DSimon
comment by DSimon · 2015-04-07T13:33:11.934Z · LW(p) · GW(p)

It would probably be best to just remove all questions that contain certain key phrases like "this image" or "seen here". You'll get a few false positives but with such a big database that's no great loss.
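
A minimal sketch of that key-phrase filter (the phrase list and data format are assumptions for illustration, not taken from the actual site):

```python
UNANSWERABLE_PHRASES = ("this image", "seen here")

def is_answerable(question_text):
    """Drop questions that refer to media the test can't display."""
    text = question_text.lower()
    return not any(phrase in text for phrase in UNANSWERABLE_PHRASES)

questions = [
    "This author of 'The Old Man and the Sea' won the 1954 Nobel Prize in Literature",
    "The bird seen here is the national bird of this country",
]
print([q for q in questions if is_answerable(q)])  # keeps only the first question
```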

comment by kilobug · 2015-03-20T14:33:00.889Z · LW(p) · GW(p)

Interesting idea, thanks for doing it, but sadly many questions are very US-centric. It would be nice to have some "tags" on the questions and let users select which kinds of questions they want (for example, non-US people could remove the US-specific ones).

comment by Val · 2015-03-13T21:19:13.069Z · LW(p) · GW(p)

Should it be specified (or should the answerer say) whether we are from the USA or not? A lot of questions seem to be very USA-centric, so the confidence can heavily depend on whether we live in the USA or not.

comment by Shaedys · 2024-06-08T06:49:05.624Z · LW(p) · GW(p)

Note: the website seems broken now. It still loads, but the questions don't, and there is only 1 question without an answer.

comment by Nanashi · 2015-03-16T16:24:45.936Z · LW(p) · GW(p)

I made a major update to the interface to make it look prettier. I've tested this in Chrome, but please let me know if it doesn't work on any other OS or browser.

I also added Google Analytics so I can see where people are accessing this from.

comment by Luke_A_Somers · 2015-03-13T20:11:20.759Z · LW(p) · GW(p)

A link or button to flip your last right/wrong would be nice. I had assigned 0% confidence for one question and accidentally said I got it right. Misclicks aren't the same as poor calibration.

Also, a little more on how to use it would make sense - for the first one or two I did, I thought it was asking 'how confident are you that this assertion is true?' and thought it was very oddly phrased. Then I realized.

Replies from: Nanashi
comment by Nanashi · 2015-03-13T20:12:55.216Z · LW(p) · GW(p)

Got it. I'll make them color coded and farther apart.

I'll write some better instructions as well.

Replies from: Luke_A_Somers
comment by Luke_A_Somers · 2015-03-29T22:50:23.046Z · LW(p) · GW(p)

What would help most is: "Pick an answer. How confident are you that your answer is correct?"

Then make sure that when the user clicks the 'show answer' button, neither of the two new buttons is in the same place.

ALSO, it would be nice if the calibration curve showed the credible interval for each bin, so I can tell at a glance that my getting 1/1 right at 30% and 0/1 right at 60% isn't actually that big a hit to my calibration.
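
A minimal sketch of how a per-bin credible interval could be computed with a uniform Beta(1, 1) prior (the prior, the 90% width, and the function name are all assumptions for illustration):

```python
from scipy.stats import beta

def bin_credible_interval(correct, total, mass=0.90):
    """Equal-tailed credible interval for the true hit rate in one confidence bin."""
    return beta.interval(mass, correct + 1, total - correct + 1)

# 1 out of 1 right at the 30% bin, 0 out of 1 right at the 60% bin:
print(bin_credible_interval(1, 1))  # roughly (0.22, 0.97)
print(bin_credible_interval(0, 1))  # roughly (0.03, 0.78)
# Both intervals are wide enough to cover the stated confidence level, so a
# single data point in a bin says almost nothing about calibration.
```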

It would also be nice if the second graph were stacked, so that I don't have this giant red bar at 100%, which just looks odd. If it were red behind/on-top-of green, that would make the most sense (if stacked on top, you will obviously need to take the difference to maintain the sense of the graph).

Do you intend to curate out questions that are impossible/require additional clarifications like Alex would have given in advance or people would have worked out from the easy ones?

comment by UnclGhost · 2015-03-13T20:00:41.774Z · LW(p) · GW(p)

Great tool. Does the API you're using allow unanswerable questions to be flagged at all though? Just got one question that depended on an image that wasn't there, and another with no question body. Also, labeled axes on the graph might be nice for people who don't already know how calibration curves work and/or don't like unlabeled axes.

Replies from: Nanashi
comment by Nanashi · 2015-03-13T20:11:45.742Z · LW(p) · GW(p)

It does, but only if you're hosting the API database on your own server, which I will probably do sooner rather than later. I might implement a "skip" option for unanswerable questions, but selecting 0% does pretty much the same thing.

comment by alliumnsk · 2015-03-23T16:20:01.863Z · LW(p) · GW(p)

I noticed that if the previous question was given 60% confidence and the next one is supposed to be 70%, I sometimes leave it at 60 due to laziness.

comment by Kindly · 2015-03-13T22:08:50.263Z · LW(p) · GW(p)

Well, I'm getting a reasonably exciting calibration curve with lots of ups and downs. Cool!

Bug: when I click "Display Calibration Curve" for a second time, the graph is displayed in a larger size. (Doing this sufficiently many times crashed Chrome.) Refreshing the page fixes this behavior.

Feature request: I would like to be able to see if my 50% correctness for 30% confidence is getting 1 out of 2 questions right or 5 out of 10. (Error bars of some sort would also work.)

Replies from: Nanashi
comment by Nanashi · 2015-03-13T22:50:16.692Z · LW(p) · GW(p)

Good idea. I don't think the charts API I'm using will let me do error bars but a good alternative would be a secondary chart that's a bar graph of right vs total questions for each bucket. This would also give a good visual representation of the frequency with which you use various confidence levels.

comment by [deleted] · 2015-03-13T20:26:48.668Z · LW(p) · GW(p)

You rock!