Credence calibration game FAQ

academian

post by Academian · 2012-11-26T00:52:12.885Z · LW · GW · Legacy · 57 comments

Hey rationality friends, I just made this FAQ for the credence calibration game. So if you have people you'd like to introduce to it --- for example, to get them used to thinking of belief strengths as probabilities --- now is a good time :)

Also, shameless promotion: please tweet/g+/like it; I want the world to be thinking in probabilities ASAP!

*Also*, please email me (critch@math.berkeley.edu) if you're good at making apps quickly and are interested in improving the game or making a variant of it; I'm swamped in job applications right now, but could easily have a Skype or phone conversation about our cache of ideas for improvements / variations (e.g. collecting user data on a server, more question types, a variant awarding gambles rather than deterministic scores, a variant with clickable emotion buttons for the user...).

Cheers!

57 comments

Comments sorted by top scores.

comment by lukeprog · 2012-11-26T01:57:02.205Z · LW(p) · GW(p)

Confusingly, this game has at least three titles:

The Credence Game (your website)
Calibration Game (Bent Spoon Games' website + the iOS app store)
Confidence Game (inside the app itself: "Welcome to the Confidence Game!")

Replies from: Academian

↑ comment by Academian · 2012-11-26T09:38:18.880Z · LW(p) · GW(p)

Yes, Alexei (aka Bent Spoon Games) and I talked about the name recently; to promote its use in university courses teaching Bayesian statistics, we're sticking with Credence Game. Confidence means something slightly different in statistics, and the game is meant to teach not just calibration, but also the act of measuring belief strength itself. The name update on BSG, and in the app itself as downloaded from there, will happen soon enough.

comment by asparisi · 2012-11-26T05:47:15.423Z · LW(p) · GW(p)

Suggestions (for general audience outside of LW/Rationalist circles)

I like the name "Confidence Game"- reminds people of a con game while informing you as to the point of the game.

Try to see if you can focus on a positive-point scale, if you can. Try to make it so that winning nets you a lot of points but "losing" only a couple. (Same effect on scores, either way) This won't seem as odd if you set it up as one long scale rather than two shorter ones: so 99-90-80-60-50-60-80-90-99.

Setting it to a timer will make it ADDICTIVE. Set it up in quick rounds. Make it like a quiz show. No question limit, or a bonus if you hit the limit for being "Quick on your feet." Make it hard but not impossible to do.

Set up a leaderboard where you can post to FB, show friends, and possibly compare your score to virtual "opponents" (which are really just scoring metrics). Possibly make those metrics con-man themed, keeping with the game's name.

Graphics will help a lot. Consider running with the con-game theme.

Label people: maybe something like "Underconfident" "Unsure" "Confident" "AMAZING" "Confident" "Overconfident" "Cocksure" (Test labels to see what works well!) rather than using graphs. Graphs and percentages? Turn-off. Drop the % sign and just show two numbers with a label. Make this separate from points but related. (High points=greater chance of falling toward the center, but in theory not necessarily the same.) Yes, I know the point is to get people to think in percentages, but if you want to do that you have to get them there without actually showing them math, which many find off-putting.

Set up a coin system that earns you benefits for putting into the game: extended round, "confidence streak" bonuses, hints, or skips might be good rewards here. Test and see what works. Allow people to pay for coins, but also reward coins for play or another mini-game related to play or both. (Investment=more play)

Replies from: asparisi

↑ comment by asparisi · 2012-11-26T05:50:26.081Z · LW(p) · GW(p)

Another thought: once you have a large bank of questions, consider "theme questions" as something people can buy with coins. Yes, that becomes a matter of showing off rather than the main point, but people LIKE to show off.

comment by pleeppleep · 2012-11-26T02:15:33.169Z · LW(p) · GW(p)

I feel that "rationality friends" should be a standard way of addressing Lesswrongers.

Replies from: Nisan, David_Gerard, Kindly

↑ comment by Nisan · 2012-11-26T05:55:21.586Z · LW(p) · GW(p)

Yes. Also, "Hear me, rat-people."

Replies from: Viliam_Bur, Nornagest

↑ comment by Viliam_Bur · 2012-11-26T14:59:49.228Z · LW(p) · GW(p)

Ratfolk.

↑ comment by Nornagest · 2012-11-26T06:15:49.557Z · LW(p) · GW(p)

You win.

↑ comment by David_Gerard · 2012-11-26T06:31:19.075Z · LW(p) · GW(p)

Super Rationality Adventure Pals the Saturday morning cartoon! On 1080p from a BitTorrent near you.

Replies from: Luke_A_Somers

↑ comment by Luke_A_Somers · 2012-11-26T11:26:50.549Z · LW(p) · GW(p)

We know we're winning when they begin making rationality friends music videos.

Replies from: David_Gerard

↑ comment by David_Gerard · 2012-11-27T21:05:47.295Z · LW(p) · GW(p)

The 5yo has been watching Pokemon anime by the ton. I'm picturing something with similar production values featuring the CFAR trials. A wise old fellow who seems to know just everything. People catch GREAT BIG Cognitive Biases out in the wild, then put these Cognitive Biases into training to evolve them smaller and smaller. RationalWiki guest-star as Team Rocket. It'll be huge.

Replies from: Alicorn

↑ comment by Alicorn · 2012-11-28T19:28:04.746Z · LW(p) · GW(p)

This would be cute, except you'd have to catch the biases in a dreamscape/mindscape of some kind, not in the "wild". (Why would you want to retrieve them, even for shrinking, unless they were already inside your head?)

Replies from: gwern, David_Gerard

↑ comment by gwern · 2012-11-28T21:10:30.306Z · LW(p) · GW(p)

'Arguments are soldiers' and 'reason is for persuasion'; hence, you obviously use biasemons on your opponents to make them make bad decisions and believe wrong things.

(It's not any more malicious than enslaving wild animals and making them fight until they are knocked unconscious by the trauma...)

↑ comment by David_Gerard · 2012-11-28T23:36:26.960Z · LW(p) · GW(p)

Because biases are so much easier to identify in other people, of course.

↑ comment by Kindly · 2012-11-27T15:42:18.829Z · LW(p) · GW(p)

Greetings, my rational comrades!

Replies from: pleeppleep

↑ comment by pleeppleep · 2012-11-27T15:54:22.037Z · LW(p) · GW(p)

No.

comment by DanielVarga · 2012-11-30T09:38:58.783Z · LW(p) · GW(p)

Nice game, BUT. The Mac version caused me some pain. I chose the default fullscreen settings, and I couldn't quit the application for minutes. No quit button that I could find. No drop-down menu bar. Cmd-tab stopped working. Mission Control (F3) stopped working. WTF? Finally I figured out that alt-cmd-esc still works, and then I could quit the game.

comment by wanderingsoul · 2012-11-27T10:06:47.841Z · LW(p) · GW(p)

Nice game, good to see someone making it easy to just practice being well calibrated.

My calibration started off wonky (e.g. I was wrong each of the first six times I claimed 70% certainty) but quickly improved. Unfortunately, it improved suspiciously well: I suspect I may have been assigning probabilities with my primary goal being not to score points, but to get the bar graph displayed every 5 or 10 questions to even out. It's a well-designed game, but at least for me the score wasn't the main motivator, which is a problem because the score is the quantity that increases by being helpfully well-calibrated. Anyone else have a similar experience?

Replies from: gjm

↑ comment by gjm · 2012-12-07T23:37:26.012Z · LW(p) · GW(p)

My experience is distinctly similar.

I observed another curiosity. For much of my time playing the game I've got a larger fraction of 50%s right than of 60%s. I think what's going on is that the 50% cases are ones where I definitely have no idea of the answer and have to fall back on heuristics (have I heard of this person? does the name sound old or recent? etc.) -- and the heuristics work better than I can bring myself to admit they do :-).

comment by gjm · 2012-12-07T23:48:00.771Z · LW(p) · GW(p)

A couple of times I've accidentally clicked twice in rapid succession on the "Continue" button, and the second click has been interpreted as an answer to the next question. (I think it happened to be a correct answer both times.)

I suggest disabling the buttons -- or at least the "B" ones -- for about 100ms after displaying a new question.
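The suggested lockout can be sketched in a few lines. This is only an illustration in Python, not the game's actual code; the class name `AnswerButtons`, its methods, and the injectable `clock` parameter are all assumptions made for the example.

```python
import time
from typing import Callable, Optional


class AnswerButtons:
    """Ignores clicks that arrive too soon after a new question is shown."""

    def __init__(self, lockout_seconds: float = 0.1,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.lockout_seconds = lockout_seconds  # 100 ms, as suggested above
        self._clock = clock                     # injectable so it can be tested
        self._shown_at: Optional[float] = None

    def show_question(self) -> None:
        """Record the moment the new question appears on screen."""
        self._shown_at = self._clock()

    def click(self, answer: str) -> Optional[str]:
        """Return the accepted answer, or None if the click is ignored."""
        if self._shown_at is None:
            return None  # no question on screen yet
        if self._clock() - self._shown_at < self.lockout_seconds:
            return None  # still inside the lockout window: likely a stray double click
        return answer
```

A stray second click on "Continue" that lands within the first 100 ms is then dropped rather than recorded as an answer.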

comment by Kindly · 2012-11-27T13:46:38.080Z · LW(p) · GW(p)

This claim seems incorrect:

[mean squared error] wouldn't penalize a credence of 99.999% on wrong answers enough to strongly dissuade unjustified absolute certainties.

Since the scoring rule is proper, the penalty for incorrectly being 99.999% confident is exactly high enough that you wouldn't want to say you are if you're merely 99.998% confident.

Unless you mean something different?
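A quick numerical check of the properness claim, as a sketch. It assumes the quadratic penalty is (1 − p)² when you are right and p² when you are wrong; the function names are made up for illustration. It also compares the worst-case penalty with a logarithmic one, which may be what the quoted FAQ sentence is getting at.

```python
import math


def expected_quadratic_penalty(report: float, belief: float) -> float:
    """Expected squared error when you report `report` but truly believe `belief`."""
    return belief * (1.0 - report) ** 2 + (1.0 - belief) * report ** 2


def best_report(belief: float, steps: int = 100_000) -> float:
    """Grid-search the report that minimises the expected quadratic penalty."""
    candidates = (i / steps for i in range(steps + 1))
    return min(candidates, key=lambda r: expected_quadratic_penalty(r, belief))


def quadratic_penalty_when_wrong(report: float) -> float:
    """Squared-error penalty for a wrong answer; never exceeds 1."""
    return report ** 2


def log_penalty_when_wrong(report: float) -> float:
    """Bits of surprise, -log2(1 - report); grows without bound as report -> 1."""
    return -math.log2(1.0 - report)
```

Here `best_report(0.99998)` lands on 0.99998, so the rule is proper in the sense described. The quadratic penalty for a wrong 99.999% answer stays just under 1, though, only slightly worse than for a wrong 99% answer, while the logarithmic penalty keeps growing. Both statements can therefore hold at once.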

Replies from: ChristianKl

↑ comment by ChristianKl · 2012-12-03T22:13:30.559Z · LW(p) · GW(p)

I don't care about the US Postmaster General or about who won which sports championship. I don't care about boxing champions. I don't care about the various US sport teams.

I don't even want to learn that knowledge.

Nobel Prizes are more interesting. Who got the Nobel prize in Physiology or Medicine 2012? A) John B. Gurdon and Shinya Yamanaka B) Bruce A. Beutler, Jules A. Hoffmann and Ralph M. Steinman

Who got the Nobel prize in Physiology or Medicine for the discovery that mature cells can be reprogrammed to become pluripotent? A) John B. Gurdon and Shinya Yamanaka B) Bruce A. Beutler, Jules A. Hoffmann and Ralph M. Steinman

EDIT: (This comment should be top level, sorry...)

Replies from: Alexei

↑ comment by Alexei · 2012-12-06T04:01:33.537Z · LW(p) · GW(p)

Yeah, I feel you. I'm currently adding a feature where you specify what questions you want to be asked.

Replies from: Illia

↑ comment by Illia · 2019-08-04T07:00:19.924Z · LW(p) · GW(p)

Hey Alexei, I know it's been a while since this game was developed, but I've discovered it recently and I want to make a better one. More question types, community-contributed and voted questions, categories, profiles; cross-platform web app, themes, better UI etc.

It's of course going to be open-source.

Would you like to help with the math model for questions and evaluation, or maybe with development?

galatyuk.ilya AT gmail

comment by evand · 2012-11-26T04:51:29.581Z · LW(p) · GW(p)

The difference between 60% credence and 80% credence seems much smaller to me than the difference between 90% and 99%. Is there a reason there's no option between 90% and 99%? In your testing, have you found any well-calibrated users who answer 99% a non-trivial fraction of the time?

Replies from: asparisi, Alexei, ChristianKl

↑ comment by asparisi · 2012-11-26T13:37:28.133Z · LW(p) · GW(p)

Well, in 11 out of 145 answers (7.5%) I so far have answered 99%, and I have yet to be wrong in any of my answers.

If I continue at this rate, in approximately 1,174 more answers, I'll be able to tell you if I am well calibrated (fewer, if I fail at more than one answer in the intervening time).
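The figure of roughly 1,174 is consistent with wanting about 100 answers at the 99% level, where a calibrated player would expect about one miss. That target of 100 is an assumption; the comment does not state it. A small sketch of the arithmetic, with an illustrative function name:

```python
import math


def more_answers_needed(total_so_far: int, at_level_so_far: int,
                        target_at_level: int) -> int:
    """Extra answers needed to reach `target_at_level` answers at one credence
    level, assuming that level keeps being chosen at the same rate."""
    total_needed = math.ceil(target_at_level * total_so_far / at_level_so_far)
    return total_needed - total_so_far


# 11 of 145 answers so far were at 99%; aim for 100 answers at that level.
print(more_answers_needed(145, 11, 100))  # 1174
```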

Replies from: asparisi

↑ comment by asparisi · 2012-11-26T13:38:31.508Z · LW(p) · GW(p)

To clarify, wrong in any of my answers at the 99% level. I have been wrong at other levels (including, surprisingly, hovering within around 1% of 90% at the 90% level).

↑ comment by Alexei · 2012-12-06T04:03:09.389Z · LW(p) · GW(p)

Yes, but it also depends on your knowledge base. Unfortunately right now we don't have a lot of questions, but I hope people will add more. Then, we hope, you'll have more questions for which you can answer 99%.

↑ comment by ChristianKl · 2012-12-03T22:47:56.601Z · LW(p) · GW(p)

I would say that I know with >99% certainty that Woodrow Wilson was president earlier than George H. W. Bush.

The same is true for the fact that the element silver is abbreviated as Ag.

comment by Jabberslythe · 2012-11-26T22:15:06.000Z · LW(p) · GW(p)

I have been looking for a good calibration test for a while! Thank you.

Here's a little critique: The questions become repetitive after playing for not too long. I have noticed that a few of the questions are wrong, as in it says that I am wrong and docks points but I am right according to the data it cites (I think some of these had to do with The Dark Knight).

Replies from: Jabberslythe

↑ comment by Jabberslythe · 2012-11-27T14:51:00.561Z · LW(p) · GW(p)

Also, many of the questions are US-centric, and many are about sports, popular music, or other things that I would not expect the average person interested in calibration to be knowledgeable about or interested in. If you are trying to make the game take off, however, these could be justified and might be good decisions.

Replies from: Alexei, ChristianKl

↑ comment by Alexei · 2012-12-06T04:04:38.934Z · LW(p) · GW(p)

Yes, I agree. We are working on adding more questions to the game.

I fixed the bug with the film question.

↑ comment by ChristianKl · 2012-12-03T23:38:50.259Z · LW(p) · GW(p)

When it comes to the boiling point of platinum I would however doubt that the average person has any interest.

comment by linkhyrule5 · 2013-08-04T03:20:13.278Z · LW(p) · GW(p)

One thing I might suggest is leaving the question on screen after the question has been answered, and loading the answer on the same screen. I keep indicating an answer I didn't mean to select, and I'm not sure if it's a bug or my twitchy fingers.

comment by ChristianKl · 2012-12-08T12:22:39.720Z · LW(p) · GW(p)

If you ask Snopes, maybe you can use their database of claims. You could start by throwing out all claims that Snopes doesn't classify as true or false.

Example claim from http://www.snopes.com/business/money/money.asp: the rumour that U.S. law specifies that a creditor does not have to accept more than 100 pennies towards the payment of a debt or obligation.

A) True - B) False

Once the user has answered, the software shows "False" and gives a link to the Snopes website where the user can read more information.

Snopes would profit from having more people exposed to their database and from those people being directed by link to the Snopes website.

I don't know whether the Snopes folks would be open to having their database used in that way, but asking won't hurt.

comment by gjm · 2012-12-08T12:00:00.288Z · LW(p) · GW(p)

Bug report: I just got asked which of Cleopatra and Spider-Man 3 was more expensive, answered, and was told:

Incorrect. The right answer is A ($Bmillion) vs B ($Cmillion)

where C > B. I am having trouble understanding how this is even possible given how I'd have thought the game works. Some weird screwup related to inflation correction? Comparing the prices as strings rather than numbers? (B was shorter than C but with a larger first digit.)

[EDITED to add: hey, it just did it to me again. This time it was Cleopatra versus Troy. Same failure mode otherwise.]
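The string-comparison guess would produce exactly this symptom. A minimal Python illustration follows; the budget figures are made up, chosen only so that one is shorter but starts with a larger digit, and nothing here is taken from the game's code.

```python
def more_expensive_buggy(cost_a: str, cost_b: str) -> str:
    """Compares the costs as text, so '44' beats '258' because '4' > '2'."""
    return "A" if cost_a > cost_b else "B"


def more_expensive(cost_a: str, cost_b: str) -> str:
    """Converts to numbers first, so the larger amount wins."""
    return "A" if float(cost_a) > float(cost_b) else "B"


# Hypothetical figures in millions of dollars.
print(more_expensive_buggy("44", "258"))  # A
print(more_expensive("44", "258"))        # B
```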

comment by ChristianKl · 2012-12-06T20:44:23.849Z · LW(p) · GW(p)

Instead of having two different 50/50 buttons it might make sense to have only one button that spans both fields.

Believing A is true with 50% and believing B is true with 50% is the same belief when either A or B is true. The program shouldn't encourage the student to mentally treat the two categories as different. It should train the student to feel like he doesn't know the answer when he's 50/50.

Replies from: gjm

↑ comment by gjm · 2012-12-07T23:40:26.231Z · LW(p) · GW(p)

Except that I (and maybe others with similarly warped brains) find that my success rate in things I click "50%" for is distinctly above 50%. For much of my time playing the game I've actually got a larger fraction of "50%"s than of "60%"s right. If there were only one 50% button I'd have had no way of discovering that, which would be a shame because it clearly tells me something useful (specifically, I think, that my heuristics for guessing answers to some kinds of questions whose answers I have no explicit idea about are more effective than I thought).

comment by ChristianKl · 2012-12-03T23:48:43.753Z · LW(p) · GW(p)

There should be auditory feedback. A correct guess should trigger a pleasant sound.

Replies from: Alexei

↑ comment by Alexei · 2012-12-06T04:05:07.778Z · LW(p) · GW(p)

Good point! I'll add that to my feature todo list. :)

comment by ChristianKl · 2012-12-03T22:55:02.374Z · LW(p) · GW(p)

When I use the program I would enjoy it if I learn some content while I'm using the program. You could make a list of 2000 claims. 1000 true claims. 1000 false claims.

Type I error is the incorrect rejection of a true null hypothesis. A) Yes B) No

Type I error is the failure to reject a false null hypothesis. A) Yes B) No

Pack all the knowledge that CFAR wants to teach in rationality minicamps into the program.

What's the name of the bias to see events that have already occurred as being more predictable than they were before they took place? A) Hindsight bias B) Confirmation bias

To improve learning you use the SRS in the background. Take the SuperMemo algorithm. You don't have to show the user in the open how many due cards there are. Whenever the user has completed all due cards, you add new cards. Whenever there are due cards, you add no new cards.

This approach has two big advantages: 1) The student learns two things at the same time. 2) The student is calibrating his knowledge on questions that are more meaningful. That will increase the chance that he can use the calibration in his "real life".
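A rough sketch of how the scheduling described above might look, using an SM-2-style update (the publicly described SuperMemo 2 rules). How a game answer would be turned into a 0-5 quality grade is left open, and every name here (`Card`, `review`, `next_card`) is an assumption for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Card:
    prompt: str
    easiness: float = 2.5   # SM-2 starting E-Factor
    repetitions: int = 0    # successful reviews in a row
    interval_days: int = 0
    due_day: int = 0


def review(card: Card, quality: int, today: int) -> None:
    """Update a card after a review graded 0 (blackout) to 5 (perfect)."""
    if not 0 <= quality <= 5:
        raise ValueError("quality must be between 0 and 5")
    if quality < 3:
        # Failed recall: restart the repetition sequence, keep the E-Factor.
        card.repetitions = 0
        card.interval_days = 1
    else:
        if card.repetitions == 0:
            card.interval_days = 1
        elif card.repetitions == 1:
            card.interval_days = 6
        else:
            card.interval_days = round(card.interval_days * card.easiness)
        card.repetitions += 1
        delta = 0.1 - (5 - quality) * (0.08 + (5 - quality) * 0.02)
        card.easiness = max(1.3, card.easiness + delta)
    card.due_day = today + card.interval_days


def next_card(seen: List[Card], unseen: List[Card], today: int) -> Optional[Card]:
    """Serve due cards first; introduce a new card only when nothing is due."""
    due = [c for c in seen if c.due_day <= today]
    if due:
        return min(due, key=lambda c: c.due_day)
    if unseen:
        return unseen.pop(0)
    return None
```

The due count never has to be shown to the player; `next_card` just decides quietly whether the next question is a repeat or a new one.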

Replies from: gjm, Alexei

↑ comment by gjm · 2012-12-07T23:41:26.936Z · LW(p) · GW(p)

"It's a dessert topping and a floor wax!"

I think trying to do two completely different things together would likely result in neither of them being done as well as they can be by two separate tools.

↑ comment by Alexei · 2012-12-06T04:09:13.150Z · LW(p) · GW(p)

One reason we don't want to do 50/50 true/false statements is because reality doesn't work like that.

I love your idea of putting CFAR knowledge into the game!! That's great. I'll get someone to do it.

Adding SRS sounds like a fun feature. I'll think about adding it.

Replies from: ChristianKl

↑ comment by ChristianKl · 2012-12-06T20:50:06.609Z · LW(p) · GW(p)

One reason we don't want to do 50/50 true/false statements is because reality doesn't work like that.

Regardless of how you slice it, 50% of the time A will be the right answer and 50% of the time B. Reality doesn't work like that. But I agree that when it is possible to ask about a subject in a way where A and B aren't Yes/No, it might be a good idea to avoid using Yes/No.

I love your idea of putting CFAR knowledge into the game!! That's great. I'll get someone to do it.

Great :)

Adding SRS sounds like a fun feature. I'll think about adding it.

In case you don't want too much repetition on the first days of using the software, you could add an additional rule: A maximum of 50% of the cards are recently answered, the other 50% of the cards are either new or older than 21 days.

It would also be possible to make some questions, like those about CFAR knowledge, SRS-based while not doing SRS on other questions.

comment by ChristianKl · 2012-12-03T21:40:06.388Z · LW(p) · GW(p)

Having to use the mouse is not optimal.

It would be good to be able to use the keypad. 0-9 gives you one key for every button.

could give you a new question and move past the graph.

Replies from: Alexei

↑ comment by Alexei · 2012-12-06T04:09:53.717Z · LW(p) · GW(p)

Good call! I'll add that feature in.

comment by khafra · 2012-11-29T11:04:07.394Z · LW(p) · GW(p)

This app is great; I've already shared it on Facebook and in person. One thing 2 of 3 people I shared with wanted was multiple user profiles. Not just for multiple people; but so a single user can track his calibration in the morning vs. the evening; under different emotional states and amounts of stress, etc.

Replies from: Alexei

↑ comment by Alexei · 2012-12-06T04:10:53.385Z · LW(p) · GW(p)

Interesting. I haven't heard anyone else request this feature, but I'll add it to my TODO list and might get to it eventually. Thanks!

Replies from: khafra

↑ comment by khafra · 2012-12-06T11:57:39.224Z · LW(p) · GW(p)

Is the codebase on GitHub? I might try to write a patch, although I'd have to learn Objective-C in the meantime.

Replies from: Alexei

↑ comment by Alexei · 2012-12-07T00:11:09.697Z · LW(p) · GW(p)

No, but it sounds like a good idea to put it up there. The project is written using the Unity3D engine, and the code base is in C#. :)

comment by JoshuaFox · 2012-11-26T11:05:05.267Z · LW(p) · GW(p)

What's a good result, both in terms of the number and the graph? What are other people's results? Not that I want to be too competitive, but I have no idea if I am doing very well or very badly.

Replies from: asparisi

↑ comment by asparisi · 2012-11-26T21:16:45.314Z · LW(p) · GW(p)

High score seems to be good in terms of "My confident beliefs tend to be right."

Having your bars on the graph line up with the diagonal line would be an "ideal" graph (neither over- nor under- confident)
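What the bars represent can be sketched as a per-level hit rate: for each stated credence, the percentage of answers that turned out right. The data layout below (pairs of stated percentage and correctness) and the function name are assumptions for illustration, not the app's internals.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def calibration_table(answers: Iterable[Tuple[int, bool]]) -> Dict[int, Tuple[float, int]]:
    """Map each stated credence (in percent) to (hit rate in percent, answer count)."""
    totals: Dict[int, int] = defaultdict(int)
    hits: Dict[int, int] = defaultdict(int)
    for credence, correct in answers:
        totals[credence] += 1
        hits[credence] += int(correct)
    return {c: (100.0 * hits[c] / totals[c], totals[c]) for c in sorted(totals)}


# Tiny made-up sample: a bar above its label means underconfidence at that level.
sample = [(60, True), (60, True), (60, False), (90, True), (90, True)]
for credence, (hit_rate, count) in calibration_table(sample).items():
    print(f"stated {credence}%: right {hit_rate:.0f}% of {count} answers")
```

A well-calibrated player's hit rates sit close to the stated percentages, which is the diagonal line on the graph.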

Replies from: JoshuaFox

↑ comment by JoshuaFox · 2012-11-27T08:34:58.402Z · LW(p) · GW(p)

What is a high score? I realize that there is no absolute scale, but I have no idea if 10 is good or 1000 is bad.

Replies from: ChristianKl, gjm, asparisi

↑ comment by ChristianKl · 2012-12-03T23:02:30.621Z · LW(p) · GW(p)

Out of 363 guesses my average score is 8. I'm from Germany, so my knowledge of some of the US-specific stuff isn't good.

↑ comment by gjm · 2012-12-07T23:46:12.108Z · LW(p) · GW(p)

One anecdata point: after 200 answers I have an average score of 12.6; I have success rates of 54% for "50%" answers, 59% for "60%" answers, 75% for "70%" answers, 88% for "80%" and "90%" answers (average 83%), and 100% for "99%" answers. (I've been very consistently underconfident.) I'm from the UK and for many of the sporting questions I couldn't even tell you what sport they're about. This feels to me like pretty good performance but I have little real basis for that opinion.

[EDITED to add: my counts are 33 @ 50%, 107 @ 60%, 40 @ 70%, 6 @ 80%, 3 @ 90%, 11 @ 99%. So lots of very unconfident answers.]

[EDITED again to add: Perhaps add an optional mode in which some score information gets shared, anonymously or otherwise, so that there can be a leaderboard and a display of many users' calibration graphs and so forth, for those who like to compete or to benchmark themselves.]

[EDITED again to add, though probably no one cares but me: after 400 questions all my statistics are basically the same as above, so apparently I'm (1) consistent and (2) a slow learner.]

Replies from: JoshuaFox

↑ comment by JoshuaFox · 2012-12-08T19:27:11.091Z · LW(p) · GW(p)

Thanks. FWIW, My average score was 13.

And yes, a sharing mode would be useful. As-is, we have no basis for comparison.

↑ comment by asparisi · 2012-11-27T12:49:18.800Z · LW(p) · GW(p)

Well, you can get up to 99 points for being 99 percent confident and getting the right answer, or minus several hundred (I have yet to fail at a 99 so I don't know how many) for failing at that same confidence level.

Wrong answers are, for the same confidence level, more effective at bringing down your score than right answers are at bringing it up, so in some sense as long as you are staying positive you're doing good.

But if you want to compare further, you'd have to take into account how many questions you've answered, as your lifetime total will be different depending on the questions you answer. (990 after 10 questions would be exceptional: best possible score. 990 after 1,000 questions means you are getting a little less than a point per question, overall)
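One logarithmic rule that matches the numbers mentioned in this comment is 100 · log2(2p), where p is the probability you gave to the answer that turned out to be true. Treat that formula as an assumption: it fits "up to 99 points" and "minus several hundred", but the comment itself does not state the game's exact scoring rule. A sketch:

```python
import math


def log_score(credence: float, correct: bool) -> float:
    """Points under the assumed rule 100 * log2(2p), where p is the probability
    assigned to the answer that turned out to be true."""
    p = credence if correct else 1.0 - credence
    return 100.0 * math.log2(2.0 * p)


for credence in (0.5, 0.6, 0.9, 0.99):
    right = log_score(credence, True)
    wrong = log_score(credence, False)
    print(f"{credence:.0%}: right {right:+.0f}, wrong {wrong:+.0f}")
```

Under this rule a 50% answer scores zero either way, and at every higher level a miss costs more than a hit gains, which is the asymmetry described above.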
