A fun estimation test, is it useful?

mwengler

post by mwengler · 2010-12-20T21:09:37.533Z · LW · GW · Legacy · 50 comments

So you think it's important to be able to estimate how well you are estimating something? Here is a fun test that has been given to plenty of other people.

I highly recommend you take the test before reading any more.

http://www.codinghorror.com/blog/2006/06/how-good-an-estimator-are-you.html

The discussion of this test on the blog that hosts it is quite interesting, but I recommend you read it after taking the test. Similarly, I anticipate there will be interesting discussion here about the test and whether it means what we want it to mean.

My great apologies if this has been posted before. I did my best with Google trying to find any trace of this test, but if this has already been done, please let me know and, ideally, let me know how I can remove my own duplicate post.

PS: The Southern California meetup 19 Dec 2010 was fantastic, thanks so much JenniferRM for setting it up. This post on my part is an indirect result of what we discussed and a fun game we played while we were there.

50 comments

Comments sorted by top scores.

comment by Vaniver · 2010-12-21T05:52:44.898Z · LW(p) · GW(p)

I got 8 right. The ones I got wrong were the Great Lakes (upper bound was too small by a factor of 1e8) and the currency in circulation (my range was .75-1.25T, and the right answer was slightly lower than my lower bound).

The problem I have with this is that while it's structured well to punish overconfidence, it's not structured well to punish underconfidence. You can do better than 95% of respondents (according to the graph the author posted) if you write down -inf for your lower bound and +inf for your upper bound, since all 10 will be within your range (and so your error magnitude is 1, just as if you got 8 right). The fact that you're twice as bad at estimating as someone who writes down 0 as a lower bound and +inf as an upper bound shows up nowhere.

The best test of estimation ability would probably be: each guess costs you log(upper)-log(lower), and a guess that includes the answer gives you Y points. Y determines the minimum level of knowledge you need to guess, but if you narrow your guess you improve your score. You could then figure out what confidence interval people are using from the width they select, and see how that compares with their reported confidence.
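A rough Python sketch of the scoring rule proposed above, for illustration only. The base-10 logarithm, the reward `Y = 3.0`, the function name `interval_score` and the three example ranges are all assumptions; only the 190-ton blue whale figure comes from elsewhere in this thread.

```python
import math

def interval_score(lower, upper, truth, hit_reward):
    """Score one interval: pay its log10 width, earn hit_reward if it contains truth."""
    if not (0 < lower <= upper):
        raise ValueError("bounds must be positive with lower <= upper")
    cost = math.log10(upper) - math.log10(lower)
    reward = hit_reward if lower <= truth <= upper else 0.0
    return reward - cost

Y = 3.0        # hypothetical reward for a hit, in the same units as the log-width cost
truth = 190.0  # blue whale weight in tons, as quoted in this thread

tight = interval_score(100, 300, truth, Y)       # narrow range that contains the answer
wide = interval_score(1, 1_000_000, truth, Y)    # six orders of magnitude, also a hit
miss = interval_score(10, 100, truth, Y)         # one order of magnitude, but a miss
```

Under these made-up numbers the tight hit scores best, and the six-orders-wide hit scores worse than the narrow miss, which is the penalty for underconfidence that the original test lacks.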

Replies from: mwengler

↑ comment by mwengler · 2010-12-21T06:08:31.920Z · LW(p) · GW(p)

This comment reflects how I realized I could game this... AFTER I took it though. Guess -inf to +inf for 9 of them, and something quite tight for the 10th. Then you've got your 90%.... but you don't really.

comment by cata · 2010-12-20T23:26:24.876Z · LW(p) · GW(p)

I got eight out of ten. I thought that the Pacific coastline was much shorter (but I was imagining it as basically a smooth curve, and terminating it at the Bering Strait and somewhere down near India -- is that how they measured it, I wonder?), and I thought blue whales were much less heavy. Now I have to go find out how something that weighs 190 tons manages to propel itself!

I was surprised by one particular aspect of the blog comments; the recurring theme of "I don't think it's valuable to make estimates like this, because if I ever gave a project estimate of 'two to twenty weeks' I would be laughed out of the room."

What do you mean, you would be laughed out? If you're 90% confident that it's two to twenty weeks, then you should feel OK about saying so -- that's better than confidently saying "five weeks" and watching it turn out to be twenty! At least it should be better. It's a shame if people feel compelled to whitewash guesses like that to save face.

comment by HonoreDB · 2010-12-20T21:50:09.104Z · LW(p) · GW(p)

It's an interesting phenomenon (and yes, my intervals were too small). I was actually expecting the reverse. His explanation reminds me of the conjunction fallacy--we overvalue specificity.

Total length of the coastline of the Pacific Ocean

Isn't this a classic example of a divergent series, though?

Replies from: datadataeverywhere, jferguson

↑ comment by datadataeverywhere · 2010-12-20T22:52:32.787Z · LW(p) · GW(p)

Yes, but that doesn't mean we haven't decided on a uniform way to agree on a system that produces finite measurements.

↑ comment by jferguson · 2010-12-21T00:47:56.119Z · LW(p) · GW(p)

I interpreted it as "the length of the coastline as represented on a high-detail world map", which got me a good estimate.

comment by Normal_Anomaly · 2010-12-21T12:39:28.606Z · LW(p) · GW(p)

I found another test that's more comprehensive. It has lots more questions, lets you give a confidence estimate for each, and tells you how well calibrated you are at 0% to 100% probability. And it notes both underconfidence and overconfidence.

http://www.projectionpoint.com/test1.php

I got a 78 out of 100.

Replies from: Emile

↑ comment by Emile · 2010-12-22T22:22:00.067Z · LW(p) · GW(p)

I got 73.

I didn't find this test as good as the other one:

1) In the estimating test, you have to figure out things in a void, with no clue from the question. But in this test, if the question is whether Sarah Blogg was Humphrey Bogart's second wife, my estimate goes from 0.00001% to 50%. So I often find myself guessing whether it's a trick question.

2) The results don't seem to take accuracy into account, meaning you might get a perfect score by answering "50%" on all questions (I haven't tried). Seeing a log scoring system would be better. (But then I didn't dig too much for their formula.)

3) Their graph is ugly. The vertical lines don't line up with the numbers at the bottom! Geez!

Replies from: Normal_Anomaly

↑ comment by Normal_Anomaly · 2010-12-23T01:54:39.254Z · LW(p) · GW(p)

1) I like having at least some data; I still found myself using all 10 options at least once. That is, the test still relied to a large extent on my prior knowledge.

2) You're right about this. I tried and they don't; guessing 50% every time got me a perfect. I don't know enough about designing these things to make one with a log scoring rule, but it would definitely be nice to see one.

3) Ooh, that is weird. The gridlines don't seem to mean as much as the actual numbered labels; taking them off would make this go away.

It seems like neither of these tests is able to measure both calibration and discrimination.
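To make the difference between the two kinds of score concrete, here is a minimal Python sketch. It is not the formula used by either test: `calibration_gap` is a deliberately crude single-bucket calibration measure, `mean_log_score` is one common form of a log scoring rule (shifted so that a pure 50% guess scores 0), and the two sets of answers are invented.

```python
import math

def calibration_gap(confidences, outcomes):
    """Absolute gap between mean stated confidence and the fraction answered correctly."""
    mean_conf = sum(confidences) / len(confidences)
    hit_rate = sum(outcomes) / len(outcomes)
    return abs(mean_conf - hit_rate)

def mean_log_score(confidences, outcomes):
    """Average of 1 + log2(probability assigned to what actually happened)."""
    total = 0.0
    for p, correct in zip(confidences, outcomes):
        total += 1.0 + math.log2(p if correct else 1.0 - p)
    return total / len(confidences)

# Hypothetical respondent who answers "50%" everywhere and gets half right.
hedge_conf = [0.5] * 10
hedge_out = [1, 0] * 5

# Hypothetical respondent who says "90%" everywhere and gets nine of ten right.
informed_conf = [0.9] * 10
informed_out = [1] * 9 + [0]

hedge_gap = calibration_gap(hedge_conf, hedge_out)
informed_gap = calibration_gap(informed_conf, informed_out)
hedge_log = mean_log_score(hedge_conf, hedge_out)
informed_log = mean_log_score(informed_conf, informed_out)
```

Both respondents come out perfectly calibrated by the crude measure, but only the log score separates the one who actually knows things from the one who hedges everything at 50%.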

comment by bentarm · 2010-12-20T23:06:03.003Z · LW(p) · GW(p)

Interesting quote from the comments in the explanation:

"I have even heard that when the country of Italy puts out a work request, they take all of the companies estimates and average them. The company closest to average gets the contract."

This sounds like an excellent idea, and should presumably incentivise contractors to give more accurate estimates than they would normally. Are there other methods of getting estimates from contractors that reward accurate estimates?

The problem being, of course, that it would always be difficult to punish contractors for finishing work early, as this gets incentives wrong in a different way.

Replies from: gwern

↑ comment by gwern · 2010-12-26T21:08:42.572Z · LW(p) · GW(p)

Also sounds like it incentivises collusion; company A and B take turns on contracts - one bids a ridiculously high sum that will skew the average, and the other turns in a more reasonable but inflated bid. The inflated bid is closer to the skewed average, and whoever's turn it is profits quite a bit.
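A toy Python illustration of that collusion scenario. The rule "closest to the mean wins" is taken from the quoted blog comment at face value; the bidder names and every bid amount are invented for the example.

```python
def average_bid_winner(bids):
    """Return the bidder whose bid is closest to the mean of all bids."""
    mean_bid = sum(bids.values()) / len(bids)
    return min(bids, key=lambda name: abs(bids[name] - mean_bid))

# Three hypothetical honest bids plus colluder B's inflated bid.
without_shill = {
    "honest_1": 100,
    "honest_2": 105,
    "honest_3": 110,
    "colluder_B": 180,
}

# Same auction, but colluder A adds a ridiculously high bid to drag the mean up.
with_shill = dict(without_shill, colluder_A=500)

winner_without = average_bid_winner(without_shill)  # mean is 123.75
winner_with = average_bid_winner(with_shill)        # mean is 199.0
```

Without the shill bid the mean sits among the honest bids and one of them wins; with it, the mean moves to 199 and the inflated bid of 180 becomes the closest.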

comment by gimpf · 2010-12-21T22:31:57.413Z · LW(p) · GW(p)

Either I am the only irrational person here, or there is a strong publication bias. 4 out of 10. Once I was off nearly 3 orders of magnitude.

Replies from: Emile, None, Aharon

↑ comment by Emile · 2010-12-22T21:41:20.302Z · LW(p) · GW(p)

For the Great Lakes I was off by so many orders of magnitude I'm too embarrassed to go count them. Probably about ten.

Replies from: sfb, Perplexed

↑ comment by sfb · 2010-12-23T03:49:02.919Z · LW(p) · GW(p)

"Ten thousand trillion litres should cover it!"

"Nope"

Oops.

↑ comment by Perplexed · 2010-12-22T22:26:07.195Z · LW(p) · GW(p)

Me too. 5 out of 10, and the ones I missed were close, except for that one. Couldn't figure out how I was that far wrong. So I took another look at the answers.

23,000 cubic kilometers
6.8 x 10^20 cubic meters

The first seems reasonable. 230km by 100km by 1km deep. The second seems ... wrong and just weird. 2.3 x 10^4 cubic km would be 2.3 x 10^13 cubic meters.
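The unit conversion above is easy to check mechanically. A short Python sketch, using only the figures quoted in this comment (the variable names are arbitrary):

```python
M_PER_KM = 1_000

# The sanity-check box from the comment: 230 km by 100 km by 1 km deep.
box_volume_km3 = 230 * 100 * 1

# One cubic kilometre is (1000 m)^3 = 10^9 cubic metres.
volume_m3 = 23_000 * M_PER_KM ** 3

# How far the quoted "6.8 x 10^20" figure is from the converted value.
ratio = 6.8e20 / volume_m3
```

The conversion gives 2.3 x 10^13 cubic meters, so the quoted 6.8 x 10^20 figure is larger by a factor of roughly 3 x 10^7, whatever unit it was originally meant to be in.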

↑ comment by [deleted] · 2010-12-22T05:45:28.776Z · LW(p) · GW(p)

I only had 5 correct even though I knew about the involved bias from several sources and had done the (more extensive) test on http://www.projectionpoint.com/test1.php some time ago.

My main problem was the sheer scale of some problems, like the volume of the Great Lakes, which screwed up all the calculations in my head. Also, I did actually kinda-know a few of these items, but misremembered them and overconfidently didn't adjust my margins for safety.

Still, hitting a target with over 10 orders of magnitude in range isn't exactly accuracy.

↑ comment by Aharon · 2010-12-21T22:53:46.929Z · LW(p) · GW(p)

nah, I only got 4, too :(

comment by rwallace · 2010-12-21T04:07:26.375Z · LW(p) · GW(p)

I got all 10 right, but I had previously heard of similar results, so I don't know whether I would have otherwise known to be that careful.

Of course, for questions where I really didn't know the answer, I had to give a range spanning more than an order of magnitude to reliably hit the target (in some cases I could have tightened up my range, but in one case I only barely got it as it was); but I still think that's better than giving confident but wrong answers.

Replies from: ArisKatsaris, Oscar_Cunningham

↑ comment by ArisKatsaris · 2010-12-21T13:51:47.191Z · LW(p) · GW(p)

Getting all 10 right just means you gave ranges that were too wide. You were asked to have 90% certainty, so the "perfect" score is 9 correct answers out of 10. :-)

Replies from: Emile, rwallace, wedrifid

↑ comment by Emile · 2010-12-21T15:14:54.234Z · LW(p) · GW(p)

If I got 9 right and someone else got all 10 right and gave narrower ranges than I did, I'd say he's probably better at estimating than I am.

Replies from: FAWS, ArisKatsaris

↑ comment by FAWS · 2010-12-21T16:38:15.816Z · LW(p) · GW(p)

Better discrimination, but worse calibration (probably, low confidence since it's only a single data point).

↑ comment by ArisKatsaris · 2010-12-21T15:30:00.479Z · LW(p) · GW(p)

He'd be better at estimating the answers themselves, but he'd be worse at estimating his ability to estimate.

Replies from: mwengler

↑ comment by mwengler · 2010-12-21T17:21:32.845Z · LW(p) · GW(p)

To be fair, 90% confidence means 90% on average. From one test like this, I'm not sure you could conclude much difference in ability to estimate or synthesize confidence levels between people who score 8, 9, and 10. Indeed, because the test can be gamed by picking 9 answers with -inf to inf bounds and one with tight bounds to force a 9, I would rate a 10 achieved with tighter bounds as better at confidence estimation than a 9 achieved with wildly different or generally wider confidence bounds.
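The "90% on average" point can be quantified with a binomial calculation. This sketch assumes a perfectly calibrated person whose ten answers are independent, which is an idealization; the function name is arbitrary.

```python
from math import comb

def score_probability(hits, questions=10, hit_rate=0.9):
    """Binomial probability of exactly `hits` intervals containing the true value."""
    misses = questions - hits
    return comb(questions, hits) * hit_rate**hits * (1 - hit_rate)**misses

p8 = score_probability(8)    # about 0.194
p9 = score_probability(9)    # about 0.387
p10 = score_probability(10)  # about 0.349
```

So even a perfectly calibrated estimator scores exactly 9 less than 40% of the time, and scores of 8 or 10 are together more likely than a 9. A single ten-question run says little about the difference between those three scores.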

↑ comment by rwallace · 2010-12-21T14:00:15.293Z · LW(p) · GW(p)

But I almost got a couple wrong :)

↑ comment by wedrifid · 2010-12-21T17:32:58.200Z · LW(p) · GW(p)

You were asked to have 90% certainty, so the "perfect" score is 9 correct answers out of 10. :-)

I question that metric of 'perfection'. I got said 'perfect' score by estimating, among other things, a blue whale weighing in at between 10 and 3^^^3 kg and a Sun with a surface temperature of negative one degrees Kelvin.

Replies from: ArisKatsaris

↑ comment by ArisKatsaris · 2010-12-26T15:35:23.830Z · LW(p) · GW(p)

That just means you lied to the test, which made it useless in determining your capacity to estimate certainty levels.

Try for an honest attempt next time, then it'll help you better.

Replies from: wedrifid

↑ comment by wedrifid · 2010-12-26T15:47:17.968Z · LW(p) · GW(p)

That just means you lied to the test, which made it useless in determining your capacity to estimate certainty levels.

No, what it means is that your description of the "perfect" score is wrong. Emphasis on "your" because the test itself makes no such declaration, leaving scope for a nuanced interpretation (as others have provided here).

Try for an honest attempt next time, then it'll help you better.

It is not relevant (see above) but this too may be mistaken. Tests that are foiled by 'lying to them' are bad tests. Making a habit of engaging with them is detrimental to rational thinking. They measure and encourage the development of the ability to deceive oneself - a bias that comes naturally to humans. "Sincerity" is bullshit.

Replies from: ArisKatsaris

↑ comment by ArisKatsaris · 2010-12-26T21:53:12.417Z · LW(p) · GW(p)

Tests that are foiled by 'lying to them' are bad tests.

Really? What test can you imagine that checks your ability at anything which can't be foiled by intentionally attempting to foil it?

A test that measures your speed at running can be foiled if you don't run as best as you can. A test that measures your ability to stand still can be foiled if you intentionally move. And a test that measures your intelligence can be foiled if you purposefully give it stupid answers. Which is what you did.

Perhaps you mean that this would be a bad test for someone to use to evaluate others, as people can also foil the test in an upwards direction, not just a downwards one.

Making a habit of engaging with them is detrimental to rational thinking.

Citation needed.

"Sincerity" is bullshit.

No, sincerity is the opposite of bullshit. I didn't have much trouble typing the range I actually believed gave me roughly a 90% chance. You on the other hand chose to type nine ranges that gave 100% chance, and one range that gave 0% chance.

So I was measurably, quantifiably, more sincere than you in my answers

Replies from: wedrifid

↑ comment by wedrifid · 2010-12-27T05:06:06.950Z · LW(p) · GW(p)

A test that measures your speed at running can be foiled if you don't run as best as you can. A test that measures your ability to stand still can be foiled if you intentionally move. And a test that measures your intelligence can be foiled if you purposefully give it stupid answers. Which is what you did.

You are being silly. Self sabotage is not what we are talking about here and not relevant. In fact, if your definition of a 'perfect score' was actually what the test was talking about then you would be self sabotaging. See my previous support of the test itself and advocacy of a more nuanced evaluation system than integer difference minimization.

No, sincerity is the opposite of bullshit.

"Sincerity is bullshit." is actually a direct quote from On Bullshit. Those people here who use the term bullshit tend to mean it in the same sense described in that philosophical treatise.

I never reward people, even myself, for self deception.

↑ comment by Oscar_Cunningham · 2010-12-21T10:04:09.698Z · LW(p) · GW(p)

More than an order of magnitude! My answers often crossed six orders of magnitude, and I still only got 5/10!

Replies from: rwallace

↑ comment by rwallace · 2010-12-21T15:42:47.277Z · LW(p) · GW(p)

My estimate for the volume of the Great Lakes spanned several orders of magnitude, because I multiplied the uncertainties in all three dimensions.

Which has relevance to real scenarios: an estimate with several independent uncertainties had better give a range, if not strictly the product of all of them, at least wider than an estimate with just one similar uncertainty.
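A small Python sketch of how per-dimension uncertainty compounds. The three ranges are invented (they are not measurements of the Great Lakes), and the "independent" figure assumes the errors are independent and roughly symmetric in log space, so that spreads add in quadrature rather than linearly.

```python
import math

def span_in_orders(low, high):
    """Width of a range expressed in orders of magnitude (log10 of the ratio)."""
    return math.log10(high / low)

# Hypothetical ranges, each one order of magnitude wide.
dims = {
    "length_km": (100, 1000),
    "width_km": (50, 500),
    "depth_km": (0.05, 0.5),
}

# Worst case: multiply all the lows together and all the highs together.
volume_low = math.prod(low for low, _ in dims.values())
volume_high = math.prod(high for _, high in dims.values())
worst_case_span = span_in_orders(volume_low, volume_high)

# If the three errors are independent, log-space spreads add in quadrature.
spans = [span_in_orders(low, high) for low, high in dims.values()]
independent_span = math.sqrt(sum(s * s for s in spans))
```

Here the worst-case volume range spans three orders of magnitude, while the quadrature estimate is about 1.7: wider than any single dimension's one order, but narrower than the straight product.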

comment by datadataeverywhere · 2010-12-20T22:56:14.758Z · LW(p) · GW(p)

I got 8 / 10, and was very close on one of the incorrect guesses (33,000 - 133,000km for the coastline, reported as >135,600km). On the other hand, I used much more than the allotted 10 minutes, and did a Fermi calculation for each value. I'm not sure why that should be disallowed.

Replies from: sfb

↑ comment by sfb · 2010-12-22T23:24:08.828Z · LW(p) · GW(p)

Because you could have more quickly raised your estimate by a large amount to make sure you got it "right".

Why spend a long time getting 133,000Km when you could have put 133,000,000,000,000,000,000,000,000Km?

Because, the author claims, you think "narrow range is better, looks smarter" even though that's not what was asked for. You spent a long time making it 'more accurate' and consequently got it wronger with respect to what the question was asking for.

Replies from: datadataeverywhere

↑ comment by datadataeverywhere · 2010-12-23T05:20:12.448Z · LW(p) · GW(p)

I realize that an easy way to cheat is to answer (0 (appropriate unit), 3^^^3 (appropriate unit)) for questions 1-9, and answer (pi^e, pi^e) for question 10. That seems to be the "wrong" way to go about this task.

I wanted to deliver, to the best of my current knowledge (without looking anything up), 5% and 95% bounds for the true value, for each item. Where my knowledge was more limited, that meant a wider bound, but that shouldn't mean less effort to establish that bound, should it? That seems to be what you're implying.

Obviously, we need to learn that narrower ranges are not better, but if we want 90% ranges, we should work to ensure that the ranges are as close to 90% as our knowledge allows, not 99% just because we're reversing one kind of stupidity in order to achieve another.

comment by albert · 2011-12-07T01:25:45.854Z · LW(p) · GW(p)

How do you measure a coastline? Isn't it a subjective measurement depending on the scale of precision/resolution (since the actual geometry is fractal)?

I skipped that question since I couldn't figure out the standard

See this for instance: http://en.wikipedia.org/wiki/Coastline_paradox

comment by Desrtopa · 2010-12-21T23:08:17.124Z · LW(p) · GW(p)

8 out of 10, and one of the ones I got wrong was the blue whale one, where I undershot by a tiny margin (if I had given my value in metric tons, I would have been right, but I know I meant short tons.)

I would have gotten about half of them wrong, but then I looked over them and realized I was leaning too heavily on an expectation of actual knowledge, and if I didn't actually know the answers with any precision, I should just be providing myself with vast margins of error.

comment by RobinZ · 2010-12-21T22:28:46.628Z · LW(p) · GW(p)

I call shenanigans on some of those answers - I had the surface temperature of the Sun to two significant figures, and he rounded it to one. :P

To my credit, though, I hit 30% through honest stupidity, rather than misunderstanding calibration. My understanding of geography is clearly worse than Columbus's, for example.

comment by red75 · 2010-12-21T15:01:16.041Z · LW(p) · GW(p)

7 of 10. I underestimated the area of the Asian (Eurasian?) continent by a factor of 4 (safety margin: one order of magnitude), the quantity of US dollars by a factor of 10 (safety margin: 3 orders of magnitude) and the volume of the Great Lakes by a factor of 0.1 (safety margin: 3 orders of magnitude). Other safety margins were 3 orders of magnitude for the Titanic, the Pacific coast (fractal-like curves can be very long) and book titles, and 0.5 from the mean value for the others. Sigh, I thought I'd get 90%.

Hm, I estimated area of Asian continent as area of triangle with 10000km base (12 timezones for 20000 km and factor of 0.5 for pole proximity) and 10000km height (north pole to equator), and lost one order of magnitude in calculation.

comment by Emile · 2010-12-21T13:54:41.306Z · LW(p) · GW(p)

I got 8 out of 10 too (underestimated the lakes by ten orders of magnitude >_> and my upper bound for the whale was close (I had revised it from 100 to 150 tons, still wasn't enough)), without gaming the test in obvious ways.

Again, I found there are a good many people in the comments who failed the test and, instead of noticing how they could improve themselves, start making up excuses for how they answered the right way anyway. 90% estimate means 90%, not "What you would answer if your pointy-haired boss asked you for a 90% estimate"!

I do hope that if I did fail a test I wouldn't do that. Hopefully ranting in public about the stupidity of people who explain their failures away instead of acknowledging them will make me more likely to be honest :)

comment by ArisKatsaris · 2010-12-21T13:52:06.011Z · LW(p) · GW(p)

I got 7 right. Two I got wrong (I overestimated Asia's area & Titanic's tickets). One I misread (the Great Lakes, I tried to calculate their surface area, instead of their volume).

comment by taw · 2010-12-21T11:37:29.923Z · LW(p) · GW(p)

8 out of 10.

I got the area of Asia wrong in an interesting way. My estimated ratio of the area of Asia to the area of Earth was correct. My estimated area of Earth was wrong because I misremembered the formula for computing a sphere's surface area from its diameter.

I got the Pacific very wrong by essentially estimating the length of an extremely smoothed coastline, not the real one.

Both were clear cases of overconfidence.

Things I was fully aware I knew very little about like Great Lakes or numbers of books published, I gave suitably wide ranges, and hit both correctly.

comment by Normal_Anomaly · 2010-12-21T01:10:41.848Z · LW(p) · GW(p)

I got 5 right, and was off by 1 degree of latitude on the Shanghai one. I kinda knew I was nowhere near 90% confidence, because 5 of my estimates spanned 2 or more orders of magnitude and I wanted them to be at least somewhat meaningful.

Replies from: rwallace

↑ comment by rwallace · 2010-12-21T04:10:55.801Z · LW(p) · GW(p)

I am still of the opinion, though, that if I think an estimate spanning two or more orders of magnitude (as some of mine did in this test -- that's the only way I was able to get them all right) would be considered meaningless/badly received, it's better to say "I don't know" than claim accuracy I know I don't have.

Replies from: jferguson, Emile, Normal_Anomaly

↑ comment by jferguson · 2010-12-21T06:08:50.593Z · LW(p) · GW(p)

Not ironically, there are ancient posts from Eliezer and Robin concerning exactly this: "I Don't Know." and "You Are Never Entitled to Your Opinion"

Replies from: Sniffnoy

↑ comment by Sniffnoy · 2010-12-21T23:22:30.308Z · LW(p) · GW(p)

Actually I found the exercise interesting for that reason. On most of them I had what I considered no idea, but the requirement to get actual numbers forced me to clarify just what the limits on "don't know" were. (Only one I got wrong by its standards was the Pacific coastline one. I did the area/volume ones by starting by estimating the size of Connecticut...)

↑ comment by Emile · 2010-12-21T21:55:23.205Z · LW(p) · GW(p)

Saying the weight of the heaviest whale is "somewhere between 1 and 1000 tons" is just a nerdy and technical way of saying "I have no frickin' idea".

↑ comment by Normal_Anomaly · 2010-12-21T12:18:24.275Z · LW(p) · GW(p)

Definitely. In the real world, if somebody had asked me the length of the Pacific coastline or the number of books published in the US, I would say I had no clue. I do like this test even though I'm kvetching about it, it's interesting and maybe useful.

comment by jferguson · 2010-12-21T00:53:37.947Z · LW(p) · GW(p)

I think it's an important skill in general to be able to estimate things, though I might just think that because I got a 9/10 on that test.

Good estimation may not always be useful in the real world if you're giving someone else an estimate on how long something will take (wide estimates are perceived as bad estimates by most, as many of the comments on that blog show intentionally or unintentionally), but it is fun, and I've seen it be personally useful before.

comment by bentarm · 2010-12-20T23:02:47.027Z · LW(p) · GW(p)

8/10, and, as mentioned in the explanation on the second blog post, I did feel like I was giving ridiculously wide ranges for some of them. Turns out I wasn't ridiculous enough. As one commenter points out, a more interesting exercise might be to do the same but with 50% ranges - then the simple solution of "make an estimate and put 3 orders of magnitude each side" doesn't help.

Also, the length of the coastline of the Pacific, which is one of the ones I got right, is, as HonoreDB and several commenters on the blog post have pointed out, undefined - it depends on what length of ruler you use.

comment by prase · 2010-12-20T22:35:23.747Z · LW(p) · GW(p)

Damn, I was correct only 6 times out of 9 (I misunderstood one of the questions). And I knew both that people are overconfident and that even when told about that they don't compensate sufficiently.
