comment by gjm · 2017-01-30T12:51:04.403Z
I'm not a fan of the traditional method -- I am particularly unenthusiastic about the way it depends on allowing only a limited number of specific probability estimates -- but I could do with a little more information and/or persuasion before being convinced that this proposal is Doing It Right.
If I have one of your graphs, how do I (1) quantify (if I want to) how well or badly I'm doing, and (2) figure out what I need to change, and by how much?
Consider the graph you took from Slate Star Codex (incidentally, you have a typo -- it says "Start" in your post at present). If I'm Scott looking at that graph, I infer that maybe I should trust myself a little more when I feel 70% confident of something, and that maybe I'm not distinguishing clearly between 70% and 80%; and that when I feel like I have just barely enough evidence for something to mention it as a 50% "prediction", I probably actually have a little bit more. And, overall, I see that across the board I'm getting my probabilities reasonably close, and should probably feel fairly good about that.
(Note just in case there should be the slightest doubt: I am not in fact Scott.)
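To make the traditional reading concrete: for each confidence level you just compare the claimed probability with the observed frequency of correct predictions. A minimal Python sketch of that arithmetic (the counts here are invented for illustration, emphatically not Scott's actual numbers):

```python
# Traditional per-bin calibration check: claimed probability vs
# observed frequency. Counts are hypothetical, for illustration only.

claimed = [0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99]
right   = [10, 12, 14, 20, 24, 18, 10]   # hypothetical: predictions that came true
wrong   = [ 8,  8, 10,  5,  3,  1,  0]   # hypothetical: predictions that didn't

for p, r, w in zip(claimed, right, wrong):
    observed = r / (r + w)
    print(f"claimed {p:.2f}: observed {observed:.2f} over {r + w} predictions "
          f"(gap {observed - p:+.2f})")
```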
On the other hand, if I'm Scott looking at this

[graph: Scott's calibration data re-plotted as a cumulative curve, per the post's method]
which is, if I've done it right, the result of applying your approach to his calibration data ... well, I'm not sure what to make of it. By eye and without thinking much, it looks as if it gets steadily worse at higher probabilities (which I really don't think is a good description of the facts); since it's a cumulative plot, perhaps I should be looking at changes in the gap sizes, in which case it correctly suggests that 0.5 is bad, 0.6 isn't, 0.7 is, 0.8 isn't ... but it gives the impression that what happens at 0.9-0.99 is much worse than what happens at lower probabilities, and I really don't buy that. And (to me) it doesn't give much indication of how good or bad things are overall.
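For what it's worth, here is my reconstruction of how the cumulative reading works, in the same sketch form (so caveat lector, and the counts are the same invented ones as above): the plotted quantity is the gap between expected and observed cumulative successes, and it's the *change* in that gap from one bin to the next that carries the per-bin signal:

```python
# Cumulative reading (my reconstruction of the post's method; counts
# are still hypothetical). The gap between expected and observed
# cumulative successes is what gets plotted; the change in the gap
# from one bin to the next recovers the per-bin calibration signal.

claimed = [0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99]
right   = [10, 12, 14, 20, 24, 18, 10]
wrong   = [ 8,  8, 10,  5,  3,  1,  0]

cum_expected = 0.0   # successes expected so far under perfect calibration
cum_observed = 0     # successes actually observed so far
prev_gap = 0.0
for p, r, w in zip(claimed, right, wrong):
    cum_expected += p * (r + w)
    cum_observed += r
    gap = cum_observed - cum_expected
    print(f"up to {p:.2f}: cumulative gap {gap:+.1f} "
          f"(change from previous bin {gap - prev_gap:+.1f})")
    prev_gap = gap
```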
Do you have some advice on how to read one of your graphs? What should Scott make of the graph shown above? Do you think the available evidence really does indicate a serious problem around 0.9-0.99?
I also wonder if there's mileage in trying to include some sort of error bars, though I'm not sure how principled a way there is to do that. For instance, we might say "well, for all we know the next question of each type might have gone either way" and plot corresponding curves with 1 added to all the counts:

[graph: the cumulative curves re-plotted with 1 added to all the counts, giving rough upper and lower bands]
The way the ranges overlap at the right-hand side seems to me to back up what I was saying above about the data not really indicating a serious problem for probabilities near 1.
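For the record, the crude adjustment behind those bands, once more in sketch form with invented counts (shown per bin for simplicity; the same adjustment applies to the cumulative curves, and a Beta-posterior interval would be the more principled version of the same idea):

```python
# Crude "error band": pretend the next prediction in each bin went the
# other way, i.e. add 1 to the counts. Ad hoc rather than principled;
# counts are hypothetical.

claimed = [0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99]
right   = [10, 12, 14, 20, 24, 18, 10]
wrong   = [ 8,  8, 10,  5,  3,  1,  0]

for p, r, w in zip(claimed, right, wrong):
    centre = r / (r + w)
    low    = r / (r + w + 1)         # as if the next prediction had failed
    high   = (r + 1) / (r + w + 1)   # as if the next prediction had succeeded
    print(f"claimed {p:.2f}: observed {centre:.2f}, "
          f"rough band [{low:.2f}, {high:.2f}]")
```

Note how the bands are widest exactly where the counts are smallest, which is why the high-probability end of the plot looks so uncertain.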