Bayesian examination

post by Lê Nguyên Hoang (le-nguyen-hoang-1) · 2019-12-09T19:50:53.348Z · score: 85 (37 votes) · LW · GW · 54 comments

Contents

  Let's take an example
  Why naive notation is bad
  Incentive-compatible scoring rules
  Why this is important
  Where to go from here
54 comments

A few months ago, Olivier Bailleux, a professor of computer science and a reader of my book on Bayesianism, sent me an email. He suggested applying some of the ideas of the book to the examination of students. He proposed Bayesian examination.

I believe it to be a brilliant idea, which could have an important impact on how many people think. At least, I think that this is surely worth sharing here.

tl;dr Bayesian examinations seem very important to deploy because they incentivize both probabilistic thinking and intellectual honesty. And, as argued by Julia Galef in this talk, incentives seem critical to changing our thinking habits.

Let's take an example

Where is the International Olympic Committee?
1. Geneva
2. Lausanne
3. Zurich
4. Lugano

Quite often, students are asked to select one of the four possible answers. But this is arguably pretty bad, for several reasons:
- It makes it impossible to distinguish a student who has a hunch from a student who really studied and knew the answer.
- It gives students the habit of self-identifying with a single answer.
- It normalizes deterministic question answering.
- It motivates students to defend the answer they gave (which encourages the motivated reasoning fallacy...).

Instead, Bayesian examination demands that students provide probabilistic answers. In other words, they will have to provide a percentage for each answer.

In our case, a student, call her Alice, might thus answer
1. 33%
2. 33%
3. 33%
4. 1%
Alice would essentially be formalizing the sentence "I really don't know but I would be very surprised if Lugano was the right answer".

Another student, let's call him Bob, might answer
1. 5%
2. 40%
3. 50%
4. 5%
Bob might have in mind something like "I know that FIFA and the IOC are in Zurich and Lausanne, but I don't remember which is where; though Zurich is larger so it would make sense for the IOC to be in Zurich rather than Lausanne".

Spoiler: the answer turns out to be Lausanne.

Why naive notation is bad

Now, how would such an exam be scored? One intuitive idea could be that Alice should thus get 0.33 points, while Bob should get 0.4 points. Denoting $q_i$ the probability assigned by a student to answer $i$, and $i^*$ the right answer, this would correspond to giving the student a score equal to $q_{i^*}$.

This would not be a great idea though. The reason for this has to do with incentives. Indeed, it turns out that if the above figures are the credences $p_i$ of Alice and Bob, then Alice and Bob would be incentivized to maximize their expected scores $\sum_i p_i q_i$. It turns out that this maximization leads to the following answers.

For Alice:
1. Credence $p_1 = 33\%$ in Geneva, but answers $q_1 = 100\%$.
2. Credence $p_2 = 33\%$ in Lausanne, but answers $q_2 = 0\%$.
3. Credence $p_3 = 33\%$ in Zurich, but answers $q_3 = 0\%$.
4. Credence $p_4 = 1\%$ in Lugano, but answers $q_4 = 0\%$.

For Bob:
1. Credence $p_1 = 5\%$ in Geneva, but answers $q_1 = 0\%$.
2. Credence $p_2 = 40\%$ in Lausanne, but answers $q_2 = 0\%$.
3. Credence $p_3 = 50\%$ in Zurich, but answers $q_3 = 100\%$.
4. Credence $p_4 = 5\%$ in Lugano, but answers $q_4 = 0\%$.

In other words, this naive scoring incentivizes the exaggeration of beliefs towards deterministic answers. This is very, very, very, very, very bad (sorry, I'm a bit of a Bayesian extremist!). This favors polarization, rationalization, groupism and so many other root causes of poor debating.
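To make the incentive concrete, here is a minimal Python sketch. It assumes the naive score is simply the probability reported on the right answer, so that the expected score under a given credence is the sum of credence times reported probability. The function names are mine, chosen for illustration only.

```python
def expected_naive_score(credence, answer):
    """Expected naive score: sum over options of credence * reported probability."""
    return sum(p * q for p, q in zip(credence, answer))


def all_in(credence):
    """Report 100% on the option with the highest credence (first one on ties)."""
    best = max(range(len(credence)), key=lambda i: credence[i])
    return [1.0 if i == best else 0.0 for i in range(len(credence))]


bob = [0.05, 0.40, 0.50, 0.05]  # Bob's credences from the example above

honest = expected_naive_score(bob, bob)                # reporting his true credences
exaggerated = expected_naive_score(bob, all_in(bob))   # going all-in on Zurich

print(f"honest: {honest:.3f}, exaggerated: {exaggerated:.3f}")
```

Under these assumptions, Bob expects about 0.415 points when he reports honestly and 0.5 points when he goes all-in on Zurich, so the naive rule pays him to exaggerate.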

Indeed, while students may not consciously find out that this exaggeration strategy is optimal, we should expect them to eventually try it and to notice, perhaps unconsciously, that it is not so bad. In particular, this prevents them from valuing the extra effort of probabilistic thinking.

Fortunately, there are better scoring rules.

Incentive-compatible scoring rules

An incentive-compatible scoring rule is called a proper scoring rule. But I'm not keen on the terminology, as it's not transparent, so I'll stick with incentive-compatible scoring rule. Such incentive-compatible scoring rules are such that truth-telling (or rather "credence-telling") is incentivized.

There are several incentive-compatible scoring rules, like the logarithmic scoring rule ($\log q_{i^*}$, where $q_{i^*}$ is the probability reported for the right answer $i^*$) or the spherical scoring rule ($q_{i^*} / \|q\|_2$). But I think that the most appropriate one may be the quadratic scoring rule, because it is the simplest and easiest for students to verify.

In our case, given that the right answer was Lausanne, the score of a student who answered $q_1$, $q_2$, $q_3$ and $q_4$ is $1 - q_1^2 - (1-q_2)^2 - q_3^2 - q_4^2$. In other words, for each possibility $i$, the student loses the square of the distance between their answer $q_i$ and the true answer (0% or 100%).

In our case, Alice would win about $0.333$ points, while Bob would win $0.385$ points. Of course, the right answer would win 1 point, while any maximally wrong answer like $(100\%, 0\%, 0\%, 0\%)$ would lose 1 point.

Perhaps more interestingly, a maximally ignorant student who answers 25% to each possibility would win $0.25$ points. This is much better than the expected score of a random deterministic guess, which equals $-0.5$. Exaggerated guesses get greatly penalized. In fact, they yield negative points!

Formally, the quadratic scoring rule equals $S(q) = 1 - \|q - e_{i^*}\|_2^2$, where $e_{i^*}$ is the basis vector whose entries are all zeros except for the $i^*$-th coordinate, which is 1. If there are $n$ answers, then the maximally ignorant student wins $1/n$, while the random deterministic guesser wins an expectation of $-1 + 2/n$.
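These figures are easy to check by hand or with a few lines of code. Here is a minimal sketch, assuming the quadratic score is 1 minus the squared Euclidean distance between the student's answer and the one-hot true answer. The function and variable names are illustrative.

```python
def quadratic_score(answer, correct):
    """Quadratic score: 1 minus the squared distance to the one-hot true answer."""
    return 1.0 - sum(
        (q - (1.0 if i == correct else 0.0)) ** 2 for i, q in enumerate(answer)
    )


LAUSANNE = 1  # zero-based index of the right answer

alice = [0.33, 0.33, 0.33, 0.01]
bob = [0.05, 0.40, 0.50, 0.05]
ignorant = [0.25] * 4

alice_score = quadratic_score(alice, LAUSANNE)
bob_score = quadratic_score(bob, LAUSANNE)
ignorant_score = quadratic_score(ignorant, LAUSANNE)

# A random deterministic guesser goes all-in on one option chosen uniformly.
n = 4
guesses = [[1.0 if i == k else 0.0 for i in range(n)] for k in range(n)]
random_guess_score = sum(quadratic_score(g, LAUSANNE) for g in guesses) / n

print(alice_score, bob_score, ignorant_score, random_guess_score)
```

With these inputs, Alice scores about 0.333, Bob 0.385, the maximally ignorant student 0.25, and the random deterministic guesser averages -0.5.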

Note also that $E[S(q)|p] = ||p||_2^2 - ||q-p||_2^2$, where $p$ is the credence and $q$ is the answer. This is clearly maximal for $q=p$. In fact, interestingly, it is maximal at $q=p$ even if we allow $q \in \mathbb R^n$ (i.e. even if we don't tell students that their probabilistic answers need to add up to 1, they will eventually learn that this is the way to go). In particular, the honest answer yields an expected score of $E[S] = ||p||_2^2$, which indeed reflects the uncertainty of the student.
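For readers who want to see where this identity comes from, here is a short derivation. It assumes the quadratic score is $1 - \|q - e_i\|_2^2$ when answer $i$ is the right one (with $e_i$ the $i$-th basis vector, a notation used here for the sketch) and that the credences $p_i$ sum to 1:

$$
\begin{aligned}
E[S(q)|p] &= \sum_i p_i \left(1 - \|q - e_i\|_2^2\right) \\
&= 1 - \sum_i p_i \left(\|q\|_2^2 - 2 q_i + 1\right) \\
&= 2 \langle p, q \rangle - \|q\|_2^2 \\
&= \|p\|_2^2 - \|q - p\|_2^2 .
\end{aligned}
$$

The only term that depends on the answer $q$ is $-\|q - p\|_2^2$, so the expected score is largest exactly when $q = p$.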


Why this is important

Because wrong answers are much more penalized than acknowledging ignorance, students who aim to maximize their scores will likely eventually learn, consciously or not, that guessing deterministic answers is just wrong. They may even learn the habit of second-guessing their intuitions, and to add uncertainty to their first guesses. In terms of rationality, this seems like a huge deal!

Perhaps equally importantly, such Bayesian examinations incentivize students to take on probabilistic reasoning. Students may thereby learn to constantly and appropriately measure their levels of confidence, and to reason with (epistemological) uncertainty. As an aspiring Bayesian, this is the part I'm most excited about!

Finally, and probably even more importantly, such examinations incentivize intellectual honesty. This is the habit of trying to be honest, not only with others, but also with ourselves. It's sometimes said that "a bet is a tax on bullshit", as argued by Alex Tabarrok. Arguably, Bayesian examinations are even better than a bet. Indeed, in (important) exams, we might be making an even bigger effort than when we put our money where our mouth is!

In case you're still not convinced by the importance of intellectual honesty, I highly recommend this talk by Julia Galef or her upcoming book (as well as, say, Tetlock and Gardner's Superforecasting book).

Where to go from here

I haven't had the chance to test these ideas though. I wonder how students and teachers will feel about it. I suspect some pushback early on. But I would also bet that students may eventually appreciate it. To find out, I guess this really needs to be tested out there!

One great first step could be MOOCs and other online platforms where people enter their answers electronically. If you happen to be working in such areas, or to know people working in these areas, I think it would be great to encourage a trial of Bayesian examinations! And if you do, please send me feedback. And please let me test your exams as well :P

Still another approach would be to develop an app to record Bayesian bets that we make, and to compute our incentive-compatible (quadratic?) scores. Gamifying the app might make it more popular. If anyone is keen on developing such an app, I'd be more than eager to test it, and to train my own Bayesian forecasting abilities!

PS: If you're French-speaking (or motivated to read subtitles), you can also check out the video I made on the same topic.


54 comments

Comments sorted by top scores.

comment by Vaniver · 2019-12-11T08:25:16.533Z · score: 16 (9 votes) · LW(p) · GW(p)

A class I took in graduate school worked this way; here's the professor's paper about it. Some notes on how it worked:

  1. He used the logarithmic scoring rule, and normalized it such that a maxent guess was 0 points. 
  2. It takes students a while to learn calibration, and so it's worth doing many small-stakes versions of this before doing large-stakes versions of it. (The way he did this--one question as a homework assignment each week, and then one or two large exams--didn't do all that well for this, especially since the homework assignments didn't fully replicate the "how well can I interpret the question without asking for clarification?" part of the uncertainty that was relevant on tests.)
  3. Getting probabilities from the students lets you generate average probabilities for each answer, which is actually quite useful at figuring out where the class is confused. Importantly, you can tell the difference between a question where the average estimate on the right answer is 90% and one where the average estimate on the right answer is 50%, even though both of those will look almost identical in the world where students only choose their top answer!
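
Points 1 and 3 above can be sketched in a few lines of Python. This is only an illustration: the professor's exact normalization and log base are not given here, so the sketch assumes a natural-log score shifted by the score of a uniform answer, and the three students are hypothetical.

```python
import math


def normalized_log_score(answer, correct):
    """Log score shifted so that a uniform (maxent) answer scores exactly 0."""
    n = len(answer)
    return math.log(answer[correct]) - math.log(1.0 / n)


def class_average(answers):
    """Average probability the class put on each option."""
    n_students = len(answers)
    return [sum(column) / n_students for column in zip(*answers)]


# Hypothetical class of three students answering one four-option question.
answers = [
    [0.25, 0.25, 0.25, 0.25],
    [0.05, 0.40, 0.50, 0.05],
    [0.10, 0.70, 0.10, 0.10],
]
correct = 1  # zero-based index of the right answer

scores = [normalized_log_score(a, correct) for a in answers]
averages = class_average(answers)
print(scores)
print(averages)
```

The first student scores 0, as intended for a maxent guess, and the class average on the right answer (0.45 here) is the kind of number that single-choice answers would hide.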
comment by Panashe Fundira (panashe-fundira) · 2019-12-11T21:24:59.036Z · score: 2 (2 votes) · LW(p) · GW(p)

As a student, did you experience any particular frustrations with this approach?

comment by Vaniver · 2019-12-12T06:14:24.512Z · score: 8 (4 votes) · LW(p) · GW(p)

I mean, I personally was quite overconfident on the first midterm. ;) The primary reason was explicitly thinking it through and deciding that I wasn't risk-neutral when it came to points; I cared more about having 'the highest score' than maximizing my expected score.

It also takes a bit longer to process questions; rather than just bubbling in a single oval, you have to think about how you want to budget your probability for each question, and it's slightly harder for the teacher to process answers to get grades. But I think it more than pays for itself in the increased expressiveness. 

comment by ChristianKl · 2019-12-12T09:49:55.025Z · score: 2 (1 votes) · LW(p) · GW(p)

It being harder for the teacher to process seems to be a feature of bad software support. Ideally you would want to automate the whole process.

comment by Vaniver · 2019-12-12T19:05:15.756Z · score: 3 (1 votes) · LW(p) · GW(p)

If you have a digital exam, this works fine; if you want students to write things with pencil and paper, then you need to somehow turn the pencil marks into numbers that can be plugged into a simple spreadsheet. 

comment by Dr_Manhattan · 2019-12-09T22:24:52.982Z · score: 9 (4 votes) · LW(p) · GW(p)

Ross https://web.stanford.edu/~shachter/ uses something like this to score answers to (appropriately) Decision Analysis homework questions. (Don't remember the exact rule, but the intent was the same)

comment by Dr_Manhattan · 2019-12-11T13:25:47.813Z · score: 2 (1 votes) · LW(p) · GW(p)

I'll claim LW priority for pointing to the idea (but not to elaborating it in a post) https://www.lesswrong.com/posts/isSMDR8rMr5pTzJK5/example-of-poor-decision-making-under-pressure-from-game?commentId=YzvDcA357NboxD2fE [LW(p) · GW(p)] :)

comment by Unnamed · 2019-12-10T02:57:43.212Z · score: 2 (1 votes) · LW(p) · GW(p)

I recall hearing about classes at Carnegie Mellon (in the Social and Decision Sciences department) which gave exams in this sort of format.

comment by Korz · 2019-12-11T22:50:49.657Z · score: 8 (3 votes) · LW(p) · GW(p)

Tangentially relevant: I think that adopting Bayesian examination widely in society would decrease the number of people with aversion to maths/science/lawful thinking:

In my personal experience, thinking in probabilities feels much more natural* than 'hard' true-false thinking. I think that this aspect of lawful thinking plays an important role in many people deciding that "maths/science/... is not for me" and creating an Ugh field [LW · GW] around them, and I think that Bayesian examinations as a default for examinations would be likely to shift the general opinion towards feeling comfortable with lawful thinking.

____

*: in the sense of "I can apply this kind of thinking also without using 'my logic-module'"; "Universal law" of the sequences [? · GW] has as a main point that most human thinking is based on leaky abstractions, which are very compatible with probabilistic reasoning

comment by Pattern · 2019-12-12T17:48:09.879Z · score: 1 (1 votes) · LW(p) · GW(p)

Multiple approaches might help - if a subject is predominantly taught one way, then people who don't learn well that way seem worse off.

Being comfortable with a subject might require being good at it. Being good at it might require practice. Practice might require not fearing failure*. (A sense of play is ideal. That which is self driven can last - it will be practiced and retained, or relearned if not, when it proves to be missing. Absent practice things aren't learned once, but multiple times.)

*Ugh fields (aversion to X) are built from smaller ugh fields (aversion to failure), in this model. Given the impact of spaced repetition on learning "information", repetition might have something to do with "aversions" - they are learned.

comment by ESRogs · 2019-12-09T22:51:51.501Z · score: 8 (4 votes) · LW(p) · GW(p)

I like this. One suggestion: give students a small, predefined set of credences they're allowed to use, so they don't have to spend time deciding between e.g. 33% and 34%.

I'd suggest: 10%, 25%, 50%, 75%, and 90%, or maybe just all the deciles.

(Note: I believe it's not critical w/ the given scoring rule that the probabilities sum exactly to 100%. So you might end up guessing 10, 10, 25, 50, and that's okay. Someone please correct me if this is wrong.)

comment by PeterMcCluskey · 2019-12-11T05:28:36.078Z · score: 6 (3 votes) · LW(p) · GW(p)

I'm unclear on the benefits. I can see how it's sometimes faster, but I'm unclear why faster is an important criterion, and it will sometimes be slower: if I quickly generated the first example in the post of 33,33,33,1, then I'd likely slow down a good deal trying to decide which of those 33s to turn into a 30 and which to turn into a 40.

Whereas it seems clearly valuable to encourage a habit of having the probabilities sum to 100%. That's not for the scoring rule, it's for developing good intuitions.

comment by ESRogs · 2019-12-11T22:38:33.232Z · score: 3 (2 votes) · LW(p) · GW(p)
if I quickly generated the first example in the post of 33,33,33,1

If you're doing a test with a bunch of these questions, and you have the same probabilities to choose from every time, then I'd expect you to quickly generate those numbers instead, and not have this problem.

comment by aaq · 2019-12-10T17:38:58.842Z · score: 3 (2 votes) · LW(p) · GW(p)

You could normalize the scoring rule back to 1, so that should be fine.

comment by Pattern · 2019-12-10T05:44:30.631Z · score: 3 (2 votes) · LW(p) · GW(p)

Probabilities could be done with fractions (though there's the risk of error). Odds could also be given.

comment by Alexei · 2019-12-09T21:54:31.464Z · score: 6 (3 votes) · LW(p) · GW(p)

Wow, this seems huge to me! Great find! I’m not in the business of constructing tests, but if I ever do any home schooling with my children, I’ll definitely try this method.

comment by Korz · 2019-12-11T21:46:55.978Z · score: 5 (3 votes) · LW(p) · GW(p)
It turns out that this maximization leads to the following answers.
For Alice:
1. Credence p1=33% in Geneva, but answers q1=100%.
2. Credence p2=33% in Lausanne, but answers q2=0%.
3. Credence p3=33% in Zurich, but answers q3=0%.
4. Credence p4=33% in Lugano, but answers q4=0%.

I am surprised by these numbers:

i) I assume that p4=33% and not p4=1% is a typo?

ii) Also, when reading that q1=100%, while q2,q3= 0%, I was surprised. As p1,p2 and p3 are the same, (if I am not mistaken) Alice should be free to arbitrarily divide her probability mass between these three? Given that, I expected her to choose q1=q2=q3. In case others were confused by this detail too, it might be worth it to slightly complicate the example (along the lines of 'Alice remembers an ambitious athlete friend being invited to Geneva once' and using this as tie breaker for the honest probabilities)


comment by Bucky · 2019-12-10T22:39:00.672Z · score: 5 (3 votes) · LW(p) · GW(p)

Great post, I'd be really interested to hear how this goes down with students.

I would be cautious about using information from incorrect answers to calculate the score - just using the percentage given for the correct answer still gives a proper scoring rule. If percentages placed on incorrect answers are included then you get 50:25:25:0 giving more points than 50:50:0:0 when the answer is A which I think people might find hard to swallow.

For a proper scoring rule I find a particular framing of a log score to be intuitive - instead of adding the logs of the probabilities placed on the correct answers, just multiply out the probabilities.

This can be visualised as having a heap of points and having to spread them all across the 4 possible answers. You lose the points that were placed on the wrong answers and then use your remaining points to repeat the process for the next question. Whoever has the most points left at the end has done the best. The £100k drop is a game show which is based on this premise.

I personally find this to be an easy visualisation with the added benefit that the scores have a specific Bayesian interpretation - the ratio of students’ scores represents the likelihood function of who knows the subject best based on the evidence of that exam.
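
A minimal sketch of this framing, with two hypothetical students and made-up probabilities: multiplying out the shares placed on the right answers gives the same ranking as adding their logs, and the ratio of the remaining points equals the exponential of the gap in log scores.

```python
import math


def remaining_points(probs_on_correct, start=100.0):
    """Keep only the share of points placed on the right answer, question by question."""
    points = start
    for p in probs_on_correct:
        points *= p
    return points


def log_score(probs_on_correct):
    """Usual additive log score over the same questions."""
    return sum(math.log(p) for p in probs_on_correct)


# Hypothetical: probability each student placed on the right answer of three questions.
student_a = [0.5, 0.8, 0.4]
student_b = [0.25, 0.6, 0.4]

ratio = remaining_points(student_a) / remaining_points(student_b)
log_gap = log_score(student_a) - log_score(student_b)

print(ratio, math.exp(log_gap))  # the two numbers agree up to rounding
```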

comment by Gurkenglas · 2019-12-11T02:45:10.250Z · score: 3 (2 votes) · LW(p) · GW(p)

Note that linear utility in money would again incentivize people to put everything on the largest probability.

comment by Bucky · 2019-12-11T07:21:28.964Z · score: 2 (1 votes) · LW(p) · GW(p)

This is true if scores from different questions are added but not if they are multiplied. Linear scoring with multiplication is exactly the same as log scoring with addition, just easier to visualise (at least to me)

comment by Gurkenglas · 2019-12-11T09:17:48.491Z · score: 2 (1 votes) · LW(p) · GW(p)

Wrong. In the 100k drop, if you know each question has odds 60:40, expected winnings are maximized if you put all on one answer each time, not 60% on one and 40% on the other.

What's not preserved between the two ways to score is which strategy maximizes expected score.
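
A small numerical sketch of this point, assuming a single two-option question with 60:40 odds and a grid of possible splits. The function names are illustrative. Maximizing expected money favors going (almost) all-in, while maximizing expected log money favors the honest 60:40 split.

```python
import math


def expected_money(f, p=0.6):
    """Expected fraction of the stake kept when a share f is put on the p-likely answer."""
    return p * f + (1 - p) * (1 - f)


def expected_log_money(f, p=0.6):
    """Expected log of the fraction kept for the same split."""
    return p * math.log(f) + (1 - p) * math.log(1 - f)


# Shares from 1% to 99% placed on the answer believed to be 60% likely.
grid = [i / 100 for i in range(1, 100)]

best_linear = max(grid, key=expected_money)
best_log = max(grid, key=expected_log_money)

print(best_linear, best_log)
```

On this grid the linear criterion picks the largest available share (0.99), whereas the log criterion picks 0.6.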

comment by Bucky · 2019-12-11T12:21:34.489Z · score: 3 (2 votes) · LW(p) · GW(p)

I think the 100k drop analogy may be misleading when thinking about the final result. The final score in the version I envisage is judged on ratios between results, rather than absolute values (my explanation maybe isn't clear enough on this). In that case putting everything on the answer which you have 60% confidence in and being right gives a ratio of 1.67 in your favour over an honest reporting. But if you do it and get it wrong then there is an infinite ratio in favour of the honest reporting.

comment by ChristianKl · 2019-12-11T14:12:36.163Z · score: 2 (1 votes) · LW(p) · GW(p)
If percentages placed on incorrect answers are included then you get 50:25:25:0 giving more points than 50:50:0:0 when the answer is A which I think people might find hard to swallow.

It seems to me like a good lesson.

It seems to me that "it's a coin toss between A and B (C and D are impossible)" is validly scored less well than "A is twice as likely as B and A is twice as likely as C".

comment by Bucky · 2019-12-11T14:32:33.334Z · score: 2 (1 votes) · LW(p) · GW(p)

Why?

50:50:0:0 says it's a coin toss between A and ¬A. If ¬A then B.

50:25:25:0 says it's a coin toss between A and ¬A. If ¬A then it's a coin toss between B and C.

Why should the scoring rule care about what my rule is for ¬A when A is the correct answer?

I'm genuinely curious - I notice you're the second person [LW(p) · GW(p)] to voice this opinion but I can't get my head round it at all.

(As with my reply to aaq, this all assumes that these are genuine confidence levels)

comment by ChristianKl · 2019-12-11T20:27:23.326Z · score: 4 (2 votes) · LW(p) · GW(p)

I've done quite a bit of credence calibration; I'm curious to what extent you have trained it. I have the impression that my intuitions here are informed by experience.

Most normal testing focuses on finding the correct answer. Credence training focuses on accurately having a sense of your own knowledge.

There's a sense in which claiming that one's genuine confidence level is zero feels repulsive, given that zero isn't a probability.

Humans use the availability heuristic to compare different options. The fact that A feels more likely than B and A feels more likely than C seems to me like a form of knowledge that's worth rewarding.


comment by Bucky · 2019-12-11T23:29:30.712Z · score: 2 (1 votes) · LW(p) · GW(p)

I have done some credence training but I think my instincts here are more based on Maths and specifically Bayes (see this comment [LW(p) · GW(p)]).

I think the zero probability thing is a red herring - replace the 0s with $\epsilon$ and the 50s with $50-\epsilon$ and you get basically the same thing. There are some questions where keeping track of the $\epsilon$s just isn't worth it.

A proper scoring rule is designed to reward both knowledge and accurate reporting of credences. This is achieved if we score based on the correct answer, whether or not we also score based on the probabilities of the wrong answers.

If we also attempt to optimise for certain ratios between credences of different answers then this is at the expense of rewarding knowledge of the correct answer.

If Alice has credence levels of 50:50:$\epsilon$:$\epsilon$ and Bob has 40:20:20:20 and the correct answer is A then Bob will get a higher score than Alice despite her putting more of her probability mass on the correct answer.

Do you consider this a price worth paying to reward having particular ratios between credences?

comment by J.D. · 2019-12-12T11:24:26.426Z · score: 1 (1 votes) · LW(p) · GW(p)

[....is at the expense of rewarding knowledge of the correct answer.]

Hmm... I'm not sure that Alice has really more knowledge than Bob in your example.

[EDIT : In fact, in your example, for the quadratic scoring rule, the score of 50:50:$\epsilon:\epsilon$ is better than the score of 40:20:20:20 since $12/25 < 1/2 + 2\epsilon^2$, so that we can indeed say that Alice has more knowledge than Bob after this rule. The following example is, IMHO, more interesting. /EDIT].

Let me propose another perspective with the following two answers for propositions A:B:C:D:

1) 50:50:0:0

2) 50:25:25:0,

where the correct answer is 100:0:0:0.

In this case, 2) has a better score than 1).

What does 1) know ? That D and C are false. He knows nothing for A and B.

What does 2) know ? That D is false. That C is not very probable. He does not know for A, like 1). But he does know moreover that B is probably not the right answer.

Suppose that 3) is someone who knows that D and C are false, and also knows that B is probably not the right answer (i.e., 3 has the knowledge of both 1 and 2). Then 3) could have given the answer 3a) 75:25:0:0, or the answer 3b) 62.5:37.5:0:0. These two answers score better than 1) and 2).

(Note that knowing that D and C are false and that B is probably not the right answer influence your knowledge about A.)

-----

For example, imagine that 2) first thinks of 50:25:25:0, but then he remembers that it can in fact not be C. We can then compute the Bayesian update, and we get:

  • P(A | non C) = 2/3 (vs P(A) = 1/2)
  • P(B | non C) = 1/3 (vs P(B) = 1/4)
  • P(C | non C) = 0 (vs P(C) = 1/4)
  • P(D | non C) = 0 (vs P(D) = 0).

This is different from answer 1). In this sense, I think we can really say that 2) knows something that 1) does not know, even if 2) is not sure that C is false. Indeed, after an update of the information 'non C', the score of 2) becomes better than the score of 1). (2/3:1/3:0:0 has a better score than 1/2:1/4:1/4:0).
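
The update and the resulting scores can be reproduced with a short sketch. It assumes the quadratic scoring rule from the post (1 minus the squared distance to the one-hot true answer) and that A is the right answer; the helper names are illustrative.

```python
def condition_on_not(credence, excluded):
    """Bayesian update on learning that option `excluded` is false."""
    remaining = 1.0 - credence[excluded]
    return [0.0 if i == excluded else p / remaining for i, p in enumerate(credence)]


def quadratic_score(answer, correct):
    """Quadratic score: 1 minus the squared distance to the one-hot true answer."""
    return 1.0 - sum(
        (q - (1.0 if i == correct else 0.0)) ** 2 for i, q in enumerate(answer)
    )


A, B, C, D = range(4)

answer_1 = [0.5, 0.5, 0.0, 0.0]
answer_2 = [0.5, 0.25, 0.25, 0.0]
updated_2 = condition_on_not(answer_2, C)  # answer 2) after learning 'non C'

score_1 = quadratic_score(answer_1, A)
score_2 = quadratic_score(answer_2, A)
score_updated_2 = quadratic_score(updated_2, A)

print(updated_2, score_1, score_2, score_updated_2)
```

Under these assumptions the updated answer is 2/3:1/3:0:0, and the scores are 0.5 for 1), 0.625 for 2), and about 0.778 for 2) after the update.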

comment by gjm · 2019-12-12T16:26:43.722Z · score: 4 (2 votes) · LW(p) · GW(p)

Sure, 2 knows something 1 doesn't; e.g., 2 knows more about how unlikely B is. But, equally, 1 knows something 2 doesn't; e.g., 1 knows more than 2 about how unlikely C is.

In the absence of any reason to think one of these is more important than the other, it seems reasonable to think that different probability assignments among the various wrong answers are equally meritorious and should result in equal scores.

... Having said that, here's an argument (which I'm not sure I believe) for favouring more-balanced probability assignments to the wrong answers. We never really know that the right answer is 100:0:0:0. We could, conceivably, be wrong. And, by hypothesis, we don't know of any relevant differences between the "wrong" answers. So we should see all the wrong answers as equally improbable but not quite certainly wrong. And if, deep down, we believe in something like the log scoring rule, then we should notice that a candidate who assigns a super-low probability to one of those "wrong" answers is going to do super-badly in the very unlikely case that it's actually right after all.

So, suppose we believe in the log scoring rule, and we think the correct answer is the first one. But we admit a tiny probability h for each of the others being right. Then a candidate who gives probabilities a,b,c,d has an expected score of (1-3h) log a + h (log b + log c + log d). Suppose one candidate says 0.49,0.49,0.01,0.01 and the other says 0.4,0.2,0.2,0.2; then we will prefer the second over the first if h is bigger than about 0.0356. In a typical educational context that's unlikely so we should prefer the first candidate. Now suppose one says 0.49,0.49,0.01,0.01 and the other says 0.49,0.25,0.25,0.01; we should always prefer the second candidate.
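
The threshold can be checked numerically. The sketch below uses natural logs and the expected-score expression from the paragraph above; since the gap between two candidates is linear in h, the crossover is found from its values at two points. The function names are illustrative.

```python
import math


def expected_log_score(answer, h):
    """(1-3h) log a + h (log b + log c + log d), with the first option assumed right."""
    a, b, c, d = answer
    return (1 - 3 * h) * math.log(a) + h * (math.log(b) + math.log(c) + math.log(d))


first = (0.49, 0.49, 0.01, 0.01)
second = (0.4, 0.2, 0.2, 0.2)
third = (0.49, 0.25, 0.25, 0.01)


def gap(h):
    """How much better the second candidate does than the first, for a given h."""
    return expected_log_score(second, h) - expected_log_score(first, h)


# gap(h) is linear in h, so its root follows from its values at h = 0 and h = 1.
threshold = -gap(0.0) / (gap(1.0) - gap(0.0))
print(round(threshold, 4))
```

The third candidate assigns the same probability to the first option as the first candidate and spreads the rest more evenly, so it comes out ahead for any positive h.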

None of this means that the Brier score is the right way to prefer the second candidate over the first; it clearly isn't, and if h is small enough then of course the correction to the naive log score is also very small, provided candidates' probability assignments are bounded away from zero.

In practice, hopefully h is extremely small. And some wrong answers will be wronger than others and we don't want to reward candidates for not noticing that, but we probably also don't want the extra pain of figuring out just how badly wrong all the wrong answers are, and that is my main reason for thinking it's better to use a scoring rule that doesn't care what probabilities candidates assigned to the wrong answers.

comment by gjm · 2019-12-17T15:40:13.717Z · score: 2 (1 votes) · LW(p) · GW(p)

Arguably the "natural" way to handle the possibility that you (the examiner) are in error is to score answers by (negative) KL-divergence from your own probability assignment. So if there are four options to which you assign probabilities p,q,r,s and a candidate says a,b,c,d then they get p log(a/p) + q log(b/q) + r log(c/r) + s log(d/s). If p=1 and q,r,s=0,0,0 then this is the same as giving them log a, i.e., the usual log-scoring rule. If p=1-3h and q,r,s=h,h,h then this is (1-3h) log (a/(1-3h)) + h log(b/h) + ..., which if we fix a is constant + h (log b + log c + log d) = constant + h log bcd, which by the AM-GM inequality is biggest when b=c=d.

This differs from the "expected log score" I described above only by an additive constant. One way to describe it is: the average amount of information the candidate would gain by adopting your probabilities instead of theirs, the average being taken according to your probabilities.

comment by Bucky · 2019-12-12T21:04:50.342Z · score: 2 (1 votes) · LW(p) · GW(p)

This is really interesting, thanks, not something I'd thought of.

If the teacher (or whoever set the test) also has a spread of credence over the answers then a Bayesian update would compare the values of P(A), P(B|¬A) and P(C|¬A and ¬B) [1] between the students and teacher. This is my first thought about how I'd create a fair scoring rule for this.

[1] P(D|¬A and ¬B and ¬C) = 1 for all students and teachers so this is screened off by the other answers.

comment by Bucky · 2019-12-12T20:41:14.503Z · score: 2 (1 votes) · LW(p) · GW(p)

The score for the 50:50:0:0 student is: $1 - (1-0.5)^2 - 0.5^2 - 0^2 - 0^2 = 0.5$

The score for the 40:20:20:20 student is: $1 - (1-0.4)^2 - 0.2^2 - 0.2^2 - 0.2^2 = 0.52$

I think the way you've done it is Brier's rule, which is (1 - the score from the OP). In Brier's rule the lower value is better.

comment by aaq · 2019-12-11T12:09:13.012Z · score: 0 (2 votes) · LW(p) · GW(p)

I disagree with your first point; I consider the 50:25:25:0 thing to be the point. It's hard to swallow because admitting ignorance rather than appearing falsely confident always is, but that's why it makes for such a good value to train.

comment by Bucky · 2019-12-11T13:53:15.288Z · score: 3 (2 votes) · LW(p) · GW(p)

But if my genuine confidence levels are 50:50:0:0 it seems unfair that I score less than someone whose genuine confidence levels are 50:25:25:0 - we both put the same probability on the correct answer so why do they score more?

comment by philh · 2019-12-17T12:50:57.329Z · score: 4 (2 votes) · LW(p) · GW(p)

I don't think this is a knock-down argument, but:

Suppose the students are subsequently told, by someone whom they trust but who happens to be wrong, that the answer isn't A.

The 50:50:0:0 student says "okay, then it must be B". The 50:25:25:0 student says "okay, then it must be B or C, 50% on each". And the 50:17:17:17 student says "okay, then I don't know".

I don't think these responses are equally good, and I don't think they should be rewarded equally. The second student is more confused by fiction than the first, and the third is more confused again.

That said, to give a concrete example: what is 70*80? Is it 5600, 5400, 56000, or 3? By the way, it's not 5600.

Obviously the best response here is "um, yes it is". But I still feel like someone who gives equal weight to 3 as to 5400 is... either very confident in their skills, or very confused. I think my intuition is that I want to reward that student less than the other two, which goes against both your answer (reward them all equally) and my answer above (reward that student the most).

But I can't really imagine someone honestly giving 50:17:17:17 to that question. Someone who gave equal scores to the last three answers probably gave something like either 100:0:0:0 (if they're confident) or 25:25:25:25 (if they're confused), and gets a higher or lower reward from that. So I dunno what to make of this.

comment by Bucky · 2019-12-17T16:37:28.296Z · score: 2 (1 votes) · LW(p) · GW(p)

This makes sense to me.

I think to do this instead of preferring certain ratios between answers, we should prefer certain answers.

Under the original scoring scheme 50:50:0:0 doesn't score differently from 50:0:50:0 or 50:0:0:50. The average credence for each answer between those 3 is 50:17:17:17 so I'd argue that (without some external marking of which incorrect answers are more reasonable) 50:50:0:0 should score the same as 50:17:17:17.

However we could choose a marking scheme where you get back (using my framing of log scoring above):

100% of the points put on A

10% of the points put on B

10% of the points put on C

0% of the points put on D

That way 50:50:0:0 and 50:25:25:0 both end up with 55% of their points but 50:17:17:17 gets 53.4% and 50:0:0:50 gets 50%. Play around with the percentages to get rewards that seem reasonable - I think it would still be a proper scoring rule*. You could do something similar with a quadratic scoring rule.

*I think one danger is that if I am unsure but I think I can guess what the teacher thinks is reasonable/unreasonable then this might tempt me to alter my score based on something other than my actual credence levels.
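As a rough sketch of the arithmetic above, reading the scheme as a simple linear return on the points placed on each answer (that reading, and the names `points_returned` and `RATES`, are assumptions for illustration; the credences are the rounded figures from the comment):

```python
def points_returned(credences, return_rates):
    """Points returned, given credences (in %) and a per-answer return rate."""
    return sum(c * r for c, r in zip(credences, return_rates))

# Return rates from the comment: 100% on A, 10% on B and C, 0% on D.
RATES = (1.0, 0.1, 0.1, 0.0)

answers = [(50, 50, 0, 0), (50, 25, 25, 0), (50, 17, 17, 17), (50, 0, 0, 50)]
for answer in answers:
    print(answer, round(points_returned(answer, RATES), 1))
```

This reproduces the 55, 55, 53.4 and 50 quoted above. It says nothing about whether the resulting rule is proper; that would need a separate check.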

comment by J.D. · 2019-12-11T20:37:18.968Z · score: 2 (2 votes) · LW(p) · GW(p)

[... why do they score more?]

I'm not sure if these are good reasons, but it seems to me that

1) The expected answer to the quiz does not just consist in identifying A as a correct answer but also in identifying the others as incorrect answers. I mean that the expected right answer is 100:0:0:0 (and not, for example, 100:50:0:0 or whatever else).

2) Giving 25:25 for B:C is better than giving 50:0 even if answer C is 0 since 25:25 is closer to 0:0 than 50:0 (for the usual Euclidean distance). In this perspective, a better answer for the student who gave 50:50:0:0 would have been 50:25:0:0, which is better than 50:25:25:0.

Indeed, 1 - [(1-1/2)^2 + (1/4)^2 + 0^2 + 0^2] > 1 - [(1-1/2)^2 + (1/4)^2 + (1/4)^2 + 0^2] > 1 - [(1-1/2)^2 + (1/2)^2 + 0^2 + 0^2].

3) With this perspective, I am indeed not sure that encouraging the sum of a student's answer to be 100 is a good idea. It seems better (for the student who is answering) to focus on each proposition (i.e., A, B, C or D) separately (related to point 1 of my message). For each proposition, the answer should reflect the person's credence in the fact that the answer is correct/incorrect. Therefore this could also be applied to a multiple-choice quiz with zero or more than one good answer(s).

EDIT (added) :

To sum up what I think could in this case be an answer to your question, I will say that, with the "quadratic scoring rule", if the expected answer for A:B:C:D is 100:0:0:0, then the answer 1) 50:25:0:0 scores more than the answer 2) 50:50:0:0 because they are both right for C and D, they are at the same distance from the expected answer for A, but 1) is closer to the expected answer for B (which is 0) than 2).

The same reasoning works for comparing 1') 50:25:25:0 with 2') 50:50:0:0, except that in this second case, it is the general distance (for the quadratic scoring rule) of 25:25 (for B:C) which is closer to 0:0 than 50:0.
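For anyone who wants to check the inequality numerically, here is a minimal sketch of the quadratic score in the form used above (1 minus the sum of squared differences). The function name `quadratic_score` is only illustrative.

```python
def quadratic_score(answer, truth):
    """1 minus the sum of squared differences between the answer and the truth."""
    return 1 - sum((q - t) ** 2 for q, t in zip(answer, truth))

TRUTH = (1, 0, 0, 0)  # expected answer 100:0:0:0, written as probabilities

score_1 = quadratic_score((0.5, 0.25, 0, 0), TRUTH)     # 50:25:0:0
score_2 = quadratic_score((0.5, 0.25, 0.25, 0), TRUTH)  # 50:25:25:0
score_3 = quadratic_score((0.5, 0.5, 0, 0), TRUTH)      # 50:50:0:0

print(score_1, score_2, score_3)  # 0.6875 0.625 0.5
```

Note that 50:25:0:0 does not sum to 100, which is exactly what point 3) is about.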

comment by Bucky · 2019-12-11T15:46:07.313Z · score: 3 (2 votes) · LW(p) · GW(p)

Maybe 1) is where I have a fundamental difference.

Given evidence A, a Bayesian update considers how well evidence A was predicted.

There is no additional update due to how well ¬A being false was predicted. Even if ¬A is split into sub-categories, it isn't relevant as that evidence has already been taken into account when we updated based on A being true.

r.e. 2) 50:25:0:0 gives a worse expected value than 50:50:0:0 as although my score increases if A is true, it decreases by more if B is true (assuming 50:50:0:0 is my true belief)

r.e. 3) I think it's important to note that I'm assuming that exactly 1 of A or B or C or D is the correct answer. Therefore the probabilities should add up to 100% to maximise your expected score (otherwise it isn't a proper scoring rule).
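The expected-value claim in point 2) can be checked with a short sketch. It assumes the quadratic rule in the form 1 minus the sum of squared differences, and that exactly one answer is correct; the helper names are illustrative.

```python
def quadratic_score(answer, truth):
    """1 minus the sum of squared differences between the answer and the truth."""
    return 1 - sum((q - t) ** 2 for q, t in zip(answer, truth))

def expected_score(answer, belief):
    """Expected quadratic score of `answer` if the true credences are `belief`."""
    n = len(answer)
    total = 0.0
    for i, p in enumerate(belief):
        truth = [1 if j == i else 0 for j in range(n)]  # answer i is the correct one
        total += p * quadratic_score(answer, truth)
    return total

belief = (0.5, 0.5, 0, 0)  # true belief 50:50:0:0
honest = expected_score((0.5, 0.5, 0, 0), belief)
shaded = expected_score((0.5, 0.25, 0, 0), belief)

print(honest, shaded)  # 0.5 0.4375
```

Answering 50:25:0:0 gains 0.1875 when A is true but loses 0.3125 when B is true, so under a 50:50 belief the honest answer comes out ahead.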

comment by aaq · 2019-12-11T23:34:59.316Z · score: 1 (1 votes) · LW(p) · GW(p)

Try to think about this in terms of expected value. On your specific example, they do score more, but this is probabilistic thinking, so we want to think about it in terms of the long run trend.

Suppose we no longer know what the answer is, and you are genuinely 50/50 on it being either A or B. This is what you truly believe, you don't think there's a chance in hell it's C. If you sit there and ask yourself, "Maybe I should do a 50-25-25 split, just in case", you're going to immediately realize "Wait, that's moronic. I'm throwing away 25% of my points on something I am certain is wrong. This is like betting on a 3-legged horse."

Now let's say you do a hundred of these questions, and most of your 50-50s come up correct-ish as one or the other. Your opponent consistently does 50-25-25s, and so they end up more wrong than you overall, because half the time the answer lands on one of their two 25s, not their single 50.

It's not a game of being more correct, it's a game of being less wrong.

comment by Bucky · 2019-12-12T00:12:09.237Z · score: 2 (1 votes) · LW(p) · GW(p)

I think all of this is also true of a scoring rule based on only the probability placed on the correct answer?

In the end you'd still expect to win but this takes longer (requires more questions) under a rule which includes probabilities on incorrect answers - it's just adding noise to the results.
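A small sketch of the first point, assuming the rule in question is the logarithmic one (score = log of the probability placed on the correct answer). The function name and the three-option setup are illustrative only.

```python
import math

def expected_log_score(answer, belief):
    """Expected log of the probability placed on the correct answer.

    Options with zero belief are skipped; putting 0 on an option you
    believe possible would raise an error here (log of zero).
    """
    return sum(p * math.log(q) for p, q in zip(belief, answer) if p > 0)

belief = (0.5, 0.5, 0.0)  # genuinely 50/50 between the first two options
honest = expected_log_score((0.5, 0.5, 0.0), belief)
hedged = expected_log_score((0.5, 0.25, 0.25), belief)

print(honest > hedged)  # True
```

So the "don't bet on the 3-legged horse" argument carries over to this rule as well, at least in expectation.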

comment by ChristianKl · 2019-12-11T13:31:26.845Z · score: 4 (2 votes) · LW(p) · GW(p)

Is the quadratic scoring rule essentially the Brier score that Tetlock uses? If so, I think that's worth noting in your post given that many readers here are likely familiar with Tetlock's work.

comment by Sniffnoy · 2019-12-11T18:17:16.334Z · score: 4 (2 votes) · LW(p) · GW(p)

Yeah, proper scoring rules (and in particular both the quadratic/Brier and the logarithmic examples) have been discussed here a bunch, I think that's worth acknowledging in the post...

comment by matthewp · 2019-12-10T20:01:57.010Z · score: 3 (2 votes) · LW(p) · GW(p)

I like this idea generally.

Here is an elaboration on a theme I was thinking of running in a course:

If they could have a single yes / no question answered on the topic, what should most people ask?

The idea being to get people to start thinking about what the best way to probe for more information is when "directly look up the question's answer" is not an option.

This isn't something that can be easily operationalized on a large scale for examination. It is an exercise that could work in small groups.

One way to operationalize would be to construct the group average distribution, and score the question according to (0.5 - sum(mass of states mapping to true))^2. This only works (easily) for questions like, "Is the IOC in either of Geneva or Lugano?"
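A minimal sketch of that scoring, with a made-up group average distribution (the numbers are hypothetical, and I am assuming a lower value means a better question, since the formula is zero when the question splits the group's probability mass evenly):

```python
def question_penalty(group_distribution, yes_states):
    """(0.5 - probability mass of the states for which the answer is 'yes') ** 2."""
    mass_yes = sum(p for state, p in group_distribution.items() if state in yes_states)
    return (0.5 - mass_yes) ** 2

# Hypothetical group average distribution over the four cities.
group = {"Geneva": 0.3, "Lausanne": 0.4, "Zurich": 0.2, "Lugano": 0.1}

# "Is the IOC in either of Geneva or Lugano?"
penalty = question_penalty(group, {"Geneva", "Lugano"})
print(round(penalty, 4))  # 0.01
```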

comment by aaq · 2019-12-10T17:45:39.112Z · score: 3 (1 votes) · LW(p) · GW(p)

Great post! Simple and useful. For spaced-repetition junkies in the crowd, I created a small Anki deck from this post to help me retain the basics.

27 cards: https://ankiweb.net/shared/info/187030147

comment by ChristianKl · 2019-12-12T09:48:00.814Z · score: 2 (1 votes) · LW(p) · GW(p)

Are there good freely licensed sets of multiple choice questions that an MVP app might use?

comment by Louis Faucon (louis-faucon) · 2019-12-17T18:37:53.258Z · score: 13 (5 votes) · LW(p) · GW(p)

Since I read this post I have implemented this small app:

I make apps only as a hobby, so it is not bug-free, scalable, or great. Feel free to send advice, comments, or requests.

Several similar apps exist, all of which had to solve the difficulty of making a set of interesting questions. I could make a small list if you are interested.

comment by Kontherad · 2020-01-25T15:56:59.629Z · score: 1 (1 votes) · LW(p) · GW(p)

I would repost this as a top-level comment for it to be able to gain visibility. Thanks for building this!

comment by Louis Faucon (louis-faucon) · 2020-02-06T21:27:07.187Z · score: 4 (3 votes) · LW(p) · GW(p)

I was waiting to make the app a bit better first. I made a post out of it today:

https://www.lesswrong.com/posts/7KRWCRBmvLhTMZB5Y/bayes-up-an-app-for-sharing-bayesian-mcq [LW · GW]

comment by jmh · 2019-12-12T17:17:37.327Z · score: 1 (1 votes) · LW(p) · GW(p)

It has been interesting reading the discussion here. I ultimately realized that for the most part this is exactly how I went about taking tests. If I knew the answer to a question I often did not even look at the other answers. When I was not sure I would evaluate the answers to identify what I could rule out and then, sometimes, have to flip the mental coin.

comment by NaiveTortoise (An1lam) · 2019-12-11T21:47:05.057Z · score: 1 (1 votes) · LW(p) · GW(p)

If you really want to try and get traction on this, I'd recommend emailing Andrew Gelman (stats blogger and stats prof at Columbia). He's written previously (I can't seem to find the article unfortunately) about how statisticians should take their own ideas more seriously with respect to education and at the very least I can see him blogging about this.

comment by Pattern · 2019-12-10T05:40:36.231Z · score: 1 (1 votes) · LW(p) · GW(p)

This is an interesting idea. However, the post assumes that 1) there is (or should be) one correct answer, 2) which is of the form: (1, 0, 0, 0) or a permutation thereof, and 3) the material is independent of the system (does not include probability, for example). Implementing this reasonably might require figuring out how to apply the system if that isn't the case, and how to integrate with the underlying material in the event that the material isn't independent of probability.

1. will probably be an issue as a result of errors. (Like a multiple choice question where "the correct answer" is "dirt" and that's 2 of the 4 choices.)

2 and 3 are kind of the same.


Errata:

In other words, for each possibility i, the student loses the square of the distance between his answer qi and the true answer (0% or 100%).

the student loses the sum of the squares of the difference

Or just "the difference" (Euclidean norm).

Credence p4=[1%] in Lugano, but answers q4=0%.

The sum of the probabilities is 1.

this naive scoring incentivizes the exaggeration of beliefs towards deterministic answer.

Choosing the maximum probability, or randomizing if indifferent?


comment by rossry · 2019-12-11T13:51:22.544Z · score: 2 (2 votes) · LW(p) · GW(p)

However, the post assumes that 1) there is (or should be) one correct answer, 2) which is of the form: (1, 0, 0, 0) or a permutation thereof, and 3) the material is independent of the system (does not include probability, for example).

These are assumed for the sake of explanation, but none are necessary; in fact, the scoring rule and analysis go through verbatim if you have questions with multiple answers in the form of arbitrary vectors of numbers, even if they have randomness. The correct choice is still to guess, for each potential answer, your expectation of that answer's realized result.
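A quick numerical illustration of the last sentence, for a single potential answer whose realized result is random. The outcome values and probabilities are hypothetical, and the grid search is only a crude stand-in for the usual calculus argument.

```python
def expected_squared_loss(guess, outcomes):
    """Expected squared difference between the guess and the realized result.

    `outcomes` is a list of (value, probability) pairs.
    """
    return sum(p * (guess - v) ** 2 for v, p in outcomes)

# Hypothetical realized results for one potential answer.
outcomes = [(0.0, 0.3), (1.0, 0.5), (4.0, 0.2)]
expectation = sum(p * v for v, p in outcomes)

# Try guesses 0.00, 0.01, ..., 4.00 and keep the one with the smallest expected loss.
grid = [i / 100 for i in range(401)]
best_guess = min(grid, key=lambda g: expected_squared_loss(g, outcomes))

print(best_guess, round(expectation, 10))  # both come out at 1.3
```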

comment by Sniffnoy · 2019-12-10T00:36:11.454Z · score: -5 (8 votes) · LW(p) · GW(p)

Kind of well-known here, but worth repeating I guess...

comment by ChristianKl · 2019-12-11T20:56:09.566Z · score: 2 (1 votes) · LW(p) · GW(p)

Usually we discussed them in the past for binary choices. I don't remember any previous discussion on using proper scoring rules for multiple choice tests.

comment by Gurkenglas · 2019-12-11T02:52:15.651Z · score: 2 (3 votes) · LW(p) · GW(p)

I agree. Proper scoring rules were introduced to this community 14 years ago.