Bayes-Up: An App for Sharing Bayesian-MCQ

post by Louis Faucon (louis-faucon) · 2020-02-06T19:01:20.773Z · score: 43 (16 votes) · LW · GW · 6 comments

Inspired by Lê Nguyên Hoang [LW · GW]'s post on Bayesian Examination [LW · GW], I have been developing (as a hobby) a new app called Bayes-Up (available at: bayes-up.web.app). The app now works well enough to be shared with others. In this post I list a few things you can do with it, since I expect it will spark some interest within the community.

6 comments

Comments sorted by top scores.

comment by Louis Faucon (louis-faucon) · 2020-02-06T21:20:07.791Z · score: 5 (3 votes) · LW(p) · GW(p)

Here you can see a user's calibration graph (available in the app):

https://twitter.com/le_science4all/status/1225498307348377600

And here you can see calibration graphs for some of the app's quizzes:

https://twitter.com/le_science4all/status/1225527782647705600

They clearly show overconfidence in the answers of the participants.

comment by Bucky · 2020-02-07T10:50:58.820Z · score: 4 (3 votes) · LW(p) · GW(p)

This is great!

One question: how are the error bars calculated? From the description I think they are standard errors, but if that's the case then you wouldn't really expect all of the black dots to fall within the bars even if you were perfectly calibrated; more like 70% of dots within 1 SE?

I'm also a bit confused about why, when I get 100% of questions correct at a certain percentage, 100% isn't within my error bar?

comment by Louis Faucon (louis-faucon) · 2020-02-13T16:48:36.097Z · score: 2 (2 votes) · LW(p) · GW(p)

You are right about the proportion of dots within the error bars. This sounds like something I would want to change.

100% is not within the error bar because these are not exactly error bars but Bayesian estimates of where your true probability lies, using a uniform prior between 0% and 100%. If I pick a coin whose probability p of Heads is drawn uniformly between 0% and 100%, then after observing 4 Heads out of 4 throws you should still believe that the probability of Heads is about 83% on average ( = (n_heads + 1) / (n_throws + 2) ), and a central 75% credible interval would not contain the probability 100%.

So you need to provide more evidence that your 100% answers are indeed right 100% of the time. I agree this is confusing, and I want to change it for the better, but I am unsure how.

For each stated probability p, I count the number of times the answer was right and the number of times it was wrong. If anyone has a recommendation on how to compute the top and bottom percentage of the error bars from these counts, I would really appreciate it.
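One possible way to turn the two counts into the top and bottom of a bar, sketched here under the uniform-prior assumption described above: the posterior for the true probability is then Beta(correct + 1, total - correct + 1), and for integer counts its CDF can be written as a binomial tail sum, so no external library is needed. The function names and the choice of an equal-tailed interval are assumptions made for illustration, not the app's actual code.

```python
from math import comb


def posterior_cdf(x, correct, total):
    """CDF at x of the Beta(correct + 1, total - correct + 1) posterior.

    For integer counts this equals P(Binomial(total + 1, x) >= correct + 1).
    """
    n = total + 1
    return sum(
        comb(n, j) * x**j * (1 - x) ** (n - j)
        for j in range(correct + 1, n + 1)
    )


def posterior_quantile(q, correct, total, tol=1e-10):
    """Invert the CDF by bisection (the CDF is increasing in x)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if posterior_cdf(mid, correct, total) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def equal_tailed_interval(correct, total, level=0.75):
    """Bottom and top of a bar that leaves the same mass in each tail."""
    tail = (1 - level) / 2
    return (
        posterior_quantile(tail, correct, total),
        posterior_quantile(1 - tail, correct, total),
    )


def posterior_mean(correct, total):
    """Mean of the posterior under a uniform prior (rule of succession)."""
    return (correct + 1) / (total + 2)


# 4 right answers out of 4: the equal-tailed bar stops short of 100%.
bottom, top = equal_tailed_interval(4, 4, level=0.75)
print(f"mean={posterior_mean(4, 4):.3f} bottom={bottom:.3f} top={top:.3f}")
```

With this construction the bar can never reach 100% after a finite number of answers, because an equal-tailed interval always leaves some mass above its upper end.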

comment by Bucky · 2020-02-13T23:59:14.608Z · score: 5 (3 votes) · LW(p) · GW(p)

Thanks, I think I get it now.

If I observe 4 heads out of 4 and my prior was uniform across [0,1] then the mode of my posterior is at 1, and this should definitely be within my error bars. Calculating the mean and adding symmetric error bars doesn’t work for asymmetric distributions.

To do this method more accurately you would have to calculate the full posterior distribution across [0,1] and use that to create error bars. Personally I would do this numerically but there may well be an analytical solution someone else will know about.
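A minimal numerical sketch of that idea, assuming a uniform prior and a simple evenly spaced grid over [0,1]: evaluate the unnormalised posterior at each grid point, then collect the highest-density points until they hold the requested mass. The function names and the grid size are illustrative assumptions.

```python
def grid_posterior(correct, total, points=10001):
    """Posterior over an evenly spaced grid on [0, 1], uniform prior."""
    xs = [i / (points - 1) for i in range(points)]
    weights = [x**correct * (1 - x) ** (total - correct) for x in xs]
    norm = sum(weights)
    return xs, [w / norm for w in weights]


def highest_density_interval(correct, total, level=0.75, points=10001):
    """Smallest grid interval holding `level` of the posterior mass.

    Grid points are added from most to least probable; for a posterior
    with a single peak the selected points form one contiguous interval.
    """
    xs, probs = grid_posterior(correct, total, points)
    order = sorted(range(points), key=lambda i: probs[i], reverse=True)
    mass, chosen = 0.0, []
    for i in order:
        chosen.append(i)
        mass += probs[i]
        if mass >= level:
            break
    return xs[min(chosen)], xs[max(chosen)]


# 4 heads out of 4: the interval runs up to 1, where the posterior peaks.
bottom, top = highest_density_interval(4, 4, level=0.75)
print(f"bottom={bottom:.3f} top={top:.3f}")
```

Because the interval is built around the densest region rather than around the mean, it stays asymmetric when the posterior is asymmetric, and it includes 100% when every answer in the group was correct.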

Alternatively, a frequentist approach: create error bars on the target percentage, rather than on the percentage achieved.

For each percentage grouping, see how many questions have been answered using that percentage. Then use a binomial distribution to calculate the likelihood of each number of correct responses, assuming that I am perfectly calibrated. This is essentially calculating a p-value with the null hypothesis being “I am perfectly calibrated”.

For example say I’ve answered 80% 4 times. If I’m perfectly calibrated I have a 0.8^4=41% chance of getting them all correct. Correspondingly I have:

0.8^3 x 0.2 x 4 = 41% to get 3 correct

0.8^2 x 0.2^2 x 6 = 15.4% to get 2 correct

0.8 x 0.2^3 x 4 = 2.6% to get 1 correct

0.2^4 = 0.2% to get 0 correct

If I am using a 90% CI (5% - 95%) then getting 0 correct is not inside my interval and neither is getting 1 correct (since 0.2% + 2.6% < 5%), but any of the other results are. So the top of my target error bar would reach 100% and the bottom would be between 25% and 50%.
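The worked example above can be sketched in a few lines, assuming the interval is built by cutting an equal share of probability from each tail of the binomial distribution. The function names are illustrative assumptions.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k correct out of n if each is right with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def calibrated_range(n, p, level=0.90):
    """Lowest and highest number correct that stay inside the central `level` band.

    A count is excluded only while the cumulative probability of it and
    everything more extreme stays at or below (1 - level) / 2.
    """
    tail = (1 - level) / 2
    pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

    low, cumulative = 0, 0.0
    for k in range(n + 1):  # walk up from 0 correct
        cumulative += pmf[k]
        if cumulative > tail:
            low = k
            break

    high, cumulative = n, 0.0
    for k in range(n, -1, -1):  # walk down from n correct
        cumulative += pmf[k]
        if cumulative > tail:
            high = k
            break

    return low, high


# Four answers given at 80% confidence.
for k in range(4, -1, -1):
    print(f"{k} correct: {binomial_pmf(k, 4, 0.8):.4f}")
low, high = calibrated_range(4, 0.8, level=0.90)
print(f"acceptable range: {low} to {high} correct out of 4")
```

Dividing the two counts by the number of answers gives the ends of the bar as percentages; the bar drawn on the target percentage then extends from just below the lowest accepted share to the highest accepted one.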

It is possible to combine all of the answers to create a single p-value across all percentages but this gets more complicated.

(Of course there would be 0 width error bars at 0% and 100% responses as any failures on these percentages are irrecoverable but this is right and proper)

comment by jacobjacob · 2020-02-07T07:14:02.266Z · score: 2 (1 votes) · LW(p) · GW(p)

> So far about 30'000 questions have been answered by about 1'300 users since the end of December 2019.

That's a surprisingly high number of people. Curious where they came from?

comment by Lê Nguyên Hoang (le-nguyen-hoang-1) · 2020-02-07T07:35:43.832Z · score: 7 (6 votes) · LW(p) · GW(p)

I promoted Bayes-up on my YouTube channel a couple of times 😋 (and on Twitter)

https://www.youtube.com/channel/UC0NCbj8CxzeCGIF6sODJ-7A/