Rugby & Regression Towards the Mean 2019-10-30T16:36:00.287Z · score: 16 (4 votes)
Age gaps and Birth order: Reanalysis 2019-09-07T19:33:16.174Z · score: 49 (10 votes)
Age gaps and Birth order: Failed reproduction of results 2019-09-07T19:22:55.068Z · score: 63 (16 votes)
What are principled ways for penalising complexity in practice? 2019-06-27T07:28:16.850Z · score: 42 (11 votes)
How is Solomonoff induction calculated in practice? 2019-06-04T10:11:37.310Z · score: 35 (7 votes)
Book review: My Hidden Chimp 2019-03-04T09:55:32.362Z · score: 31 (13 votes)
Who wants to be a Millionaire? 2019-02-01T14:02:52.794Z · score: 29 (16 votes)
Experiences of Self-deception 2018-12-18T11:10:26.965Z · score: 16 (5 votes)
Status model 2018-11-26T15:05:12.105Z · score: 29 (10 votes)
Bayes Questions 2018-11-07T16:54:38.800Z · score: 22 (4 votes)
Good Samaritans in experiments 2018-10-30T23:34:27.153Z · score: 132 (54 votes)
In praise of heuristics 2018-10-24T15:44:47.771Z · score: 44 (14 votes)
The tails coming apart as a strategy for success 2018-10-01T15:18:50.228Z · score: 33 (17 votes)
Defining by opposites 2018-09-18T09:26:38.579Z · score: 19 (10 votes)
Birth order effect found in Nobel Laureates in Physics 2018-09-04T12:17:53.269Z · score: 61 (19 votes)


Comment by bucky on Blog Post Day (Unofficial) · 2020-02-18T20:22:07.870Z · score: 3 (2 votes) · LW · GW

This is a great idea; I’m definitely up for it.

Comment by bucky on How to Lurk Less (and benefit others while benefiting yourself) · 2020-02-17T22:09:38.353Z · score: 3 (2 votes) · LW · GW


Comment by bucky on Bayes-Up: An App for Sharing Bayesian-MCQ · 2020-02-13T23:59:14.608Z · score: 5 (3 votes) · LW · GW

Thanks, I think I get it now.

If I observe 4 heads out of 4 and my prior was uniform across [0,1] then my posterior mode (the MAP estimate) is at 1, and this should definitely be within my error bars. Calculating the mean and adding symmetric error bars doesn't work for asymmetric distributions.

To do this method more accurately you would have to calculate the full posterior distribution across [0,1] and use that to create error bars. Personally I would do this numerically but there may well be an analytical solution someone else will know about.
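As an illustration of the numerical route (my own sketch, not from the original comment; the grid resolution and the 90% equal-tailed interval are assumptions), the 4-heads-out-of-4 case works out like this:

```python
# Uniform prior on [0,1], 4 heads in 4 flips: posterior ∝ p^4.
# Build the posterior on a grid and read 90% error bars off the CDF.
N = 100_000
grid = [(i + 0.5) / N for i in range(N)]      # midpoints of a [0,1] grid
unnorm = [p ** 4 for p in grid]               # likelihood x flat prior
total = sum(unnorm)

def quantile(q):
    """Smallest grid point whose posterior CDF reaches q."""
    acc = 0.0
    for p, w in zip(grid, unnorm):
        acc += w / total
        if acc >= q:
            return p

lo, hi = quantile(0.05), quantile(0.95)
# Analytically the posterior is Beta(5, 1), whose q-quantile is q**(1/5),
# so lo ≈ 0.549 and hi ≈ 0.990 - and p = 1 sits inside the upper error bar.
```

This is the analytical solution alluded to above: a Beta(a, b) posterior from a uniform prior with a−1 successes and b−1 failures.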

Alternatively, a frequentist approach: create error bars on the target percentage, rather than on the percentage achieved.

For each percentage grouping see how many questions had been answered using that percentage. Then use a binomial distribution to calculate the likelihood of each number of correct responses assuming that I am perfectly calibrated. This is essentially calculating a p-value with the null hypothesis being “I am perfectly calibrated”.

For example, say I've answered at 80% confidence 4 times. If I'm perfectly calibrated I have a 0.8^4 = 41% chance of getting them all correct. Correspondingly I have:

0.8^3 x 0.2 x 4 = 41% to get 3 correct

0.8^2 x 0.2^2 x 6 = 15.4% to get 2 correct

0.8 x 0.2^3 x 4 = 2.6% to get 1 correct

0.2^4 = 0.2% to get 0 correct

If I am using a 90% CI (5% - 95%) then getting 0 correct is not inside my interval, nor is getting 1 correct (since 0.2% + 2.6% < 5%), but any of the other results are. So the top of my target error bar would reach to 100% and the bottom would be between 25% and 50%.
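The worked example above can be sketched in a few lines (my own illustration of the comment's method):

```python
from math import comb

# 4 questions answered at 80% confidence; null hypothesis:
# "I am perfectly calibrated", so the number correct is Binomial(4, 0.8).
n, p = 4, 0.8
pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
# pmf ≈ [0.2%, 2.6%, 15.4%, 41%, 41%] for 0..4 correct

# Outcomes in the lower tail whose cumulative probability stays under 5%
# fall outside a 90% interval.
cdf, outside = 0.0, []
for k, q in enumerate(pmf):
    cdf += q
    if cdf < 0.05:
        outside.append(k)
# outside == [0, 1]: getting 0 or 1 correct lies outside the interval.
```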

It is possible to combine all of the answers to create a single p-value across all percentages but this gets more complicated.

(Of course the error bars would have zero width at 0% and 100% responses, as any failures at these percentages are irrecoverable, but this is right and proper.)

Comment by bucky on Bayes-Up: An App for Sharing Bayesian-MCQ · 2020-02-07T10:50:58.820Z · score: 4 (3 votes) · LW · GW

This is great!

One question is: how are the error bars calculated? From the description I think they are standard errors, but if that's the case then you wouldn't really expect to get all of the black dots within the bars even if you were perfectly calibrated - more like ~68% of dots within 1 SE?

I'm also a bit confused by why, when I get 100% of questions correct when using a certain percentage, 100% isn't within my error bar?

Comment by bucky on Some quick notes on hand hygiene · 2020-02-06T21:53:53.674Z · score: 2 (1 votes) · LW · GW

This one?

I hadn’t realised this was an issue but I’m definitely going to be remembering this in future!

Comment by bucky on Hello, is it you I'm looking for? · 2020-02-05T10:28:34.247Z · score: 3 (2 votes) · LW · GW

I'm not sure this is an exact match to your question but it sounds like maybe what you're looking for is something like Solomonoff induction.

In Bayes the subjectivity comes from choosing priors. Solomonoff induction includes an objective way to calculate the priors (see also Kolmogorov complexity). Unfortunately it isn't actually computable - I was asking a kind of similar question last year which has some answers about this.

I asked a follow-up question regarding complexity whose answers were super useful to my understanding of these kinds of things - particularly the sequence which johnswentworth wrote.

Comment by bucky on Protecting Large Projects Against Mazedom · 2020-02-04T21:28:39.587Z · score: 5 (2 votes) · LW · GW

That limit is usually about six direct reports per supervisor.

I think it’s important to say that if companies aimed for this limit then they would be very different places. In the upper-middle echelons where I work the average is around 3. This introduces an additional 2 layers vs having 6 direct reports.

I also suspect that the 6 limit makes sense for mazes. If you are in a non-maze and working hard on solutions 3-7 then this can be considerably higher. In a previous role I had 8 direct reports and still spent the vast majority of my time on object level work because managing people was easy - people enjoyed work and were committed to it so managing them required less input. I think 12-15 direct reports would have been tricky but achievable.

You will make mistakes; objective criteria (however flawed) will allow your boss to "fix the problem" by changing the criteria instead of by firing you.

I don’t think having subjective criteria makes you more likely to get fired - you just wouldn’t get into the problem in the first place unless you actually were underperforming. This of course assumes trust but in a non-maze this is less of an issue.

You will upset people with your hiring, promotion, and firing decisions; having an objective basis for your actions is extremely helpful to avoid or minimize lawsuits.

I think this is a good point, especially as you get bigger. Maybe there are creative ways to avoid this - the old zappos offer of $2000 to leave after a week comes to mind as a way to choose the right staff without objective criteria.

Comment by bucky on Potential Ways to Fight Mazes · 2020-01-29T21:03:01.954Z · score: 2 (1 votes) · LW · GW
This post considers eight potential ways to fight mazes in general

You're underselling yourself - you have 2 "Solution 5"s :)

Comment by bucky on Assortative Mating And Autism · 2020-01-29T10:18:02.570Z · score: 11 (2 votes) · LW · GW
It seems like p(highly analytical|on autism spectrum) is pretty high, but p(on autism spectrum|is highly analytical) might be much lower.

If we take:

p(on autism spectrum) = 2.5%

And defining:

p(highly analytical) = 10% (i.e. the top 10% most analytical people)

Then we get (via good ol' Bayes):

p(highly analytical|on autism spectrum) = 4 x p(on autism spectrum|is highly analytical)

For simplicity's sake, let:

p(highly analytical|on autism spectrum) = 80%


p(on autism spectrum|is highly analytical) = 20%

Which means that the probability that 2 randomly selected highly analytical people are both on the autism spectrum is 1 in 25 (compared to 1 in 1,600 in the general population).

1204 x 0.3 = 360 couples with children in the sample were both defined as highly analytical, so we have ~15 couples where both are on the autism spectrum. This may explain why any effect is too small to detect.

Obviously these are made up numbers and the actual boundaries of the categories are more fluid than that but this does seem like a plausible explanation.
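Plugging the made-up numbers into Bayes (a sketch; every value here is an assumption carried over from the comment, not data):

```python
p_as = 0.025          # assumed p(on autism spectrum)
p_ha = 0.10           # assumed p(highly analytical): top 10%
p_ha_given_as = 0.80  # assumed for simplicity's sake

# Bayes: p(AS | HA) = p(HA | AS) * p(AS) / p(HA)
p_as_given_ha = p_ha_given_as * p_as / p_ha   # 0.20

both_if_analytical = p_as_given_ha ** 2       # 0.04, i.e. 1 in 25
both_in_general = p_as ** 2                   # 0.000625, i.e. 1 in 1,600
```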

Comment by bucky on 2018 Review: Voting Results! · 2020-01-24T23:26:40.785Z · score: 4 (2 votes) · LW · GW

Thanks, I was trying to work out a simple way to calculate how many times it would be worthwhile subtracting extra 1s - making the average close to 0 is a helpful simple rule.

Comment by bucky on 2018 Review: Voting Results! · 2020-01-24T09:12:24.508Z · score: 14 (7 votes) · LW · GW

I just realised my own voting (and I suspect that of most people) was inefficient.

Once I decided on all of my votes I should have decreased all of the votes by 1, including putting a -1 vote on any that I had previously been neutral on (ignoring the off-by-one error thing for the moment).

This wouldn't have changed the net effect of my vote but would have given me extra points to spend (the small cost of paying for negative results would have been more than offset by the large benefit from decreasing the positive votes).

I think most other people made the same mistake (well, it's a mistake if voting effect size was a high priority rather than, say, speed) due to the large number of neutral votes (~71%) and the ratio between positive vs negative votes (4.8 : 1) although both of these might have been affected somewhat by the off-by-one correction.

Comment by bucky on The Road to Mazedom · 2020-01-19T00:09:25.530Z · score: 6 (3 votes) · LW · GW
19. Occasionally an organization can successfully lower its maze level and change its culture, but this is expensive and rare heroic behavior. Usually this requires a bold leader and getting rid of a lot of people, and the old organization is effectively replaced with a new one, even if the name does not change. A similar house cleaning happens more naturally in the other direction when and as maze levels rise. 

I don't think there has been enough turnover of staff for this to actually be effective. However I would say that the difference between those executives hired before and after his arrival is noticeable. The number of levels has been reduced somewhat (10 -> 8 ish) and the structure simplified. As a result I think it's now fairly well known who you need to speak to if you actually want anything to get done.

So yeah, I'd say there's been progress but the culture of the organisation is still held back by those who are comfortable in the maze.

I should say that the company isn't nearly as bad as the worst described in the sequence but there are certainly departments within the company which feel very maze-like.

Comment by bucky on Conversational Cultures: Combat vs Nurture (V2) · 2020-01-18T21:46:49.332Z · score: 7 (3 votes) · LW · GW

Just to keep this up-to-date, I think V2 of this post addresses my concerns and I consider this an excellent fit for the 2018 review.

Comment by bucky on The Road to Mazedom · 2020-01-18T21:06:06.123Z · score: 5 (3 votes) · LW · GW

This post matches very strongly with my experiences both in a growing company attempting to resist becoming maze-like and in a larger company with a new CEO attempting to reduce maze-structure.

For the points re: nations, do you have examples or are they just inferences (which, to be fair, seem reasonable)?

Comment by bucky on A method for fair bargaining over odds in 2 player bets! · 2020-01-13T00:14:59.717Z · score: 2 (1 votes) · LW · GW

Zvi made a reference post to the Kelly Criterion a while back which might be a good starting point.

Comment by bucky on A method for fair bargaining over odds in 2 player bets! · 2020-01-12T00:45:53.719Z · score: 3 (2 votes) · LW · GW

This is an interesting idea. Essentially you penalise dishonesty by making the pot smaller.

This works provided neither player can predictably have the lower maximum bet and then increase that maximum (and therefore the overall pot) while simultaneously misrepresenting their believed odds.

Did you consider using the Kelly Criterion for the mini-bets instead of using a flat rate? I’m not sure how this would affect the result but I suspect it might have some nice properties.
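For reference, a minimal sketch of how Kelly sizing could apply to a single mini-bet (my own illustration - the comment only raises the idea; `b` is the net odds offered and `p` the bettor's believed win probability):

```python
def kelly_fraction(p, b):
    """Kelly fraction of bankroll to stake: f* = (b*p - (1 - p)) / b.

    Clamped at zero - a negative Kelly fraction means don't bet.
    """
    return max(0.0, (b * p - (1 - p)) / b)

# e.g. believing 60% at even odds (b = 1) stakes 20% of the bankroll:
f = kelly_fraction(0.6, 1.0)   # 0.2
```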

Comment by bucky on 2020's Prediction Thread · 2020-01-10T17:07:37.012Z · score: 2 (1 votes) · LW · GW

I recognised the humour and was responding in kind - specifically that if we are destroyed by aliens then I’m unlikely to be in a position to pay you what I owe...

Comment by bucky on Voting Phase of 2018 LW Review · 2020-01-09T23:18:38.882Z · score: 4 (2 votes) · LW · GW

Working perfectly now :)

Comment by bucky on Voting Phase of 2018 LW Review · 2020-01-09T21:29:05.905Z · score: 2 (1 votes) · LW · GW

The right hand side of the interface isn't working for me - the scroll bar looks like it's scrolling down but the text doesn't move. I'm using Firefox.

Comment by bucky on Conversational Cultures: Combat vs Nurture (V2) · 2020-01-09T21:19:17.416Z · score: 4 (2 votes) · LW · GW

FWIW, I agree that it is good/important for mods to be able to state their own opinions freely. My only worry was that a book form of the review might lose this nuance if this is not stated explicitly.

Comment by bucky on 2020's Prediction Thread · 2020-01-08T21:16:19.824Z · score: 5 (1 votes) · LW · GW

I’m willing to place a large bet on 14 at 1000:1

If we are not destroyed by aliens then you owe me $1,000,000, if we are all destroyed by aliens then I owe you $1,000,000,000.

Comment by bucky on 2020's Prediction Thread · 2020-01-08T12:14:47.738Z · score: 2 (1 votes) · LW · GW

How are you defining “hand”?

Obviously this beats humans for speed but I guess you’re thinking of something which is general purpose and Rubik’s cube is just a test of dexterity?

Comment by bucky on Becoming Unusually Truth-Oriented · 2020-01-04T21:01:49.954Z · score: 2 (1 votes) · LW · GW

What seemed like a memory was actually an inference.

This is a great phrasing of something I notice happening to me.

My best solution so far is to notice that this might be a problem and admit it (even when I feel confident I’m right). This weakens the link between my social standing and whether the memory is correct, which lowers my bias towards remembering self-advantageously.

Comment by bucky on Might humans not be the most intelligent animals? · 2019-12-24T23:59:55.030Z · score: 5 (3 votes) · LW · GW

That’s really interesting. You can actually try one of the tasks that was used yourself.

Part of it seems to be that chimps are able to perform the task super fast - I can do the 9 number task on easy, on medium I can do it ok-ish and think if I kept practicing I’d be fairly consistent, but I don’t even have time to take all the numbers in on chimp speed.

I’m also not sure what to make of it. One possibility would be that chimps have an incredible short term memory (something like photographic) but that humans doing the same task have to rely on working memory. That would explain the speed at which they can take in all of the information.

Comment by bucky on Might humans not be the most intelligent animals? · 2019-12-24T14:35:58.240Z · score: 5 (3 votes) · LW · GW

Whilst I don’t think the thesis rests on it, it seemed like the strongest (and most surprising to me) evidence if it were true. It actually does provide some evidence that the gap isn’t as large as one might think.

You might want to edit the OP for anyone who doesn’t get round to reading the comments.

Comment by bucky on Might humans not be the most intelligent animals? · 2019-12-24T07:51:50.162Z · score: 16 (9 votes) · LW · GW

One example is the chimpanzee, which may have better working memory

This isn’t what the linked paper says. It claims that

Chimps + practice > humans


Humans + practice >> chimps + practice

From the abstract:

There is no evidence for a superior or qualitatively different spatial memory system in chimpanzees.

Comment by bucky on (Feedback Request) Quadratic voting for the 2018 Review · 2019-12-22T00:09:23.399Z · score: 4 (2 votes) · LW · GW

Throughout the OP my main question was what does Jameson think about this. It felt a bit odd to me that a specific voting method was being advocated without at least some of his input.

Comment by bucky on (Feedback Request) Quadratic voting for the 2018 Review · 2019-12-21T23:57:19.145Z · score: 2 (1 votes) · LW · GW

Slightly nitpicky:

The update from the prior isn’t quite right here. I would have to consider what probability I would have assigned to you having the opinion outlined in your comment if the idea was bad vs if the idea was good.

As you’re only saying 50% confidence it’s hard to distinguish good from bad so an update would probably be of a lesser magnitude and would naively not update in either direction. My actual update would be away from the extremes - it probably isn’t amazing but it probably isn’t terrible.

Comment by bucky on Counterfactual Mugging: Why should you pay? · 2019-12-19T13:55:53.310Z · score: 3 (2 votes) · LW · GW

There's an app for that

Comment by bucky on Is Causality in the Map or the Territory? · 2019-12-18T21:11:16.395Z · score: 4 (2 votes) · LW · GW

There would be an analogous example in hydraulics where positive displacement pumps are constant flow (~current) sources and centrifugal pumps are constant(ish*) pressure (~voltage) sources. The resistor would be a throttle.

In this case it is the underlying physical nature of the types of pumps which causes the effect rather than a feedback loop.

*At least at lower flow rates.

Comment by bucky on Bayesian examination · 2019-12-17T16:37:28.296Z · score: 2 (1 votes) · LW · GW

This makes sense to me.

I think to do this instead of preferring certain ratios between answers, we should prefer certain answers.

Under the original scoring scheme 50:50:0:0 doesn't score differently from 50:0:50:0 or 50:0:0:50. The average credence for each answer between those 3 is 50:17:17:17 so I'd argue that (without some external marking of which incorrect answers are more reasonable) 50:50:0:0 should score the same as 50:17:17:17.

However we could choose a marking scheme where you get back (using my framing of log scoring above):

100% of the points put on A

10% of the points put on B

10% of the points put on C

0% of the points put on D

That way 50:50:0:0 and 50:25:25:0 both end up with 55% of their points but 50:17:17:17 gets 53.4% and 50:0:0:50 gets 50%. Play around with the percentages to get rewards that seem reasonable - I think it would still be a proper scoring rule*. You could do something similar with a quadratic scoring rule.

*I think one danger is that if I am unsure but I think I can guess what the teacher thinks is reasonable/unreasonable then this might tempt me to alter my score based on something other than my actual credence levels.
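The partial-credit numbers above can be checked directly (a sketch using the percentages from the comment; the 100%/10%/10%/0% weights are the proposed ones, not fixed):

```python
# Returned fraction: 100% of points on A, 10% on B, 10% on C, 0% on D.
weights = [1.0, 0.1, 0.1, 0.0]

def returned(credences):
    """Fraction of points returned for a credence split over A, B, C, D."""
    return sum(w * c for w, c in zip(weights, credences))

r_5050 = returned([0.50, 0.50, 0.00, 0.00])   # 0.55
r_5025 = returned([0.50, 0.25, 0.25, 0.00])   # 0.55
r_5017 = returned([0.50, 0.17, 0.17, 0.16])   # 0.534
r_allD = returned([0.50, 0.00, 0.00, 0.50])   # 0.50
```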

Comment by bucky on Conversational Cultures: Combat vs Nurture (V2) · 2019-12-16T16:41:46.987Z · score: 13 (3 votes) · LW · GW

Most people who commented on this post seemed to recognise it from their experience and get a general idea of what the different cultures look like (although some people differ on the details, see later). This is partly because it is explained well but also because I think the names were chosen well.

Here are a few people saying that they have used/referenced it: 1, 2, 3 plus me.

From a LW standpoint thinking about this framing helps me to not be offended by blunt comments. My family was very combat culture but in life in general I find people are unwilling to say “you’re wrong” so it now comes as a bit of a shock. Now when someone says something blunt on LW I just picture it being said by my older brother and realise that probably no offense is meant.

Outside of LW, this post has caused me to add a bit into my induction of new employees at work. I encourage a fairly robust combat culture in my department but I realise that some people aren’t used to this so I try to give people a warning up front and make sure they know that no offense is meant.


There were a few examples in the comments where it seemed like the distinction between the two cultures wasn’t quite as clear as it could be.

Ruby updated the original distinction into two dimensions in a later comment – the “adversarial-collaborative” dimension and the “emotional concern and effort” dimension. The central combat culture was “adversarial + low emotional effort” and nurture was “collaborative + high emotional effort”. However there are cultures which fit in the other 2 possible pairings and the original framing suppressed that somewhat.

I personally would like to see a version of the OP which includes that distinction and think that it would likely be a good fit for the 2018 review. Short of making that distinction, the OP allows for people to fit, say, “collaborative + low emotional effort” into either the nurture or combat culture category. If the combat-nurture distinction is to be used as common knowledge then I worry that this will cause confusion between people with different interpretations.

My other worry about including this in the 2018 review is its implicit claim about what the default should be. If the post claims that nurture culture should be the default, does that then suggest this is how LW should be? This matters even more given that the post is by a member of the LW team.

Finally, in my mind, if this post (or a version of it) were included in the 2018 review then it would benefit from including something like the excellent section which Ruby later moved into the comments. For the post it made sense to move it to the comments, but it would be a shame to miss it out entirely in the review.

Comment by bucky on Historical mathematicians exhibit a birth order effect too · 2019-12-16T11:26:49.911Z · score: 19 (6 votes) · LW · GW

I was going to write a longer review but I realised that Ben’s curation notice actually explains the strengths of this post very well so you should read that!

In terms of including this in the 2018 review I think this depends on what the review is for.

If the review is primarily for the purpose of building common knowledge within the community then including this post maybe isn’t worth it as it is already fairly well known, having been linked from SSC.

On the other hand if the review process is at least partly for, as Raemon put it:

“I want LessWrong to encourage extremely high quality intellectual labor.”

Then this post feels like an extremely strong candidate.

(Personal footnote: This post was essentially what converted me from a LessWrong lurker to a regular commenter/contributor - I think it was mainly just being impressed with how thorough it was and thinking that's the kind of community I'd like to get involved with.)

Comment by bucky on Moloch feeds on opportunity · 2019-12-13T22:04:53.410Z · score: 4 (2 votes) · LW · GW

I think that’s a good explanation. I agree that the solution to Akrasia I describe is kind of hacked together and is far from ideal. If you have a better solution to this I would be very interested and it would change my attitude to status significantly. I suspect that this is the largest inferential gap you would have to cross to get your point across to me, although as I mentioned I’m not sure how central I am as an example.

I’m not sure suffering is the correct frame here - I don’t really feel like Akrasia causes me to suffer. If I give in then I feel a bit disappointed with myself but the agent which wants me to be a better person isn’t very emotional (which I think is part of the problem). Again there may be an inferential gap here.

Comment by bucky on Moloch feeds on opportunity · 2019-12-13T12:51:57.541Z · score: 2 (1 votes) · LW · GW

This is a beautifully succinct way of phrasing it. I still have enough deontologist in me to feel a little dirty every time I do it though!

Comment by bucky on Moloch feeds on opportunity · 2019-12-13T10:47:13.055Z · score: 15 (10 votes) · LW · GW

Firstly, I'm with you on your model of status and the availability of perceived opportunity for additional status in a hyper-connected world is really interesting.

Where I have a big disagreement is in the lesson to take from this. Your argument is that we should essentially try to turn off status as a motivator. I would suggest it would be wiser to try to better align status motivations with the things we actually value.

I struggle hugely with akrasia. If I didn't have some external motivation then I'd probably just lie in bed all day watching tv. I don't know if I'm unusually susceptible to this but my impression is that this is a fairly common problem, even if to a lesser extent in some.

One of my solutions to this is to deliberately do things for the sake of status. Rather, I look for opportunities where me getting more status aligns with me doing things which I think are good.

As an example, take karma on LessWrong. This isn't completely analogous to status but every time I get karma I feel a little (or sometimes big!) boost of self-worth. If writing on LessWrong is aligned with my values then this is a good thing. If you add in a cash prize from someone respected in the community then my status circuit is triggered significantly to motivate me to write an answer even if the actual size of the cash prize doesn't justify the amount of time put in! [1] I could try to fight against this and not allow status triggers but I don't think that would actually improve my self-actualisation.

In a non LW context, if status in the eyes of my family is important, I won't just spend my time watching tv but will also spend time playing with my kids. I would play with my kids anyway as I know it's the right thing to do and is fun but on those occasions where tv is more appealing, listening to my status motivation can help me do the right thing while expending less will-power. [2]

On a practical level I'm not sure that trying to ban status motivations is practical. As you point out a status high is readily achievable elsewhere so if opportunities for status are banned within one community then this would just subconsciously motivate me to look elsewhere.

[1] This isn't a complaint!

[2] I am aware that confessing to this in most places would be seen as a huge social faux pas, I'm hoping LW will be more understanding.

Comment by bucky on Birth order effect found in Nobel Laureates in Physics · 2019-12-12T21:38:06.523Z · score: 18 (6 votes) · LW · GW

This is a review of my own post.

The first thing to say is that for the 2018 Review Eli’s mathematicians post should take precedence because it was him who took up the challenge in the first place and inspired my post. I hope to find time to write a review on his post.

If people were interested (and Eli was ok with it) I would be happy to write a short summary of my findings to add as a footnote to Eli’s post if it was chosen for the review.


This was my first post on LessWrong and looking back at it I think it still holds up fairly well.

There are a couple of things I would change if I were doing it again:

  • Put less time into the sons vs daughters thing. I think this section could have two thirds of it chopped out without losing much.
  • Unnamed’s comment is really important in pointing out a mistake I was making in my final paragraph.
  • I might have tried to analyse whether it is a firstborn thing vs an earlyborn thing. In the SSC data it is strongly a firstborn thing and if I combined Eli and my datasets I might be able to confirm whether this is also the case in our datasets. I’m not sure if this would provide a decisive answer as our sample size is much smaller even when combining the sets.
Comment by bucky on Bayesian examination · 2019-12-12T21:04:50.342Z · score: 2 (1 votes) · LW · GW

This is really interesting, thanks, not something I'd thought of.

If the teacher (or whoever set the test) also has a spread of credence over the answers then a Bayesian update would compare the values of P(A), P(B|¬A) and P(C|¬A and ¬B) [1] between the students and teacher. This is my first thought about how I'd create a fair scoring rule for this.

[1] P(D|¬A and ¬B and ¬C) = 1 for all students and teachers so this is screened off by the other answers.

Comment by bucky on Bayesian examination · 2019-12-12T20:41:14.503Z · score: 2 (1 votes) · LW · GW

The score for the 50:50:0:0 student is:

The score for the 40:20:20:20 student is:

I think the way you've done it is the Brier score, which is (1 - the score from the OP). Under the Brier score the lower value is better.

Comment by bucky on Bayesian examination · 2019-12-12T00:12:09.237Z · score: 2 (1 votes) · LW · GW

I think all of this is also true of a scoring rule based on only the probability placed on the correct answer?

In the end you'd still expect to win but this takes longer (requires more questions) under a rule which includes probabilities on incorrect answers - it's just adding noise to the results.

Comment by bucky on Unknown Knowns · 2019-12-11T23:48:27.954Z · score: 33 (7 votes) · LW · GW

Tldr; I don’t think that this post stands up to close scrutiny although there may be unknown knowns anyway. This is partly due to a couple of things in the original paper which I think are a bit misleading for the purposes of analysing the markets.

The unknown knowns claim is based on 3 patterns in the data:

“The mean prediction market belief of replication is 63.4%, the survey mean was 60.6% and the final result was 61.9%. That’s impressive all around.”

“Every study that would replicate traded at a higher probability of success than every study that would fail to replicate.”

“None of the studies that failed to replicate came close to replicating, so there was a ‘clean cut’ in the underlying scientific reality.”

Taking these in reverse order:

Clean cut in results

I don’t think that there is as clear a distinction between successful and unsuccessful replications as stated in the OP:

"None of the studies that failed to replicate came close to replicating"

This assertion is based on a statement in the paper:

“Second, among the unsuccessful replications, there was essentially no evidence for the original finding. The average relative effect size was very close to zero for the eight findings that failed to replicate according to the statistical significance criterion.”

However this doesn’t necessarily support the claim of a dichotomy – the average being close to 0 doesn’t imply that all the results were close to 0, nor that every successful replication passed cleanly. If you ignore the colours, this graph from the paper suggests that the normalised effect sizes are more of a continuum than a clean cut (the central panel, b, is the relevant chart).

Eyeballing that graph, there is 1 failed replication which nearly succeeded and 4 successful which could have failed. If the effect size shifted by less than 1 S.D. (some of them less than 0.5 S.D.) then the success would have become a failure or vice-versa (although some might have then passed at stage 2). [1]

Monotonic market belief vs replication success

Of the 5 replications noted above, the 1 which nearly passed was ranked last by market belief, the 4 which nearly failed were ranked 3, 4, 5 and 7. If any of these had gone the other way it would have ruined the beautiful monotonic result.

According to the planned procedure [1], the 1 study which nearly passed replication should have been counted as a pass as it successfully replicated in stage 1 and should not have proceeded to stage 2 where the significance disappeared. I think it is right to count this as an overall failed replication but for the sake of analysing the market it should be listed as a success.

Having said that, the pattern is still a very impressive result which I look into below.

Mean market belief

The OP notes that there is a good match between the mean market belief of replication and the actual fraction of successful replications. To me this doesn’t really suggest much by way of whether the participants in the market were under-confident or not. If they were to suddenly become more confident then the mean market belief could easily move away from the result.

If the market is under-confident, it seems like one could buy options in all the markets trading above 0.5 and sell options in all the ones below and expect to make a profit. If I did this then I would buy options in 16/21 (76%) of markets and would actually increase the mean market belief away from the actual percentage of successful replications. By this metric becoming more confident would lower accuracy.

In a similar vein, I also don’t think Spearman coefficients can tell us much about over/under-confidence. Spearman coefficients are based on rank order so if every option on the market became less/more confident by the same amount, the Spearman coefficients wouldn’t change.
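To see why rank-based statistics can’t detect a uniform confidence shift, here is a minimal stdlib-only Spearman implementation (no ties assumed; the beliefs and results below are made up for illustration) showing that a strictly increasing transformation of the beliefs – such as shrinking every belief towards 50% – leaves the coefficient unchanged:

```python
def spearman(xs, ys):
    """Spearman correlation = Pearson correlation of the ranks
    (assumes no ties, for simplicity)."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

beliefs = [0.9, 0.6, 0.3, 0.8]          # hypothetical market beliefs
results = [1.2, 0.8, 0.1, 0.05]         # hypothetical normalised effect sizes
shrunk = [0.5 + 0.5 * (p - 0.5) for p in beliefs]  # everyone less confident

spearman(beliefs, results) == spearman(shrunk, results)  # True: ranks unchanged
```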

Are there unknown knowns anyway?

Notwithstanding the above, the graph in the OP still looks to me as though the market is under-confident. If I were to buy an option in every study with market belief >0.5 and sell in every study <0.5 I would still make a decent profit when the market resolved. However it is not clear whether this is a consistent pattern across similar markets.

Fortunately the paper also includes data on 2 other markets (success in stage 1 of the replication based on 2 different sets of participants) so it is possible to check whether these markets were similarly under-confident. [2]

If I performed the same action of buying and selling depending on market belief I would make a very small gain in one market and a small loss in the other. This does not suggest that there is a consistent pattern of under-confidence.
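As a concrete sketch (with made-up beliefs and outcomes – the real numbers are in the paper’s data), the buy/sell rule above can be written as follows, where an option pays out 1 if the study replicates and is priced at the market belief:

```python
def strategy_profit(beliefs, replicated, threshold=0.5):
    """Buy an option (pays 1 if the study replicates) priced at the market
    belief p whenever p > threshold; sell whenever p < 1 - threshold.
    With threshold=0.6 this skips the uncertain 40-60% range.
    Returns total profit at market resolution."""
    profit = 0.0
    for p, success in zip(beliefs, replicated):
        if p > threshold:
            profit += (1.0 if success else 0.0) - p  # bought at price p
        elif p < 1.0 - threshold:
            profit += p - (1.0 if success else 0.0)  # sold at price p
    return profit

# Hypothetical example: two markets resolve as predicted, one doesn't
strategy_profit([0.8, 0.3, 0.7], [True, False, False])
```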

It is possible to check for calibration across the markets. I split the 63 market predictions (3 markets x 21 studies) into 4 groups depending on the level of market belief, 50-60%, 60-70%, 70-80% and 80-100% (any market beliefs with p<50% are converted to 1-p for grouping).

For beliefs of 50-60% confidence, the market was correct 29% of the time. Across the 3 markets this varied from 0-50% correct.

For beliefs of 60-70% confidence, the market was correct 93% of the time. Across the 3 markets this varied from 75-100% correct.

For beliefs of 70-80% confidence, the market was correct 78% of the time. Across the 3 markets this varied from 75-83% correct.

For beliefs of 80-100% confidence, the market was correct 89% of the time. Across the 3 markets this varied from 75-100% correct.
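The grouping procedure above can be sketched as below (the beliefs and outcomes here are hypothetical – the paper’s per-market data would be needed to reproduce the percentages just given):

```python
def calibration_buckets(beliefs, outcomes, edges=(0.5, 0.6, 0.7, 0.8, 1.0)):
    """Fold each belief p < 0.5 to 1 - p (a confident 'will not replicate'
    prediction counts as correct when the study fails to replicate), then
    report the fraction of correct predictions per confidence bucket."""
    buckets = {(lo, hi): [] for lo, hi in zip(edges, edges[1:])}
    for p, replicated in zip(beliefs, outcomes):
        conf = p if p >= 0.5 else 1.0 - p
        hit = replicated if p >= 0.5 else not replicated
        for (lo, hi), hits in buckets.items():
            if lo <= conf < hi or (hi == edges[-1] and conf == hi):
                hits.append(hit)
    return {b: sum(hits) / len(hits) for b, hits in buckets.items() if hits}

# Hypothetical: a 55% yes (right), a 45% yes i.e. 55% no (right), a 90% yes (right)
calibration_buckets([0.55, 0.45, 0.9], [True, False, True])
```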

One could claim that market beliefs in the 50-60% range are genuinely uncertain, but that everything above 60% should be adjusted up to at least 75%, perhaps something like an 80-85% chance.

If I perform the same buying/selling that I discussed previously but set my limit to 0.6 instead of 0.5 (i.e. don’t buy or sell in the range 40%-60%) then I would make a tidy profit in all 3 markets.

But I’m not sure whether I’m completely persuaded. Essentially there is only one range which differs significantly from the market being well calibrated (p=0.024, two-tailed binomial). If I adjust for multiple hypothesis testing this is no longer significant. There is some Bayesian evidence here but not enough to completely persuade me.
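For reference, an exact two-tailed binomial test can be computed with the stdlib as below; the counts behind the p=0.024 figure come from the market data, so the numbers in the example are purely illustrative:

```python
from math import comb

def two_tailed_binomial_p(k, n, p):
    """Exact two-tailed binomial test: sum the probabilities of every
    outcome no more likely than the observed count k under Binomial(n, p)."""
    def pmf(i):
        return comb(n, i) * p**i * (1 - p)**(n - i)
    p_obs = pmf(k)
    # small tolerance guards against float noise at the boundary
    return sum(pmf(i) for i in range(n + 1) if pmf(i) <= p_obs * (1 + 1e-9))

# e.g. 2 correct predictions out of 10 against a fair-coin null
two_tailed_binomial_p(2, 10, 0.5)  # ≈ 0.109
```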


I don’t think the paper in question provides sufficient evidence to conclude that there are unknown knowns in predicting study replication. It is good to know that we are fairly good at predicting which results will replicate but I think the question of how well calibrated we are remains an open topic.

Hopefully the replication markets study will give more insights into this.


[1] The replication was performed in 2 stages. The first was intended to have a 95% chance of detecting an effect size of 75% of the original finding. If the study replicated in stage 1, it stopped there and was ticked off as a successful replication. Those that didn’t replicate in stage 1 proceeded to stage 2, where the sample size was increased in order to have a 95% chance of detecting effect sizes at 50% of the original finding.

[2] Fig 7 in the supplementary information shows the same graph as in the OP but based on Treatment 1 market beliefs, which relate to stage 1 predictions. This still looks quite impressively monotonic. However, the colouring is misleading for analysing market success: it relates to success after stage 2 of the replication, but the market was predicting stage 1. If this is corrected then the graph looks a lot less monotonic, flipping the results for Pyc & Rawson (6th), Duncan et al. (8th) and Ackerman et al. (19th).

Comment by bucky on Bayesian examination · 2019-12-11T23:29:30.712Z · score: 2 (1 votes) · LW · GW

I have done some credence training but I think my instincts here are more based on Maths and specifically Bayes (see this comment).

I think the zero probability thing is a red herring - replace the 0s with ε and the 50s with 50−ε and you get basically the same thing. There are some questions where keeping track of the ε just isn't worth it.

A proper scoring rule is designed to reward both knowledge and accurate reporting of credences. This is achieved if we score based on the correct answer, whether or not we also score based on the probabilities of the wrong answers.

If we also attempt to optimise for certain ratios between credences of different answers then this is at the expense of rewarding knowledge of the correct answer.

If Alice has credence levels of 50:50:0:0 and Bob has 40:20:20:20 and the correct answer is A then Bob will get a higher score than Alice despite her putting more of her probability mass on the correct answer.
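This can be checked under the multi-answer quadratic (Brier-style) rule 2·p_correct − Σ p_i², one scoring rule that also penalises the credences placed on wrong answers (a sketch, not necessarily the exact rule proposed in the post):

```python
def quadratic_score(probs, correct):
    """Multi-answer quadratic (Brier-style) score: 2 * p_correct - sum(p_i^2)."""
    return 2 * probs[correct] - sum(p * p for p in probs)

alice = [0.5, 0.5, 0.0, 0.0]
bob = [0.4, 0.2, 0.2, 0.2]

quadratic_score(alice, 0)  # 0.5
quadratic_score(bob, 0)    # ≈ 0.52 - Bob beats Alice despite less mass on A
```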

Do you consider this a price worth paying to reward having particular ratios between credences?

Comment by bucky on Bayesian examination · 2019-12-11T15:46:07.313Z · score: 3 (2 votes) · LW · GW

Maybe 1) is where I have a fundamental difference.

Given evidence A, a Bayesian update considers how well evidence A was predicted.

There is no additional update due to how well ¬A being false was predicted. Even if ¬A is split into sub-categories, it isn't relevant as that evidence has already been taken into account when we updated based on A being true.

r.e. 2) 50:25:0:0 gives a worse expected value than 50:50:0:0 as although my score increases if A is true, it decreases by more if B is true (assuming 50:50:0:0 is my true belief)

r.e. 3) I think it's important to note that I'm assuming that exactly 1 of A or B or C or D is the correct answer. Therefore that the probabilities should add up to 100% to maximise your expected score (otherwise it isn't a proper scoring rule).
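The expected-value claim in 2) can be checked directly under the log rule (equivalently, multiplying the probabilities placed on correct answers), with a true belief of 50:50:0:0 over the four answers:

```python
from math import log

def expected_log_score(report, truth):
    """Expected log score of a reported distribution when outcomes are
    drawn from `truth` (terms with zero truth probability drop out)."""
    return sum(t * log(r) for t, r in zip(truth, report) if t > 0)

truth = [0.5, 0.5, 0.0, 0.0]

expected_log_score([0.5, 0.5, 0.0, 0.0], truth)   # ≈ -0.693 (honest report)
expected_log_score([0.5, 0.25, 0.0, 0.0], truth)  # ≈ -1.040, strictly worse
```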

Comment by bucky on Bayesian examination · 2019-12-11T14:32:33.334Z · score: 2 (1 votes) · LW · GW


50:50:0:0 says it's a coin toss between A and ¬A. If ¬A then B.

50:25:25:0 says it's a coin toss between A and ¬A. If ¬A then it's a coin toss between B and C.

Why should the scoring rule care about what my rule is for ¬A when A is the correct answer?

I'm genuinely curious - I notice you're the second person to voice this opinion but I can't get my head round it at all.

(As with my reply to aaq, this all assumes that these are genuine confidence levels)

Comment by bucky on Bayesian examination · 2019-12-11T13:53:15.288Z · score: 3 (2 votes) · LW · GW

But if my genuine confidence levels are 50:50:0:0 it seems unfair that I score less than someone whose genuine confidence levels are 50:25:25:0 - we both put the same probability on the correct answer, so why do they score more?

Comment by bucky on Bayesian examination · 2019-12-11T12:21:34.489Z · score: 3 (2 votes) · LW · GW

I think the 100k drop analogy may be misleading when thinking about the final result. The final score in the version I envisage is judged on ratios between results, rather than absolute values (my explanation maybe isn't clear enough on this). In that case putting everything on the answer which you have 60% confidence in and being right gives a ratio of 1.67 in your favour over an honest reporting. But if you do it and get it wrong then there is an infinite ratio in favour of the honest reporting.

Comment by bucky on Bayesian examination · 2019-12-11T07:21:28.964Z · score: 2 (1 votes) · LW · GW

This is true if scores from different questions are added but not if they are multiplied. Linear scoring with multiplication is exactly the same as log scoring with addition, just easier to visualise (at least to me)

Comment by bucky on Bayesian examination · 2019-12-10T22:39:00.672Z · score: 5 (3 votes) · LW · GW

Great post, I'd be really interested to hear how this goes down with students.

I would be cautious about using information from incorrect answers to calculate the score - just using the percentage given for the correct answer still gives a proper scoring rule. If percentages placed on incorrect answers are included then you get 50:25:25:0 giving more points than 50:50:0:0 when the answer is A which I think people might find hard to swallow.

For a proper scoring rule I find a particular framing of a log score to be intuitive - instead of adding the logs of the probabilities placed on the correct answers, just multiply out the probabilities.

This can be visualised as having a heap of points and having to spread them all across the 4 possible answers. You lose the points that were placed on the wrong answers and then use your remaining points to repeat the process for the next question. Whoever has the most points left at the end has done the best. The £100k drop is a game show which is based on this premise.

I personally find this to be an easy visualisation with the added benefit that the scores have a specific Bayesian interpretation - the ratio of students’ scores represent the likelihood function of who knows the subject best based on the evidence of that exam.
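As a quick sketch of the equivalence (with made-up per-question probabilities), multiplying out the probabilities placed on the correct answers gives the same ordering as summing their logs:

```python
from math import log, prod

# Probability a student placed on the correct answer, per question
student = [0.6, 0.5, 0.9]

points_left = prod(student)               # "heap of points" framing: ≈ 0.27
log_score = sum(log(p) for p in student)  # standard log score

# Same ranking either way: the log of the product is the sum of the logs
abs(log(points_left) - log_score) < 1e-12  # True
```

The ratio of two students’ remaining points is then exactly the likelihood ratio for who knows the subject best, given the exam as evidence.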

Comment by bucky on Symbiotic Wars · 2019-12-04T09:57:39.818Z · score: 5 (3 votes) · LW · GW

This reminds me of the SSC post toxoplasma of rage (see especially section V).

Comment by bucky on Could someone please start a bright home lighting company? · 2019-12-04T09:41:02.998Z · score: 3 (2 votes) · LW · GW

I think this is likely to be orders of magnitude away from the kinds of things which have been effective for others (see e.g. this rough calculation on reddit)