When I first saw this ranking, I was surprised to see that North Koreans have such an impressive track record, especially when you factor in their relatively small population. One possible interpretation is that East Asians are just particularly good at mathematics, just like in the stereotypes, even when they live in one of the world's worst dictatorships.
But I don't believe that. In fact, I believe North Koreans are, on average, particularly bad at math. More than 40% of the population is undernourished. Many of the students involved in the IMOs grew up in the 1990s, during the March of Suffering, when hundreds of thousands of North Koreans died of famine. That is not exactly the best context to learn mathematics, not to mention the direct effect of nutrients on the brain. There do not seem to be many famous North Korean mathematicians either (there is actually a candidate from the North Korean IMO team who managed to escape during the 2016 Olympiads in Hong Kong. He is now living in South Korea. I hope he becomes a famous mathematician). Thus, realistically, if all 18-year-olds from North Korea were to take a math test, they would probably score much worse than their South Korean neighbors. And yet, Best Korea reaches almost the same score with only half the source population. What is their secret?
This piece on the current state of mathematics in North Korea gives it away:
“The entire nation suffered greatly during and after the March of Suffering, when the economy collapsed. Yet, North Korea maintained its educational system, focusing on the gifted and special schools such as the First High Schools to preserve the next generation. The limited resources were concentrated towards gifted students. Students were tested and selected at the end of elementary school.”
In that second interpretation, the primary concern of the North Korean government is to produce a few very brilliant students every year, who will bring back medals from the Olympiads and make the country look good. The rest of the population's skills at mathematics are less of a concern.
When we receive new information, we update our beliefs to keep them compatible with the new observations, doing an informal version of Bayesian updating. Before learning about the North Korean IMO team, my prior beliefs were something like "most of the country is starving and their education is mostly propaganda, there is no way they can be good at math". After seeing the IMO results, I had to update. In the first interpretation, we update the mean – the average math skill is higher than I previously thought. In the second interpretation, we leave the mean untouched, but we make the upper tail of the distribution heavier. Most North Koreans are not particularly good at math, but a few of them are heavily nurtured for the sole purpose of winning medals at the IMO. As we will see later in this article, this problem has some pretty important consequences for how we understand society, and those who ignore it might make pretty bad policy decisions.
But first, let's break it apart and see how it really works. There will be a few formulas, but nothing that can hurt you, I promise. Consider a probability distribution where the outcome x happens with probability p(x). For any integer n, the formula below gives what we call the nth moment of a distribution, centered on μ.
$$m_n = \sum_x (x - \mu)^n \, p(x)$$
To put it simply, moments describe how things are distributed around a center. For example, if a planet is rotating around its center of mass, you can use moments to describe how its mass is distributed around it. But here I will only talk about their use in statistics, where each moment encodes one particular characteristic of a probability distribution. Let's sketch some plots to see what it is all about.
First moment: replace n with 1 and μ with 0 in the previous formula. We get
$$m_1 = \sum_x x \, p(x)$$
which is – surprise – the definition of the mean. Changing the first moment just shifts the distribution towards higher or lower values, while keeping the same shape.
Second moment: for n = 2, we get
$$m_2 = \sum_x (x - \mu)^2 \, p(x)$$
If we set μ to be (arbitrarily, for simplicity) equal to the mean, we obtain the definition of the variance! The second moment around the mean describes how values are spread away from the average, while the mean remains constant.
Third moment (n = 3): the third moment essentially describes how skewed (asymmetric) the distribution is.
Fourth moment (n = 4): this describes how leptokurtic or platykurtic your distribution is, that is, how extreme the extreme values are.
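To make this concrete, here is a minimal Python sketch of the moment formula for a discrete distribution. The function name `moment` and the fair-die example are my own illustrative choices, not part of any library.

```python
def moment(outcomes, probs, n, mu=0.0):
    """n-th moment about mu: the sum of (x - mu)**n * p(x) over all outcomes."""
    return sum((x - mu) ** n * p for x, p in zip(outcomes, probs))

# Hypothetical example: a fair six-sided die.
outcomes = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6

mean = moment(outcomes, probs, 1)               # first moment about 0
variance = moment(outcomes, probs, 2, mu=mean)  # second moment about the mean
third = moment(outcomes, probs, 3, mu=mean)     # vanishes for a symmetric distribution
fourth = moment(outcomes, probs, 4, mu=mean)

# Standardized versions, which is what "skewness" and "kurtosis" usually refer to.
skewness = third / variance ** 1.5
kurtosis = fourth / variance ** 2  # a normal distribution has kurtosis 3

print(mean, variance, skewness, kurtosis)
```

The die comes out symmetric (skewness of essentially zero) and platykurtic (kurtosis below 3), since it has no extreme values at all.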
You could go on to higher n, each time bringing in more detail about what the distribution really looks like, until you end up with a perfect description of the distribution. By only mentioning the first few moments, you can describe a population with only a few numbers (rather than infinitely many), but it only gives a simplified version of the true distribution, as on the left graph below:
Say you want to describe the height of humans. As everybody knows, height follows a normal distribution, so you could just give the mean and standard deviation of human height, and get a fairly accurate description of the distribution. But there is always a wise-ass in the back of the room to point out that the normal distribution is defined over R, so for a large enough population, some humans will have a negative height. The problem here is that we only gave information about the first two moments and neglected all the higher ones. As it turns out, humans are only viable within a certain range of height, below or above which people don't survive. This erodes the tails of the distribution, effectively making it more platykurtic (If I can get even one reader to use the word platykurtic in real life, I'll consider this article a success).
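Here is a rough simulation of that erosion, assuming made-up numbers: heights drawn from a normal distribution with mean 170 cm and standard deviation 10 cm, and a "viable" range of 150 to 190 cm. None of these values are real data; they only illustrate the direction of the effect.

```python
import random

def kurtosis(xs):
    """Fourth central moment divided by the squared variance (3 for a normal)."""
    n = len(xs)
    m = sum(xs) / n
    var = sum((x - m) ** 2 for x in xs) / n
    fourth = sum((x - m) ** 4 for x in xs) / n
    return fourth / var ** 2

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Assumed parameters, purely illustrative.
heights = [rng.gauss(170.0, 10.0) for _ in range(100_000)]
viable = [h for h in heights if 150.0 <= h <= 190.0]  # clip both tails

k_full = kurtosis(heights)
k_viable = kurtosis(viable)
print(k_full, k_viable)
```

The clipped sample should show a kurtosis clearly below that of the full sample, which is what "more platykurtic" means here.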
Let's come back to the remarkable scores of North Koreans at the Math Olympiads. What these scores teach us is not that North Korean high-schoolers are really good at math, but that many of the high-schoolers who are really good at math are North Koreans. On the distribution plots, it would translate to something like this:
With North Koreans in purple and another country that does worse in the IMOs (say, France), in black. So you are looking at the tails and trying to infer something about the rest of the distribution. Recall the plots above. Which one could it be?
Answer: just by looking at the extreme values, you cannot possibly tell, because any of these plots would potentially match. In Bayesian terms, each moment of the distribution has its own prior, and when you encounter new information, you could in principle update any of them to match the new data. So how can we make sure we are not updating the wrong moment? When you have a large representative sample that reflects the entire distribution, this is easy. When you only have information about the top 10 extreme values, it is impossible. This is unfortunate because the extreme values are precisely what gets all our attention – most of what we see in the media is about the most talented athletes, the most dishonest politicians, the craziest people, the most violent criminals, and so forth. Thus, when we hear new information about extreme cases, it's important to be careful about which moment to update.
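A small numerical sketch of why the tails alone cannot settle it. Everything here is an assumption for illustration: skill is measured in units of a baseline standard deviation, "IMO level" is placed at 4, and the two candidate explanations are a mean shifted by 1 or a standard deviation widened to 4/3.

```python
from statistics import NormalDist

threshold = 4.0  # hypothetical "IMO level", in baseline standard deviations

baseline = NormalDist(0.0, 1.0)
mean_shift = NormalDist(1.0, 1.0)         # interpretation 1: higher mean
wider_spread = NormalDist(0.0, 4.0 / 3.0)  # interpretation 2: same mean, more spread

def share_above(dist, x):
    """Fraction of the population scoring above x."""
    return 1.0 - dist.cdf(x)

tail_baseline = share_above(baseline, threshold)
tail_mean_shift = share_above(mean_shift, threshold)
tail_wider = share_above(wider_spread, threshold)

# The typical person looks very different under the two interpretations.
median_mean_shift = mean_shift.inv_cdf(0.5)
median_wider = wider_spread.inv_cdf(0.5)

print(tail_baseline, tail_mean_shift, tail_wider)
print(median_mean_shift, median_wider)
```

Both alternatives put the same share of people above the threshold, far more than the baseline does, yet they disagree completely about the median person. The match is exact only at this one threshold, but a handful of extreme observations is not enough to tell the shapes apart further out.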
This problem also occurs in reverse – in the same way looking at the tails doesn't tell you anything about the average, looking at the average doesn't tell you anything about the tails. An example: in a typical year, more Americans die from falling than from viral infections. So one could argue that we should dedicate more resources to preventing falls than viral infections. Except the number of deaths from falls is fairly stable (you will never have a pandemic of people starting to slip in their bathtubs 100 times more than usual). On the other hand, virus transmission is a multiplicative process, so most outbreaks will be mostly harmless (remember how SARS-CoV-1 killed fewer than 1,000 people, those were the days) but a few of them will be really bad. In other words, yearly deaths from falls have a higher mean than deaths from viruses, but since the latter are highly skewed and leptokurtic, they might deserve more attention. (For a detailed analysis of this, just ask Nassim Taleb.)
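To see how an additive and a multiplicative process differ, here is a toy simulation. The numbers are invented: each simulated "year" combines 20 random factors drawn uniformly between 0.5 and 1.5, either by adding them (falls) or by multiplying them (outbreaks). It is not a model of real mortality data.

```python
import math
import random
from statistics import median

def skewness(xs):
    """Third central moment divided by variance ** 1.5."""
    n = len(xs)
    m = sum(xs) / n
    var = sum((x - m) ** 2 for x in xs) / n
    third = sum((x - m) ** 3 for x in xs) / n
    return third / var ** 1.5

rng = random.Random(1)  # fixed seed for reproducibility
YEARS, STEPS = 20_000, 20

# Additive process: many small independent contributions summed up.
additive = [sum(rng.uniform(0.5, 1.5) for _ in range(STEPS)) for _ in range(YEARS)]

# Multiplicative process: each step scales the running total.
multiplicative = [math.prod(rng.uniform(0.5, 1.5) for _ in range(STEPS))
                  for _ in range(YEARS)]

skew_add = skewness(additive)
skew_mult = skewness(multiplicative)

# How much worse is the worst year than a typical year?
ratio_add = max(additive) / median(additive)
ratio_mult = max(multiplicative) / median(multiplicative)

print(skew_add, skew_mult)
print(ratio_add, ratio_mult)
```

The additive series stays close to symmetric and its worst year is only slightly above the median, while the multiplicative series is strongly right-skewed, with a worst year many times larger than a typical one.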
There are a lot of other interesting things to say about the moments of a probability distribution, like the deep connection between them and the partition function in statistical thermodynamics, or the fact that in my drawings the purple line always crosses the black line exactly n times. But these are for nerds, and it's time to move on to the secret topic of this article. Let's talk about SEX AND VIOLENCE.
A recent meta-analysis by Thöni et al. sheds some light on the question. Published in the journal Psychological Science, it synthesizes 23 studies (with >8000 participants) about gender differences in cooperation. In such studies, participants play cooperation games against each other. These games are essentially a multiplayer, continuous version of the Prisoner's Dilemma – players can choose to be more or less cooperative, with possible strategies ranging from total selfishness to total selflessness.
So, in cooperation games, we expect women to cooperate more often than men, right? After all, women are socialized to be caring, supportive and empathetic, while men are taught to be selfish and dominant, aren't they? To find out, Thöni et al. aligned all of these studies on a single cooperativeness scale, and compared the scores of men and women. Here are the averages, for three different game variants:
This is strange. On average, men and women are just equally cooperative. If society really allows men to behave selfishly, it should be visible somewhere in all these studies. I mean, where are all the criminals/rapists/politicians? It's undeniable that most of them are men, right?
The problem with the graph above is that it only shows averages, so it misses the most important information – that men's level of cooperation is much more variable than women's. So if you zoom in on the people who were either very selfish or very cooperative, you find a large majority of men. If you zoom in on people who kind-of cooperated but were also kind-of selfish, you find predominantly women.
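As a back-of-the-envelope sketch of how equal averages can hide this, assume two equally sized groups with normally distributed cooperativeness scores, the same mean, and a standard deviation 20% larger for men. That 1.2 ratio and the cut-offs are hypothetical values chosen for illustration, not figures from Thöni et al.

```python
from statistics import NormalDist

# Assumed distributions: same mean, different spread (illustrative only).
women = NormalDist(0.0, 1.0)
men = NormalDist(0.0, 1.2)

def male_share(men_mass, women_mass):
    """Fraction of men among people in a given score range (equal group sizes)."""
    return men_mass / (men_mass + women_mass)

extreme = 2.5  # hypothetical cut-off for "very cooperative" / "very selfish"
middle = 0.5   # hypothetical half-width of the "kind-of cooperative" band

share_very_cooperative = male_share(1 - men.cdf(extreme), 1 - women.cdf(extreme))
share_very_selfish = male_share(men.cdf(-extreme), women.cdf(-extreme))
share_middle = male_share(men.cdf(middle) - men.cdf(-middle),
                          women.cdf(middle) - women.cdf(-middle))

print(share_very_cooperative, share_very_selfish, share_middle)
```

Under these assumptions, men make up roughly three quarters of both extremes and slightly less than half of the middle band, even though the two averages are identical.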
As I'm sure you've noticed, the title of the Thöni et al. paper says “evolutionary perspective”. As far as I'm concerned, I'm fairly skeptical about evolutionary psychology, since it is one of the fields with the worst track record of reproducibility ever. To be fair, a good part of evpsych is just regular psychology where the researchers added a little bit of speculative evolutionary varnish to make it look more exciting. This aside, real evpsych is apparently not so bad. But that's not the important part of the paper – what matters is that there is increasingly strong evidence that men are indeed more variable than women in behaviors like cooperation. Whether it is due to hormones, culture, discrimination or cultural evolution is up for debate and I don't think the current data is remotely sufficient to answer this question.
(Side note: if you must read one paper on the topic, I recommend this German study where they measure the testosterone level of fans of a football team, then have them play Prisoner's Dilemma against fans of a rival team. I wouldn't draw any strong conclusion from this just yet, but it's a fun read.)
Alas, that's not how stereotyping works. Instead, we see news reports showing all these male criminals, and assume that our society turns men into violent and selfish creatures and call them toxic [Here is Dworkin: “Men are distinguished from women by their commitment to do violence rather than to be victimized by it. Men are rewarded for learning the practice of violence in virtually any sphere of activity by money, admiration, recognition, respect, and the genuflection of others honoring their sacred and proven masculinity.” Remember – in the above study, the majority of unconditional cooperators were men]. Internet people make up a hashtag to ridicule those who complain about the generalization. We see all these male IMO medalists, and – depending on your favorite political tradition – either assume that men have an unfair advantage in maths, or that they are inherently better at it. The former worldview serves as a basis for public policy. The question of which moment to update rarely even comes up.
This makes me wonder whether this process of looking at the extremes then updating our beliefs about the mean is just the normal way we learn. If that is the case, how many other things are we missing?
While the greater male variance hypothesis, and tail effects in general, are always interesting, I'm not sure if it's too illuminating here. It is not surprising that there are some weird outliers at the top of the IMO list, 'weird' in the sense of 'outperforming' what you'd expect given some relevant variable like GDP, intellectual freedom, HDI index, national IQ, or whatever. That's simply what it means for the correlation between IMO scores & that variable to be <1. If the IMO list was an exact rank-order correspondence, then the correlation would =1; but no one would have predicted that, because we know in the real world all such correlations are <1, and that means that some entries must be higher than expected in the list (and some lower). There's always a residual. (This is part of why tests and measures can be gamed, because the latent variable, which is what we're really interested in, is not absolutely identical in every way to the measure itself, and every difference is a gap into which optimizing agents can jam a wedge.)
When North Korea places high despite being an impoverished totalitarian dictatorship routinely struggling with malnutrition and famine, it's just the tails coming apart. If we are curious, we can look for an additional variable to try to explain that residual.
For example, on a lot of economic indexes like GDP, Saudi Arabia places high, despite being a wretched place in many respects; does that mean that whipping women for going out in public is good for economic growth? No, it just means that having the blind idiot luck to be floating on a sea of unearned oil lets you be rich despite your medieval policies and corruption. (Although, as Venezuela demonstrates, even a sea of oil may not be enough if your policies are bad enough.) SA does badly on many variables other than GDP which cannot be so easily juiced with oil revenue by the state.
Similarly, at the Olympics, Warsaw Pact countries infamously won many gold medals & set records. Does that mean the populations were extremely healthy and well-fed and happy? No, illegal doping and hormone abuse and coercion and professionalized state athletics aimed solely at Olympic success probably had something to do with that. Their overperformance disappeared, and they didn't show such overperformance in anything else you might expect to be related to athletics, like non-Olympic sports, popular pro sports/entertainment, or life expectancy.
Or, as respectable as Russian chess players were beforehand, the Russian school of chess, particularly in the Cold War, could never have prospered the way it did without extensive state support (potentially literally, given the accusations of employing espionage techniques and other cheating*), as a heavily-subsidized, propagandized domestically & overseas, professionalized program with lifetime employment, major perks like overseas travel, safety from persecution due to politically-connected patrons, and the sheer lack of much better opportunities elsewhere. But many other areas suffered, and like so many things in the USSR (like the Moscow subway?), the chess served as a kind of Potemkin village.
More recently, Nigeria boasts an unusual amount of Scrabble champions; is Nigeria actually bursting with unrealized potential? Probably not, because they don't dominate any other competitive game such as chess or checkers or poker, or intellectual pursuits in general, and Nigerian Scrabble seems to be path-dependence leading to specialization; you can easily win the annual per capita income of Nigeria at Scrabble tournaments, and there is now a self-sustaining Scrabble community telling you it's a doable career and providing an entryway. Weird, but there's a lot of games and countries out there, and one is always stumbling across strange niches, occupations, and the like which emphasize the role of chance in life.
* see Oscar's comment about NK IMO cheating, which I didn't know about, but am entirely unsurprised by.
North Korea's IMO overperformance looks like it's about the same thing as Soviet chess or Warsaw Pact athletics in general. I don't know what benefits they get (do their families get to change castes, and move to Pyongyang? immunity from prison camps? how useful is the overseas travel to them? is it a feeder into the bubble of the nuclear program? how much financial support and specialized study and tutors do they get?), but I would bet a lot that the relative benefits for a NK kid who wins at the IMO are vastly larger than for a soft suburban kid from a US magnet high school who has never attended a public execution or gone hungry, and at most gets another resume item for college. (I've seen more than one IMO competitor note that IMO is not really reflective of 'real' math, but is its own sort of involuted discipline; always a risk in competitions, and seems to have afflicted the much-criticized Cambridge Old Tripos.) This is what juices the residual: almost all countries exert merely an ordinary endogenous sort of IMO effort, and only a few see it as one of the priorities to invest a maximum effort into. NK, it turns out, sees it as a priority, like building statues, I guess. The only remaining question here about the NK IMO residual is the historical contingency: how did NK happen to make IMO one of its 'things'? Is it merely its typical envy-hatred towards China, because China for its own reasons targeted the IMO?
You can shoehorn this into a distributional argument, but when you don't know which of the moments is changing (mean? SD? skew?), or even what the distribution might be (filtering or selecting from a normal does not yield a normal), I don't find it too helpful and borderline circular. ("Why is NK performance on IMO high? Because their IMO performance distribution has a higher mean. How do we know that? Because their IMO performance is high.") Pointing at the imperfect bivariate correlation and analyzing the possible causes of a residual is much more informative. When you look at the state involvement in IMO, it explains away any apparent contradiction with what you believed about correlations between intellectual achievement and GDP or whatever.
As far as I understand, the tails coming apart and the moment attribution are two different, superimposed problems. The tails coming apart is "Nigeria has the best Scrabble players in the world, but the persons with the richest English vocabulary in the world are probably not Nigerian". The moment attribution is "the best Scrabble players in the world are Nigerian, but Nigerians are probably not the best Scrabble players in the world". In the first case, we are talking about the distribution of country scores for two correlated variables, in the second we are talking about the distribution of individuals within a country for a single variable.
Also, thank you for bringing up Nigerian Scrabble, that would have made a somewhat funnier example than NK's math olympiads.
The tails coming apart is "Nigeria has the best Scrabble players in the world, but the persons with the richest English vocabulary in the world are probably not Nigerian"
No. The tails coming apart here would be "gameplaying of game A correlates with national variable B but the top players of game A are not from the top country on variable B".
I say it's borderline circular because while they aren't the same explanation, they can be made trivially the same depending on how you shuffle your definitions to save the appearances. For example, consider the hypothesis that NK has exactly the same distribution of math talent as every other country of similar GDP, the same mean/SD/etc, but they have a more intense selection process recruiting IMO participants. This is entirely consistent with tails coming apart ("yes, there is a correlation between GDP and IMO, but it's r<1 so we are not surprised to see residuals and overperformance which happens to be NK in this case, which is due to difference in selection process"), but not with the distributional hypothesis - unless we post hoc modify the distribution hypothesis, "oh, I wasn't talking about math talent distributions per se, ha ha, you misunderstood me, I just meant, IMO participant distribution; who cares where that distribution difference comes from, the important thing is that the NK IMO participant distribution is different from the other countries' IMO participant distributions, and so actually this only proves me right all along!"
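A quick sketch of that selection-only hypothesis, under assumptions of my own choosing: math talent is standard normal in every country, and the only difference is how many children get screened (10,000 versus 1,000,000 here, both invented numbers). The helper `median_best_of` is a hypothetical name.

```python
from statistics import NormalDist

talent = NormalDist(0.0, 1.0)  # assumed identical in every country

def median_best_of(pool_size):
    """Median of the best score among pool_size independent draws.

    P(max <= x) = cdf(x) ** pool_size, so the median of the maximum
    solves cdf(x) = 0.5 ** (1 / pool_size).
    """
    return talent.inv_cdf(0.5 ** (1.0 / pool_size))

light_screening = median_best_of(10_000)      # hypothetical casual recruitment
heavy_screening = median_best_of(1_000_000)   # hypothetical nationwide testing

print(light_screening, heavy_screening)
```

The top pick from the larger pool sits about a full standard deviation higher (around 4.8 versus 3.8), with no change at all to the mean or SD of the underlying talent distribution.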
More recently, Nigeria boasts an unusual amount of Scrabble champions; is Nigeria actually bursting with unrealized potential? Probably not, because they don't dominate any other competitive game such as chess or checkers or poker, or intellectual pursuits in general, and Nigerian Scrabble seems to be path-dependence leading to specialization; you can easily win the annual per capita income of Nigeria at Scrabble tournaments, and there is now a self-sustaining Scrabble community telling you it's a doable career and providing an entryway.
It might not just be specialization on Scrabble. English is the official language in Nigeria. I think it's plausible that Nigerian elite English education focuses more strongly on learning a lot of words than US English education.
There are many countries besides Nigeria where English is an official language, elite language, or widely taught. And language proficiency apparently has little to do with Scrabble success at pro levels where success depends on memorizing an obsolete dictionary's words (apparently even including not really-real words, to the point where I believe someone won the French Scrabble world championship or something without knowing any French beyond the memorized dictionary words).
To noodle a bit more about tails coming apart: asymptotically, no matter how large r, the probability of a 'double max' (a country being the top/max on variable A correlated r with variable B also being top/max on B) decreases to 1/n. The decay is actually quite rapid, even with small samples you need r>0.9 to get anywhere.
A concrete example here: you can't get 100%, but let's say we only want a 50% chance of a double-max. And we're considering just a small sample like 192 (roughly the number of countries in the world, depending on how you count). What sort of r do we need? We turn out to need r ~ 0.93! There are not many correlations like that in the social sciences (not even when you are taking multiple measurements of the same construct).
Some R code for Monte Carlo estimates of the necessary r for n = 1-193 & top-p = 50%:
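The same kind of estimate can be sketched in Python. This is an illustrative stand-in rather than the R code referred to above: it assumes a bivariate normal with correlation r, and instead of solving for the necessary r it simply estimates the double-max probability at a few chosen values for n = 192. The function name and the trial count are my own choices.

```python
import math
import random

def double_max_probability(n, r, trials, rng):
    """Estimate P(the unit with the largest A also has the largest B)
    for n draws from a bivariate normal with correlation r."""
    noise_scale = math.sqrt(1.0 - r * r)
    hits = 0
    for _ in range(trials):
        best_a = best_b = -math.inf
        arg_a = arg_b = -1
        for i in range(n):
            a = rng.gauss(0.0, 1.0)
            # B shares a fraction r of A plus independent noise, so corr(A, B) = r.
            b = r * a + noise_scale * rng.gauss(0.0, 1.0)
            if a > best_a:
                best_a, arg_a = a, i
            if b > best_b:
                best_b, arg_b = b, i
        hits += arg_a == arg_b
    return hits / trials

rng = random.Random(42)  # fixed seed for reproducibility
estimates = {r: double_max_probability(192, r, 1000, rng)
             for r in (0.0, 0.5, 0.93, 0.99)}

for r, p in estimates.items():
    print(f"r = {r}: P(double max) ~ {p:.3f}")
```

With only 1,000 trials per value the estimates are noisy, but the pattern should be visible: near 1/n at r = 0, still low at r = 0.5, and only around even odds once r is in the low 0.9s.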
The comparison to chess is maybe more accurate than you think. See stuff like: Beginnings: The first IMO was held in Romania in 1959. It was initially founded for eastern European member countries of the Warsaw Pact, under the USSR bloc of influence, but later other countries participated as well. (source https://en.wikipedia.org/wiki/International_Mathematical_Olympiad) Also classic geometry is (to my knowledge) taught more generally in many eastern European countries (and makes up 1/6-1/3 of the IMO).
Also the note about incentives being larger in North Korea also applies to much of eastern Europe to a lesser degree, where qualifying for the IMO is seemingly enough to get access to any university (source: Saint Petersburg university gave an open offer at Baltic Way (a regional math competition), and I know someone who used something like that to get into Moscow university)
(Romania, Serbia, Poland, Russia, Ukraine and Hungary are the eastern European countries with consistently good results)
A comparison to many Olympic sports also fits here as well. Just look at the success of Bulgaria in weightlifting throughout the 80s and 90s. Strong incentives, culture, coaching, and some cheating all played a role, just as I am guessing they do for IMO success.
It's possible to study very specifically for Olympiad math tests. Anecdote: In USA math Olympiad training camp in the early 2000s, I specifically remember that there was this useful technique that required quite a bit of rote memorization to be able to use maximally well (solving geometry problems by treating the points as complex numbers), and only 2 people out of 25ish bothered to memorize the entire sheet of formulas. I wasn't one of them ... It seemed unpleasant, I was a teen, I didn't care that much.
If South Koreans etc. study N hours motivated by whatever, then I can easily imagine that North Koreans study 100*N hours motivated by life-and-death consequences for them and their families, and having no other classes etc. for years on end, or whatever it is. That could make a big difference.
I feel conflicted about this post. Its central point as I'm understanding it is that much evidence we commonly encounter in varied domains is only evidence about the abundance of extremal values in some distribution of interest, and whether/how we should update our beliefs about the non-extremal parts of the distribution is very much dependent on our prior beliefs or gears-level understanding of the domain. I think this is a very important idea, and this post explains it well.
Also, felt inspired to search out other explanations of the moments of a distribution - this one looks pretty good to me so far.
On the other hand, the men's rights discussion felt out of place to me, and unnecessarily so since I think other examples would be able to work just as well. Might be misjudging how controversial various points you bring up are, but as of now I'd rather see topics of this level of potential political heat discussed in personal blogposts or on other platforms, so long as they're mostly unrelated to central questions of interest to rationalists / EAs.
North Korea were caught cheating in 1991 and given a 15 year ban until 2007. They were also disqualified from the 2010 IMO because of weaker evidence of cheating. Given this, an alternative hypothesis is that they have also been cheating in other years and weren't caught. The adult team leaders at the IMO do know the problems in advance, so cheating is not too hard.
The Danish team leader (the closest source to these events I have talked to) seems to (personally) believe the cheating allegations in 2010 or at least that the evidence was insufficient. Also note their non-participation in 2017 and 2018 for reasons not known to me, and in 2020, likely because the event was online.
As it turns out, humans are only viable within a certain range of height, below or above which people don't survive. This erodes the tails of the distribution, effectively making it more platykurtic (If I can get even one reader to use the word platykurtic in real life, I'll consider this article a success).
Human height distribution is actually fat tailed (I remembered it and quick Googling finds that on Wikipedia).
That's a bold statement! The wiki article has a  and that sounds wild. Typically, if height was heavy-tailed, we would expect the tallest person to be more than twice as big as the second tallest person. But then, Jeff Bezos is not twice as rich as Elon Musk, so it doesn't always work...
Typically, if height was heavy-tailed, we would expect the tallest person to be more than twice as big as the second tallest person.
Not necessarily. You can have fat tails without having them being that fat. In the history of the normal distribution some scientist was studying human height and his test subjects were normally distributed.
It however happens to be the case that dwarfism happens more than the normal distribution would predict.
Height is strongly genomically driven. Currently we can predict it to within ±4 cm based on genetic analysis. You have a bunch of different genes that cause small changes in height and if those genes were all that mattered for height differences, height would be normally distributed the way the outcome of 1000 coin flips is normally distributed. Dwarfism however isn't the result of a bunch of small changes of height from a bunch of different genes but can be caused through a single mutation.
Tails can alternate between fat and thin as you go further out. If heights were normally distributed with the same mean and variance then there would be fewer people above 7ft than there are now, but the tallest man would be taller than the tallest man now.
Yeah, THIS is what I hate in children's books about animals. This cat is the fastest runner? That bird has the sharpest vision? The dinosaur is the tallest one ever? Who even cares? It's not like they compete with other animals we judge / have recorded to be almost as fast etc. We never get to hear about the silvers, much less the bronzes.
Annnd... This cat "can't maintain the speed for long"? Who even cares? It's not like we ever read anything about the variability of the cat's speed, the bird's sharpness of vision...and these aren't even extinct... nobody measures it. Nobody measures it, but lots of people compare their freaking favorite species like there's some kind of Platonic Zoology we're all supposed to agree on.
(At least the invertebrates are simply weird, each one in its own dimension.)