Let's stop making "Intelligence scale" graphs with humans and AI
post by Expertium (lavrov-andrey) · 2025-05-09T16:01:33.655Z · LW · GW · 14 comments
You've probably seen this:
or this:
Or something similar to these examples.
Let's stop making and spreading these.
Recently, I asked Gemini 2.5 Pro to write a text with precisely 269 words (I even specified that spaces and punctuation don't count as words), and it gave me a text with 401 words. Of course, there are lots of other examples where LLMs fail in surprising ways [LW · GW], but I like this one because it's super simple. At the same time, Gemini can write Python code, speaks dozens of languages, and can most likely beat me at GeoGuessr. Yet it sucks at Pokemon.
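For anyone who wants to reproduce this, here is a minimal sketch of one way to check a reply's length (it assumes the simple convention that a "word" is any whitespace-separated chunk, so punctuation doesn't count separately):

```python
def word_count(text: str) -> int:
    # Split on whitespace; punctuation attached to a word counts as part of that word.
    return len(text.split())

reply = "..."  # paste the model's output here
print(word_count(reply))  # Gemini's reply came out to 401 words instead of 269
```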
This suggests that AI is developing in ways that are deeply inhuman. Can you imagine a human who can write you Python code, then Rust code, then write you a letter in German, then write you a letter in Japanese... and then ~~cannot beat~~* takes hundreds of hours to beat Pokemon (even when you're practically holding his hand during every step), can't count the number of words in the text that he just wrote, can't write a story without mixing up character names and ages after the first 10 pages, and can't order pizza? Can you even imagine a hypothetical environment where a human could grow up to become like that? Even if some comic-book mad scientist wanted to create such a human on purpose by raising him in a "The Truman Show"-esque dome where everyone is a paid actor, I still don't think he could succeed.
Nothing like this exists in nature. There is no way to put humans (or animals, for that matter) and AI on the same scale in a coherent way. At least not if the scale has only one dimension.
I think most people, including myself, were expecting that AI (LLMs in particular) would progress at the same rate across all tasks. If that were the case, then putting humans and AI on the same scale would make sense. But we weren't expecting that AI would be comparable to humans, or even better than humans, at some tasks while simultaneously being utterly hopeless at other (even closely related!) tasks.
*edit: Gemini has actually finished Pokemon, which I didn't realize when writing this post. My bad.
14 comments
Comments sorted by top scores.
comment by Steven Byrnes (steve2152) · 2025-05-10T13:21:03.461Z · LW(p) · GW(p)
Have you seen the classic parody article “On the Impossibility of Supersized Machines”?
I think it’s possible to convey something pedagogically useful via those kinds of graphs. They can be misinterpreted, but this is true of many diagrams in life. I do dislike the “you are here” dot that Tim Urban added recently.
Replies from: lavrov-andrey
↑ comment by Expertium (lavrov-andrey) · 2025-05-10T14:16:35.545Z · LW(p) · GW(p)
> Have you seen the classic parody article “On the Impossibility of Supersized Machines”?
No, lol. That's a good one.
comment by RogerDearnaley (roger-d-1) · 2025-05-13T04:41:53.291Z · LW(p) · GW(p)
Try speaking aloud for precisely 269 words. You're not allowed to count or recite poetry — you have to do this while actually extemporizing something interesting to say.
Now bear in mind that an LLM doesn't output letters or words; it uses tokens. In order to count words, an LLM has to memorize which tokens contain spaces and which don't. So for it, the task is comparable to asking a human to speak until they've said 269 words that begin with the letter T.
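To make the token/word mismatch concrete, here is a rough sketch using OpenAI's tiktoken tokenizer as a stand-in (Gemini uses a different tokenizer, but the mismatch between tokens and words is the same kind of problem):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by several OpenAI models
sentence = "Counting words is surprisingly awkward when you only ever see tokens."

tokens = enc.encode(sentence)
pieces = [enc.decode([t]) for t in tokens]

print(f"{len(sentence.split())} words vs. {len(tokens)} tokens")
print(pieces)  # token boundaries don't line up neatly with word boundaries
```

The model never sees the words themselves, only the token IDs, so "exactly 269 words" has to be tracked indirectly.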
comment by bhauth · 2025-05-09T23:29:32.294Z · LW(p) · GW(p)
The main problem with this post is that it assumes "AI" is a monolithic thing that can't include different systems in the future. LLMs can translate from English to German. AlphaZero can beat the best human players at Go. Different systems can do different things. ChatGPT can't accurately evaluate the result of JavaScript programs, but if you set it up to run Node.js, it suddenly can.
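A minimal sketch of what "set it up to run Node.js" can look like (assuming Node.js is installed locally; the point is that the model is handed the program's real output instead of having to predict it):

```python
import subprocess

def run_javascript(source: str) -> str:
    # Execute the snippet with Node.js and return whatever it printed.
    result = subprocess.run(
        ["node", "-e", source],
        capture_output=True,
        text=True,
        timeout=10,
    )
    return result.stdout if result.returncode == 0 else result.stderr

# Feed the LLM the actual result rather than asking it to simulate the interpreter.
print(run_javascript("console.log([1, 2, 3].map(x => x * x))"))  # prints: [ 1, 4, 9 ]
```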
comment by Raphael Roche (raphael-roche) · 2025-05-09T21:53:10.632Z · LW(p) · GW(p)
That's true, but not specific to AI. Where do you place a chimp on the scale? Well below humans? OK, now consider an athletic man who is really good at brachiation, and a random chimp. It's not just that the chimp has more strength or agility in its arms. I'll bet that the chimp will be definitely better at cognitive tasks like determining a good path across the trees, which branches to avoid, etc. The chimp would also likely surpass most humans at recognizing certain plants, fruits, etc.
And what about autistic savants like the man who inspired Rain Man, or the twins John and Michael mentioned by Oliver Sacks in his book 'The Man Who Mistook His Wife for a Hat'? Where are we supposed to place them on the curve?
I've also read that bees can count up to 4 or 5, which surpasses some mammals or human children before the age of 2 or 3. Where do the bees go on the curve?
Intelligence is likely not something that can be plotted on a simple curve. This could actually be advantageous for AI safety. Foom might be avoided if misaligned AIs have uneven cognitive capabilities and occasionally make significant errors in judgment.
comment by frontier64 · 2025-05-10T14:13:11.539Z · LW(p) · GW(p)
How does AI being good at some tasks and worse at others make the graph you posted not a good tool for explaining FOOM or increasing AI capabilities?
comment by Afterimage · 2025-05-10T06:15:17.951Z · LW(p) · GW(p)
I find this graph useful. I think you can agree that, at some point, AI will be more intelligent than humans, even if AI intelligence is quite different and lacking in a few (fewer every year) areas. If that's the case, then this graph is quite effective at conveying that this point may be reached soon.
comment by sam b · 2025-05-09T17:10:31.362Z · LW(p) · GW(p)
I've generally found it much harder over time to find "examples where LLMs fail in surprising ways". If you test o3 (released the day after that post!) on the examples they chose, it does much better than previous models. And I've just tried it on your "269 words" task, which it nailed.
Replies from: lavrov-andrey
↑ comment by Expertium (lavrov-andrey) · 2025-05-09T17:36:08.346Z · LW(p) · GW(p)
To be clear, I'm not claiming that the "write a text with precisely X words" task is super-duper-mega-hard, and I wouldn't be surprised if a new frontier model was much better at it than Gemini. I have a very similar opinion to the author of this post [LW · GW]: I'm saying that given what the models currently can do, it's surprising that they also currently can't (reliably) do a lot of things. I'm saying that there are very sharp edges in models' capabilities, much sharper than I expected. And the existence of very sharp edges makes it very difficult to compare AI to humans on a one-dimensional intelligence scale, because instead of "AI is 10 times worse than humans at everything", it's "AI is roughly as good as expert humans at X and useless at Y".
Replies from: Richard_Kennaway, sam b
↑ comment by Richard_Kennaway · 2025-05-09T19:03:11.175Z · LW(p) · GW(p)
> instead of "AI is 10 times worse than humans at everything", it's "AI is roughly as good as expert humans at X and useless at Y".
How long before it's "AI is out of sight of expert humans at X and merely far above them at Y"?
Replies from: lavrov-andrey
↑ comment by Expertium (lavrov-andrey) · 2025-05-09T19:25:32.572Z · LW(p) · GW(p)
Well, if we extrapolate from the current progress, soon AI will be superhumanly good at complex analysis and group theory while only being moderately good at ordering pizza.
That's why I think that comparing AI to humans on a one-dimensional scale doesn't work well.
Replies from: sam b
↑ comment by sam b · 2025-05-09T19:34:50.772Z · LW(p) · GW(p)
If you extrapolate further, do you think the one-dimensional scale works well to describe the high-level trend (surpassing human abilities broadly)?
Trying to determine if the disagreement here is "AI probably won't surpass human abilities broadly in a short time" or "even if it does, the one-dimensional scale wasn't a good way to describe the trend".
↑ comment by Expertium (lavrov-andrey) · 2025-05-09T19:51:30.753Z · LW(p) · GW(p)
The latter.
↑ comment by sam b · 2025-05-09T19:32:35.408Z · LW(p) · GW(p)
I agree that AI capabilities are spiky and developed in an unusual order. And I agree that because of this, the single-variable representation of intelligence is not very useful for understanding the range of abilities of current frontier models.
At the same time, I expect the jump from "Worse than humans at almost everything" to "Better than humans at almost everything" will be <5 years, which would make the single-variable representation work reasonably well for the purposes of the graph.
I think these "examples of silly mistakes" have not held up well at all. When old examples stop working, that is often blamed on "training around the limitations"; however, in the case of the linked post, we got a model the very next day that performed much better.
And almost every benchmark and measurable set of capabilities has rapidly improved (in some cases beyond human experts).
"We too often give wrong answers to questions ourselves to be justified in being very pleased at such evidence of fallibility on the part of the machines. Further, our superiority can only be felt on such an occasion in relation to the one machine over which we have scored our petty triumph."
Alan Turing, Computing Machinery and Intelligence
1950