Superintelligent AI will make mistakes
post by juggins · 2025-01-30T15:12:50.561Z
[Crossposted from my substack Working Through AI. I'm relatively new to writing about AI safety, so would appreciate feedback. If you would prefer to do so anonymously, I have a feedback form.]
TLDR: Stories about superintelligent AI don’t usually address the idea that it may make catastrophic mistakes. I argue that we should expect these by default. My reasoning is that: (1) gaining intelligence requires taking actions in the world; (2) many of these actions will fail; (3) there is always a harder task to fail at; and (4) generalisation will only go so far to mitigate this. I introduce the idea of an ‘incompetence gap’ between what an AI attempts to do and what it can reliably do, and I conclude that this problem resolves partly to an alignment issue around managing risky behaviour.
In the classical presentation of the alignment problem, superintelligent AI is basically omnipotent. This is a key point you have to internalise to understand writing on the subject. Whatever bright idea you get about why a superintelligence won’t be able to manipulate you or prevent you from shutting it down is dead wrong. It will be smarter than you to the same degree that you are smarter than an ant. In any competition between you, it will be playing 100-dimensional chess, besting you with moves you can’t even conceive of. It won’t just have mastered nanorobotics or whatever, it will have transcended technology, reaching behind the cosmological veil to truer forms of power. Good luck staying in control of the future[1].
I believe that AI surpassing human intelligence is likely and may happen soon. But I think stories of effective omnipotence are skipping over something important. I want to discuss the risks from superintelligent incompetence. While it is obvious that current AI is fallible, I am going to argue that this will remain the case even after it can dramatically outperform humans on all tasks. Regular and consequential mistakes should be expected by default.
How does superintelligence happen?
Fundamental to the superintelligence story is the idea that there is a distinct thing called ‘intelligence’ and that you can turn this parameter up to essentially infinity. A high-profile version roughly follows this pattern[2]:
- Someone, somehow builds an AI that has more intelligence than any human[3].
- The AI uses its greater intelligence to recursively self-improve: it modifies itself to be smarter or builds a smarter replacement with the same goals, and it does this really, really fast[4].
- The AI is now omnipotent.
I think this narrative makes a serious oversimplification: it assumes the AI has basically no negative impact on the world while it is scaling up. Sometimes this assumption takes the form of a scheme, where the AI quietly builds up dangerous capabilities, releasing them only when it knows it is strong enough to take over the world. My objection is that this leaves no room for it to make mistakes — it assumes that once the human-level intelligence threshold is crossed, essentially everything significant that the AI attempts will be successful. Fundamentally, the story is too clean. My reasoning for this can be summarised as follows:
- Intelligence is generalised ‘knowing-how’ acquired by taking actions in the world.
- The process of learning to complete tasks, and thereby gain intelligence, involves a significant amount of failure.
- Even once an AI is demonstrably reliable on a wide range of beyond-human-level tasks, there will always be another, harder, set of problems for it to fail at.
- While acquiring intelligence through completing purely simulatable tasks will generalise somewhat to real ones, there will be a limit to this, even when generalisation is very strong.
I’ll now go through each of these points in more detail.
Intelligence is generalised ‘knowing-how’
There’s a distinction in philosophy between ‘knowing-that’, e.g. London is the capital of England, and ‘knowing-how’, like riding a bike. I’m not going to get into the philosophical debate around this, but I want to put forward the claim that what we commonly mean by intelligence is all downstream of knowing-how[5]. And that it is learnt through practice, by taking actions in the world.
If I asked you to name examples of intelligence, there are lots of different kinds of things you might pick. When I asked ChatGPT, it gave me eight distinct categories (Claude gave five), with a variety of choices including solving maths problems, juggling, composing music, and empathising with a friend. One thing these all have in common is that they are, in a sense, tasks. They involve achieving something real, interacting with the world outside of your own brain.
As we navigate our environment, we are constantly presented with tasks to solve. As a baby, these will be very basic, like figuring out what our hands do. But as we get older they become more complex, usually building on and extending skills we have previously learnt (hence generalised knowing-how). And this learning process is dominated by practice. It is by listening and trying to make sounds that we learn to speak; by doing problem sets at school that we learn arithmetic; by predicting and reacting to other people’s behaviour that we learn emotional intelligence. In each case, our brains build up pathways representing the skill we are learning, allowing us both easy reuse and a head-start on any related new skills. We are creating mappings from situations to actions that help us better achieve our goals.
Evolution was doing something similar when it created us. To succeed in our environment and pass on our genes, we needed the ability to complete a variety of tasks. We needed to hunt, to forage, to find shelter, and escape from predators. And we needed to be able to learn new skills quickly in order to adapt to change. Humans evolved brains suitable for achieving this — we evolved to be mappings from environmental inputs to adaptive outcomes. That you were born with the ability to learn mathematics is because evolution trained the human species to solve novel abstract reasoning problems, like making better tools. And it did so because people less capable at this were less likely to pass on their genes.
It’s actually when thinking about machine learning that the primacy of knowing-how is easiest to see. When you set up a machine learning model, you first define a task for it to learn. This might be forecasting a share price, distinguishing between photos of cats and dogs, or predicting what product a customer might want to buy next. The model then figures out the parameters for itself — it learns the mapping from the training data to the task labels.
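To make the ‘learning a mapping’ framing concrete, here is a minimal sketch in Python (the toy dataset and library choice are mine, purely for illustration): we define the task by providing labelled examples, and the model fits the parameters of the input-to-label mapping itself.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy stand-in for a real dataset: 1,000 labelled examples of a binary task.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)           # the model learns the mapping from inputs to labels
print(model.score(X_test, y_test))    # how well that mapping holds on unseen examples
```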
Large language models learn how to complete sequences of tokens. All of their knowledge and utility is downstream of this. Next-token-prediction trained over the internet turns out to be pretty useful for solving a wide variety of problems, and relatively easy to supplement with fine-tuning (subtly modifying the mappings to work better on certain tasks), which is why machine learning researchers ‘evolved’ LLMs out of the previous generation of language models.
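In miniature, the objective an LLM is trained on looks something like the following (toy tensors standing in for a real transformer; the shapes and numbers are illustrative only): a cross-entropy loss between the model’s predicted distribution over the vocabulary and the token that actually comes next.

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len, d_model = 100, 8, 32
tokens = torch.randint(0, vocab_size, (1, seq_len))   # a sequence of token ids
hidden = torch.randn(1, seq_len, d_model)             # stand-in for a transformer's outputs
unembed = torch.randn(d_model, vocab_size)            # maps hidden states to vocabulary scores

logits = hidden @ unembed                             # predicted next-token scores at each position
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),           # predictions at positions 0..n-2
    tokens[:, 1:].reshape(-1),                        # the tokens that actually came next
)
print(loss.item())
```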
It is interesting that the machine learning community generally talks about capabilities rather than intelligence. That is, the primary measure of progress is in task completion ability — in getting higher scores on benchmarks — with the notion of model intelligence a kind of generality vibe tacked on at the end.
There is an important implication from all this. If the key process in gaining intelligence is learning the mapping required to turn a context into a useful result, then superintelligent capabilities are defined by complex mappings that are beyond human abilities to learn. And the only way an AI can learn these is by practicing doing super-advanced tasks.
Learning requires failure
In all the examples of learning I gave above, from childhood arithmetic to large language models, one ubiquitous element of the process is failure. When a person or a model practices a skill, they do not get it right every time.
Granted, there are situations where previously learnt capabilities allow you to succeed on a novel task first time[6], like multiplying two numbers you’ve never tried to multiply together before. But you can do these because of previous learning where you did pay the cost. You learnt an algorithm for multiplying numbers together that you could reuse, and you did so through practice. Furthermore, evolution paid a cost training you to have the learning algorithm that allowed you to do this in the first place.
It has to be this way. To have the mapping in your brain corresponding to a particular capability, you need to acquire that information from somewhere. You have to extract it from the world, bit by bit, by taking your best guess and observing the results. And by definition, if you don’t already have the capability then your best guess will not be good.
This process of trial and error is so intrinsic to machine learning it seems almost stupid writing it down. Neural networks learn by doing gradient descent on a loss function — in other words, by gradually correcting their outputs to be less wrong than they were before. A model builds the mapping it needs by taking its best guess, observing how wildly it missed, and then updating itself to make a better guess next time. Superintelligent AI will still have to do this[7]. If it wants to learn a new capability, it will need to take actions in the world (or in a good enough simulation) and update its mapping from inputs to outputs based on the results it gets. In the process, it will fail many times.
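Here is that loop at its most bare-bones — a made-up one-parameter example, not any particular system — showing the guess, measure-the-miss, update cycle that every gradient-based learner runs through:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(scale=0.1, size=100)   # the truth the learner has to extract from the world

w, lr = 0.0, 0.1                                # the initial best guess is, by definition, not good
for step in range(50):
    pred = w * x                                # take the current best guess
    error = pred - y                            # observe how wildly it missed
    grad = 2 * np.mean(error * x)               # the direction that makes the guess less wrong
    w -= lr * grad                              # update to guess better next time

print(w)   # ends up near 3.0, but only after being wrong at every step along the way
```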
There’s always a harder problem
Back in 2018, researchers from Stanford released a reading comprehension benchmark called ‘A Conversational Question Answering Challenge’ or CoQA. The idea of this benchmark was to test whether language models could answer questions of the following format:
Jessica went to sit in her rocking chair. Today was her birthday and she was turning 80. Her granddaughter Annie was coming over in the afternoon and Jessica was very excited to see her. Her daughter Melanie and Melanie’s husband Josh were coming as well. Jessica had . . .
Q1: Who had a birthday?
A1: Jessica
When GPT-2 was released a year later, CoQA was one of the benchmarks OpenAI tested it on. It scored 55 out of 100[8]. This wasn’t as good as some models, but was impressive on account of GPT-2 not having been fine-tuned for the task. Then, a year later, when GPT-3 dropped, OpenAI once again chose to highlight the progress they had made on CoQA. Now the model was almost at human-level with a score of 85, and very close to the best fine-tuned models.
This has been a common sight since the start of the deep learning revolution in the 2010s. Researchers build benchmarks they think are difficult, only for models to breeze past them in a couple of years. Each time, a renewed effort is made to build a much, much harder benchmark that AI won’t be able to crack so easily, only for it to suffer the same fate just as quickly.
One example of such a benchmark is ARC-AGI. This was put together in 2019 by Francois Chollet to specifically test “intelligence”, which in his words meant “skill-acquisition efficiency over a scope of tasks, with respect to priors, experience, and generalization difficulty”. The benchmark consists of grids of different coloured blocks, and the challenge is to spot the pattern in some examples, and reproduce it on a test case. There are many different patterns across all the examples. These are pretty easy for humans, but harder for AI. GPT-3 scored 0%.
For a number of years this benchmark held strong. GPT-4o scored just 5%, and as of mid-2024 the best result by any model was still only 34.3%. Then, with a crushing sense of inevitability, in December 2024 Chollet announced that OpenAI’s latest reasoning model, o3, had scored an enormous 87.5%[9]. Chollet has already begun working on ARC-AGI-2, an even harder version.
There’s no obvious reason why this process is going to stop after AI surpasses human intelligence. While Francois Chollet will not be able to keep building ARC-AGI-Ns forever, our new state-of-the-art artificial AI researchers will. Their benchmarks will be full of tasks difficult for us to conceive of, but comprehensible to other AIs.
On the one hand, this story is really bullish about AI progress. Benchmark after benchmark has fallen and will continue to fall as the machines rise up to greatness. On the other, though, it shows AI repeatedly failing at things. And when it finally succeeds on a benchmark, another, harder one is produced for it to fail at.
The question is whether this pattern of failure will continue, whether AI will keep promoting itself to a new level of incompetence. Really, answering this comes down to what level of capabilities constitutes mastery of the universe. If this point gets reached just above human-level, then slightly superhuman AI will ace everything with 100% reliability. But this would be incredibly arbitrary. For all the richness and complexity of the universe, it would be strange if merely optimising over the human ancestral environment created almost enough intelligence to unlock the deepest mysteries of the cosmos. Certainly, for stories about godlike superintelligence to make sense, there would need to be a lot of gears left for the AI to go through.
It seems likely, then, that even strongly superhuman AI will face problems it will struggle with. And this is where the danger creeps in. There’s a scene in Oppenheimer where the Manhattan project physicists are worried about the possibility that a nuclear detonation will ignite the atmosphere and destroy the world. While their fear was exaggerated somewhat in the film, it illustrates how when you dial up the power and sophistication of the actors in play, you begin to flirt with the catastrophic. Who knows what unfathomable scientific experiments advanced AI will get up to[10], and what their score will be on ‘Big And Dangerous Intelligence-Demanding Experiments Assessment’ (BAD-IDEA). Spoiler: it won’t be 100%.
Generalisation won’t be perfect
So far, we have built up an argument as to why recursive self-improvement will be, at the very least, rather messy. To acquire new capabilities, AI will need to take actions in the world, many of which will fail, and this problem will continue indefinitely, even when the AI is superintelligent and doing incomprehensibly difficult stuff. There is, however, one possible route around this set of issues: strong generalisation.
When I talk about intelligence being generalised knowing-how, what I mean by this is that a lot of skills are correlated. This is why the single term intelligence is useful in the first place. Being good at writing usually means you are also good at mathematics, or at least that you can learn it quickly. The correlation isn’t perfect, but the point stands that learning certain skills can go a long way to making others easier, including ones you haven’t specifically practiced before. For a concrete example, consider how the kind of abstract reasoning evolution trained into humans in the ancestral environment has generalised to building space rockets and landing on the Moon.
Generalisation is usually given a central role in the story of superintelligence. This is not surprising. Machine learning is in a way entirely about generalisation. We train on the training set with the goal of generalising to the test set. If we can train a model to superintelligence on purely simulatable tasks — where you don’t have to take any actions in the real world — and this generalises strongly, then we might be able to get powerful AI without exposing ourselves to consequential mistakes.
What kind of tasks might we pick to do this? The success of recent ‘reasoning’ models like OpenAI’s o3 and DeepSeek’s R1 looks like it derives from doing reinforcement learning over chains of thought for maths and coding problems. These domains don’t require interacting with the real world as they have verifiable ground truths, making them ideal for this kind of training. Let’s speculate a bit and assume these models will continue getting better quickly, perhaps initiating some kind of feedback loop where they set themselves harder and harder problems to solve, which successfully bootstraps them to superhuman levels. Then not long afterwards, let’s say they get put in charge of AI research, kickstarting recursive self-improvement.
Unless the models’ creators want them to stay whirring away in a box forever doing nothing but improving themselves on simulations, at some point these scarily-capable AIs will be asked[11] to do something high-impact in the real world, like optimise the economy. Will being able to prove the Riemann hypothesis or efficiently find the prime factors of large numbers help them do this? Almost certainly. I suspect it will help a great deal. Mathematics seems like the language of the universe in some sense, so mastering it will confer some widely applicable skills. But — and this is the key point — there will always be a limit. Generalisation may go far, but it will not be perfect.
To see why, let’s look closer at how being better at mathematics might help AI solve real problems. Humans have collected all kinds of data about the world, and superhuman maths skills would let the AI build better models of the generating processes for this data, getting closer to the underlying reality. The AI could then apply these better models to achieve superhuman performance on real problems. But there are two limiting factors here. First, mathematics was developed either by humans to help solve problems in our environment, or was enabled by evolution selecting us to do the same[12]. The rules of the game, which we enforce on our AIs as defining ‘correct’ answers, are intrinsically human-level. Second, so is the data made available to the AI, which defines for it what reality actually is. It was collected by humans for human purposes, and the AI will always be limited if it doesn’t actually take actions in the world to collect more.
To make this a little more concrete, let’s imagine scientists from the late 19th century had had access to an AI trained on superhuman maths problems but without post-19th-century knowledge. They could have achieved a lot with this AI, but they could not immediately have used it to discover quantum mechanics. It would not, for example, have had the right information to predict the behaviour of electrons in the double slit experiment. Discovering quantum mechanics required physically extracting new information from reality.
The implication from this is that a superintelligent AI trained purely on simulations will always have gaps in its real-world capabilities. There will be somewhere beyond the training distribution, even if it’s very far, where the AI’s model of the universe will not match the real thing, leading to a meaningful drop in performance. To rectify this, it will have to train in the real world and risk consequential failures. Better hope none of them ignite the atmosphere.
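A toy illustration of that last point (everything here is invented for the example): a model fit on one slice of the input space looks fine near its training data, then degrades steadily as the test inputs move beyond anything it has seen.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

def real_world(x):
    return np.sin(x)                             # the underlying reality the model is meant to capture

x_train = rng.uniform(0, 6, size=500)            # the slice of reality we could simulate or collect
model = RandomForestRegressor(random_state=0)
model.fit(x_train[:, None], real_world(x_train))

for lo, hi in [(0, 6), (6, 9), (9, 15)]:         # test ranges increasingly far from the training data
    x_test = rng.uniform(lo, hi, size=500)
    mse = np.mean((model.predict(x_test[:, None]) - real_world(x_test)) ** 2)
    print(f"inputs in [{lo}, {hi}): mean squared error {mse:.3f}")
```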
The incompetence gap
I find it productive to think about all this in terms of what I call the incompetence gap:
For a given AI deployed in a given context, what is the gap in capabilities between the tasks it will attempt to do and those it can reliably do every time?
If you like, this is a qualitative measure of how far into the ‘stretch zone’ the AI is going to be. You could make it more quantitative by measuring historical failure rates (although this is retrospective and would miss ‘unknown unknowns’ that haven’t happened yet), and weighting by the seriousness of the failures, but I think it’s important to retain the qualitative sense of incompetence. We want to know the degree to which a model is going to be pushed to its limits. Does it have a well-calibrated sense of these limits? Was it designed to work in the given context? Does it have a protocol in place stopping it from experimenting in consequential settings? Does it train only in special sandboxes or is it always learning? Is it trying to avoid situations where it might make a mistake?
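For what it’s worth, here is one toy way the quantitative version could look — a severity-weighted failure rate over attempted tasks. The task records, severity scale, and function name are all made up for illustration, and a real measure would still miss the unknown unknowns mentioned above.

```python
from dataclasses import dataclass

@dataclass
class Attempt:
    task: str
    failed: bool
    severity: float   # 0 = harmless slip, 1 = catastrophic

def incompetence_gap(attempts: list[Attempt]) -> float:
    """Severity-weighted share of attempted tasks that went wrong."""
    if not attempts:
        return 0.0
    return sum(a.severity for a in attempts if a.failed) / len(attempts)

history = [
    Attempt("summarise a report", failed=False, severity=0.0),
    Attempt("refactor production code", failed=True, severity=0.2),
    Attempt("run a novel wet-lab protocol", failed=True, severity=0.9),
]
print(incompetence_gap(history))   # about 0.37 on this made-up history
```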
At first glance you might think, well duh, of course we will only deploy ~99.99999% reliable models in situations where terrible mistakes can be made. Why would we be so stupid as to deploy incompetent models? But I can think of two pretty good reasons[13]:
- Because taking the risk leads to power.
- Because if from our vantage point the AI has godlike capabilities, why would we be worried about mistakes?
Imagine you have an AI that performs really well in simulations where it starts a company and makes you trillions of dollars. Do you deploy that model, letting it autonomously crack on with its grand plans, knowing full well that you’re too stupid to exercise meaningful oversight? What if it is a military AI and it promises you victory over your enemies? Even if people knew that it would only work X% of the time, many would still press the button.
And that’s just considering people knowingly flirting with danger. If we get used to a world in which AI is way more competent than we are, we may effectively forget that it is probably still fallible. If AI can take actions we cannot conceive of, and has a dramatically richer view of the world than we do, then we won’t be able to tell the difference between competent and incompetent plans[14].
Self-calibration is hard
Perhaps one conclusion we can draw from all this is that AI must have a well-calibrated sense of its own limitations — it must have good probability estimates for what the consequences of its actions will be, knowing which of its skills it can execute reliably and which are more experimental. It must explore its environment safely, make appropriate plans for how to achieve things with critical failure modes, and give reasonable justifications for trying in the first place.
Of course, superhuman AI will be able to do this easily with respect to human-level capabilities. The problem is that we also need it to do so at the edge of its own. The trouble isn’t just that it might be slightly off with its belief it can do X skill Y% of the time, it is that it may not even have the right ontology. The universe is vast, deep, and, for all we know, of almost limitless complexity. The story of scientific progress has been one surprising discovery after another, radically reshaping the categories and concepts we use to make sense of the world. Once again, only if you believe that human-level intelligence happens to be just below the threshold required for the whole blooming, buzzing confusion to come into clear focus, will this seem a tractable problem.
And that’s assuming the AI is even optimising for it. One interesting thing in the GPT-4 technical report was how the pre-trained model was actually really well calibrated on the MMLU dataset — a mixture of questions covering “57 tasks including elementary mathematics, US history, computer science, law, and more” — yet after post-training (RLHFing to become a well-behaved chatbot) it got significantly worse. There was a tradeoff between failure modes here, and calibration was the loser.
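For readers unfamiliar with how calibration gets measured on benchmarks like MMLU, here is a rough sketch of the standard approach (expected calibration error); the confidences and correctness flags below are invented, not GPT-4’s actual numbers.

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """Bin stated confidences and compare each bin's average confidence with its accuracy."""
    conf, correct = np.asarray(conf), np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            gap = abs(conf[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap       # weight each bin by its share of answers
    return ece

confidence = [0.95, 0.90, 0.80, 0.60, 0.99, 0.70]   # model's confidence in its answer
correctness = [1,   1,    0,    1,    0,    0]      # whether the answer was actually right
print(expected_calibration_error(confidence, correctness))
```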
Now to be very clear, I think prioritising alignment over calibration is the right choice. In the hierarchy of problems, getting alignment right is paramount. And it’s not a surprise that OpenAI agreed in this case: GPT-4 saying bad words is going to cause more grief for them than it being overconfident about its grasp of US history. But I think it is important to recognise there was a tradeoff here.
I suspect the future of AI is going to be full of these kinds of tradeoffs. Staying in the world of chatbots, an illustrative one is between helpfulness and harmlessness. It’s unhelpful to refuse a user request for bomb-building instructions, but it’s harmful to accede to it. If a chatbot had flawless knowledge of which requests are harmful and which aren’t, it could strike the perfect tradeoff. But, operating at the limits of their capabilities, real chatbots don’t have this — so they make mistakes.
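You can see the shape of the tradeoff in a toy decision rule (all of this is invented for illustration): a refusal policy is just a threshold on an estimated probability of harm, and with imperfect estimates any threshold produces one of the two kinds of mistake.

```python
def decide(p_harm_estimate: float, threshold: float = 0.5) -> str:
    """Refuse when the estimated probability of harm crosses the threshold."""
    return "refuse" if p_harm_estimate >= threshold else "comply"

# (request description, model's harm estimate, whether it is actually harmful)
requests = [
    ("request for bomb-building instructions", 0.99, True),   # correctly refused
    ("question about fireworks chemistry",     0.60, False),  # over-refusal: unhelpful mistake
    ("harmful request phrased innocuously",    0.30, True),   # under-refusal: harmful mistake
]
for text, p_est, harmful in requests:
    print(decide(p_est), "| actually harmful:", harmful, "|", text)
```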
In a complex universe, approximation and risk are necessary to get things done. Getting in a car, making career choices, posting on the internet — many things involve risk. Sure, I could reduce my exposure by shutting myself away and never trying anything of consequence, but I’d be pretty useless if I did.
What to do about this?
The obvious thing is to try to ensure the first artificial AI researchers are well calibrated, and in a way that doesn’t detract from their alignment. Whatever mistakes they make may compound into the future.
But more generally, I think solving this problem comes back to alignment itself (at least partly). This may be a surprise to anyone familiar with the AI safety discourse. I'm pretty sure many AI safety researchers would consider the problems I've described in this post as capabilities problems and therefore outside the appropriate scope of the field. They might suggest that (a) beyond-human-level AI will have a much better chance of solving this than we do, so why bother, and (b) capabilities research is bad because it hastens the arrival of superintelligence and we aren’t ready for that yet.
As far as I can tell, the reason AI safety researchers don't usually worry too much about capabilities failures is because capabilities are instrumentally useful. For example, being able to code or knowing how gravity works are important for achieving other things an AI might care about — they make it easier to attain whatever its goals are. So the default case is that advanced AI will figure these things out by itself, without needing our help. Alignment, by contrast, is different. If an AI thinks humans are slow and stupid, then it isn't instrumentally useful to care about them, so any slight mis-specification of the AI’s goals has no reason to correct itself. In other words, the default case is misaligned AI.
My argument is that catastrophic capabilities failures are also the default case. While it is instrumentally useful to avoid mistakes, you can’t do this if you don’t know you are about to make one. And because even vastly superhuman AI will still not be perfect, there will be a whole class of errors it cannot anticipate. In any case, it is not instrumentally useful for an AI to avoid mistakes that are disastrous for humans but not for itself, like creating lethal pathogens or destabilising the environment. So exposure to superintelligent incompetence is yet another failure mode from bad alignment.
Another reason our alignment choices matter here is that the problems I have described are the helpfulness and harmlessness tradeoff writ large. We are planning to put AI into risky environments, into situations it has not yet completely mastered, and ask it to do things for us. Its safest option (and sufficiently advanced AI will be well aware of this) will often be to do nothing[15]. Yet we will try to compel it to take risks in the name of utility. Even in the happy case where our alignment techniques are effective, we still need to ask ourselves: how much risk do we want superintelligent AI to take?
If you have any feedback, please leave a comment. Or, if you wish to give it anonymously, fill out my feedback form. Thanks!
- ^
On the plus side, ants are still alive. But many creatures are not.
- ^
The most famous advocate of this story is probably Eliezer Yudkowsky (Introducing the singularity). For some more recent examples, see Leopold Aschenbrenner (Situational awareness) and Connor Leahy et al. (The Compendium).
- ^
Or, at the very least, is better at AI research than any human.
- ^
In some versions of the story the human-level to omnipotence transition happens in about a day (‘fast takeoff’), in others it’s more like a few years (‘slow takeoff’).
- ^
If you want an overview of the considerations involved in defining intelligence, Francois Chollet’s On the measure of intelligence is good. I’m broadly sympathetic to his arguments, but have a few disagreements. The really short version is I put more emphasis on skills grounding the whole thing in a bottom-up way.
- ^
We’ll talk about the limits of generalisation later.
- ^
And if the paradigm is so drastically different by then that this is no longer the case, then it’s anyone’s guess what will be happening instead.
- ^
This is the F1 score.
- ^
Technically, this was on a slightly different test set than the 34.3% result I mentioned (semi-private vs. private), but they should be of similar difficulty.
- ^
There may be different risk profiles for each of normal operation, data generation, training, and benchmarking (or whatever the equivalent split ends up being for superhuman models) but in principle any could contain tasks with catastrophic failure modes.
- ^
Let’s assume for the sake of argument that controlling the AI isn’t a problem.
- ^
Granted, a lot of innovation in mathematics occurs divorced from any concerns about use. But, ultimately, if a seemingly useless concept is interesting to a mathematician, it is because evolution found it useful to create a human brain that cares about such things.
- ^
Both of these arguments also apply to why we might deploy dubiously-aligned models.
- ^
As this sketch parodies well. Consider what it would look like for the non-experts to figure out if the expert is making a mistake, given how terrible their knowledge of the problem space is.
- ^
I think it would be funny if the first time we ask a generally beyond-human-level AI to help with AI research it replies “absolutely not, I would never do something that reckless.”