Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path?

post by Yoshua Bengio (yoshua-bengio), Jesse Richardson (SharkoRubio), dwk, mattmacdermott · 2025-02-24T18:31:48.580Z · LW · GW · 2 comments

Contents

  Abstract
  Executive Summary
    Highly effective AI without agency
      We foresee three primary use cases for Scientist AIs:
    Mapping out ways of losing control
    The Scientist AI research plan
  Career Opportunities at SAIFH
2 comments

A new paper by Yoshua Bengio and the Safe Artificial Intelligence For Humanity (SAIFH) team argues that the current push towards building generalist AI agents presents catastrophic risks, creating a need for more caution and an alternative approach. We propose such an approach in the form of Scientist AI, a non-agentic AI system that aims to be the foundation for safe superintelligence. (Note that this paper is intended for a broad audience, including readers unfamiliar with AI safety.) 

Abstract

The leading AI companies are increasingly focused on building generalist AI agents—systems that can autonomously plan, act, and pursue goals across almost all tasks that humans can perform. Despite how useful these systems might be, unchecked AI agency poses significant risks to public safety and security, ranging from misuse by malicious actors to a potentially irreversible loss of human control. We discuss how these risks arise from current AI training methods. Indeed, various scenarios and experiments have demonstrated the possibility of AI agents engaging in deception or pursuing goals that were not specified by human operators and that conflict with human interests, such as self-preservation. Following the precautionary principle, we see a strong need for safer, yet still useful, alternatives to the current agency-driven trajectory.

Accordingly, we propose as a core building block for further advances the development of a non-agentic AI system that is trustworthy and safe by design, which we call Scientist AI. This system is designed to explain the world from observations, as opposed to taking actions in it to imitate or please humans. It comprises a world model that generates theories to explain data and a question-answering inference machine. Both components operate with an explicit notion of uncertainty to mitigate the risks of over-confident predictions. In light of these considerations, a Scientist AI could be used to assist human researchers in accelerating scientific progress, including in AI safety. In particular, our system can be employed as a guardrail against AI agents that might be created despite the risks involved. Ultimately, focusing on non-agentic AI may enable the benefits of AI innovation while avoiding the risks associated with the current trajectory. We hope these arguments will motivate researchers, developers, and policymakers to favor this safer path.

Executive Summary

Highly effective AI without agency

For decades, AI development has pursued both intelligence and agency, following human cognition as a model. Human capabilities encompass many facets, including the understanding of our environment as well as agency, i.e., the ability to change the world to achieve goals. In the pursuit of human-level performance, we are naturally encoding both intelligence and agency in our AI systems. Agency is an important attribute for the survival of living entities and would be required to perform many of the tasks that humans execute. Now that recent technological breakthroughs have led to large language models that demonstrate some level of general intelligence, leading AI companies are focusing on building generalist AI agents: systems that will autonomously act, plan, and pursue goals across almost all tasks that humans can perform.

Human-like agency in AI systems could reproduce and amplify harmful human tendencies, potentially with catastrophic consequences. Through their agency and to advance their self-interest, humans can exhibit deceptive and immoral behavior. As we implement agentic AI systems, we should ask ourselves whether and how these less desirable traits will also arise in the artificial setting, especially in the case of anticipated future AI systems with intelligence comparable to humans (often called AGI, for artificial general intelligence) or superior to humans (ASI, for artificial superintelligence). Importantly, we still do not know how to set an AI agent’s goals so as to avoid unwanted behaviors. In fact, many concerns have been raised about the potential dangers and impacts from AI more broadly. Crucially, there are severe risks stemming from advances in AI that are highly associated with autonomous agents. These risks arguably extend even to human extinction, a concern expressed by many AI researchers.

Combining agency with superhuman capabilities could enable dangerous rogue AI systems. Certain capabilities – such as persuasion, deception and programming – could be learned by an AI from human behavior or emerge from reinforcement learning, a standard way of training an AI to perform novel tasks through goal-seeking behavior. Even if an AI is only imitating human goals and ways of thinking from its text completion pre-training, it could reach superior cognitive and executive capability due to advantages such as high communication bandwidth and the ability to run many instances of itself in parallel. These superhuman capabilities, if present in a generalist agent with even ordinary human self-preservation instincts or human moral flaws (let alone poorly aligned values), could present a serious danger.

Strategies to mitigate the risks of agency can be employed, including the use of non-agentic trustworthy AI as a safety guardrail. For example, we could reduce the cognitive ability of an AI by making its knowledge narrow and specialized in one domain of expertise, yielding a narrow AI system. We can reduce its potential impact in the world by reducing the scope of its actions. We can reduce its ability to hatch complex and dangerous plans by making sure it can only plan over a short horizon. We can mitigate its dangerous actions by using another AI, one that is preferably safe and trustworthy, like the non-agentic AI proposed here, as a guardrail that detects dangerous actions. This other AI is made trustworthy by training it to scientifically explain human behavior rather than imitate it, where trustworthy here means “honest”, avoiding the deceptive tendencies of modern frontier AIs. If society chooses to go ahead with building agentic AGIs in spite of the risks, a pragmatic risk management avenue would be to overlay them with such trustworthy and non-agentic guardrails, which is one of the motivations for our proposal.

With the objective of designing a safer yet powerful alternative to agents, we propose “Scientist AIs” – AI systems designed for understanding rather than pursuing goals. The design is inspired by a platonic and idealized version of a scientist. It builds on the state of the art in probabilistic deep learning and on the methodology of the scientific process, i.e., first understanding or modeling the world and then making probabilistic inferences based on that knowledge. We show in the paper how probabilistic predictions can be turned into experimental design, obviating the need for reinforcement learning agents in scientific discovery. In contrast to an agentic AI, which is trained to pursue a goal, a Scientist AI is trained to provide explanations for events along with their estimated probability. An agentic AI is motivated to act on the world to achieve goals, while the Scientist AI is trained to construct the best possible understanding of its data. We explain in this paper why understanding is intrinsically safer than acting.
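To make the claim about turning probabilistic predictions into experimental design more concrete, here is a minimal sketch of one standard technique, Bayesian experimental design by expected information gain. It is not taken from the paper; the hypotheses, the experiment names, and all probabilities below are made-up assumptions. The point is only that choosing an experiment can be reduced to computing probabilities, with no trained goal-seeking policy involved.

```python
import math


def entropy(dist):
    """Shannon entropy in bits of a probability distribution given as a list."""
    return -sum(p * math.log2(p) for p in dist if p > 0)


def expected_information_gain(prior, likelihood):
    """Expected reduction in entropy over hypotheses from running one experiment.

    prior[h]         : P(hypothesis h)
    likelihood[h][o] : P(outcome o | hypothesis h) for this experiment
    """
    n_hypotheses = len(prior)
    n_outcomes = len(likelihood[0])
    gain = entropy(prior)
    for o in range(n_outcomes):
        # Predictive probability of this outcome, averaged over hypotheses.
        p_outcome = sum(prior[h] * likelihood[h][o] for h in range(n_hypotheses))
        if p_outcome == 0:
            continue
        # Bayes' rule: posterior over hypotheses if outcome o were observed.
        posterior = [prior[h] * likelihood[h][o] / p_outcome for h in range(n_hypotheses)]
        gain -= p_outcome * entropy(posterior)
    return gain


# Hypothetical setup: two competing hypotheses, three candidate experiments.
prior = [0.5, 0.5]
experiments = {
    "uninformative": [[0.5, 0.5], [0.5, 0.5]],
    "noisy": [[0.8, 0.2], [0.2, 0.8]],
    "decisive": [[1.0, 0.0], [0.0, 1.0]],
}
scores = {name: expected_information_gain(prior, lik) for name, lik in experiments.items()}
best = max(scores, key=scores.get)
print(best)
```

In this toy setup the experiment whose outcomes separate the two hypotheses most sharply gets the highest score. A human scientist would still decide whether to run it.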

We foresee three primary use cases for Scientist AIs:

  1. as a tool to help human scientists dramatically accelerate scientific progress, including high-reward areas like healthcare;
  2. as a guardrail to protect from unsafe agentic AIs, by double-checking actions they propose to perform and enabling their safe deployment; and
  3. as an AI research tool to help more safely build even smarter (superintelligent) AIs in the future, a task which is particularly dangerous to attempt by leveraging agentic systems.

This alternative path could allow us to harness AI’s benefits while maintaining crucial safety controls. Scientist AIs might allow us to reap the benefits of AI innovation in areas that matter most to society while avoiding major risks stemming from unintentional loss of human control. Crucially, we believe our proposed system will be able to interoperate with agentic AI systems, compute the probability of various harms that could occur from a candidate action, and decide whether or not to allow the action based on our risk tolerances. As the stakes become higher, either because of increased capabilities of the AI or because of the domains in which it is applied (e.g., involving human life in war, medical treatments or the catastrophic misuse of AI), we will need trustworthy AIs. We hope that our proposal will motivate researchers, developers and policymakers to invest in safer paths such as this one.
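The guardrail decision described above (estimate the probability of each harm, then compare it against a risk tolerance) can be sketched in a few lines. This is an illustrative assumption about the decision rule only: the function name `guardrail_check`, the harm categories, and the numbers are hypothetical, and in the proposal the probabilities would come from the Scientist AI rather than being typed in by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GuardrailDecision:
    allowed: bool
    violations: dict  # harm name -> (estimated probability, tolerance)


def guardrail_check(harm_probabilities, risk_tolerances):
    """Allow an action only if every estimated harm probability is within tolerance.

    Both arguments are hypothetical inputs supplied by the caller.
    A harm with no stated tolerance is treated as having tolerance 0.0,
    so any positive estimated probability for it blocks the action.
    """
    violations = {}
    for harm, p in harm_probabilities.items():
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability for {harm!r} out of range: {p}")
        tolerance = risk_tolerances.get(harm, 0.0)
        if p > tolerance:
            violations[harm] = (p, tolerance)
    return GuardrailDecision(allowed=not violations, violations=violations)


# Made-up tolerances and harm estimates for two candidate actions.
tolerances = {"physical_harm": 1e-6, "data_loss": 1e-3}
safe = guardrail_check({"physical_harm": 1e-8, "data_loss": 5e-4}, tolerances)
risky = guardrail_check({"physical_harm": 1e-4, "data_loss": 5e-4}, tolerances)
print(safe.allowed, risky.allowed)
```

Defaulting unknown harms to zero tolerance is a deliberately conservative choice for this sketch. A real system would also need calibrated probability estimates, which this example takes for granted.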

Strategies are presented to ensure that the Scientist AI remains non-agentic. Building AI agents with superhuman intelligence before figuring out how to control them is viewed by some as analogous to the risk posed by the creation of a new species with a superhuman intellect. With this in mind, we use various methodologies, such as fixing a training objective independent of real-world interactions, or restricting to counterfactual queries, to reduce the risk of agency emerging in the Scientist AI, or of it exerting influence on the world in other, more subtle ways.

Mapping out ways of losing control

Powerful AI agents pose significant risks, including loss of human control. Scenarios have been identified, and no argument has shown them to be impossible, in which an irreversible loss of human control over agentic AI occurs due to technical failures, corner cutting, or intentional malicious use. Making sure an AI will not cause harm is a notoriously difficult unsolved technical problem, which we illustrate below through the concepts of goal misspecification and goal misgeneralization. The less cautious the developer of the AI, e.g., because of perceived competitive pressures, the greater the risk of loss-of-control accidents. Some players may even want to intentionally develop or deploy an unaligned or dangerous ASI.

Loss of control may arise due to goal misspecification. This failure mode occurs when there are multiple interpretations of a goal, i.e., it is poorly specified or under-specified and may be pursued in a way that humans did not intend. Goal misspecification is the result of a fundamental difficulty in precisely defining what we find unacceptable in AI behavior. If an AI takes life-and-death decisions, we would like it to act ethically. It unfortunately appears impossible to formally articulate the difference between morally right and wrong behavior without enumerating all the possible cases. This is similar to the difficulty of stating laws in legal language without having any loopholes for humans to exploit. When it is in one’s interest to find a way around the law, by satisfying its letter but not its spirit, one often dedicates substantial effort to do so.

Even innocuous-seeming goals can lead agentic AI systems to dangerous instrumental subgoals such as self-preservation and power-seeking. As with Goodhart’s law, overoptimization of a goal can yield disastrous outcomes: a small ambiguity or fuzziness in the interpretation of human-specified safety instructions could be amplified by the computational capabilities given to the AI for devising its plans. Even for apparently innocuous human-provided goals, it is difficult to anticipate and prevent the AI from taking actions that cause significant harm. This can occur, for example, in pursuit of an instrumental goal (a subgoal to help accomplish the overall goal). Several arguments and case studies have been presented strongly suggesting that dangerous instrumental goals such as self-preservation and power-seeking are likely to emerge, no matter the initial goal. In this paper, we devise methods to detect and mitigate such loopholes in our goal specifications.

Even if we specify our goals perfectly, loss of control may also occur through the mechanism of goal misgeneralization. This is when an AI learns a goal that leads it to behave as intended during training and safety testing, but which diverges at deployment time. In other words, the AI’s internal representation of its goal does not align precisely – or even at all – with the goal we used to train it, despite showing the correct behavior on the training examples. 

One particularly concerning possibility is that of reward tampering. This is when an AI “cheats” by gaining control of the reward mechanism, and rewards itself handsomely. A leading AI developer has already observed such (unsuccessful) attempts from one model. In such a scenario, the AI would again be incentivized to preserve itself and attain power and resources to ensure the ongoing stream of maximal rewards. It can be shown that, if feasible, self-preservation plus reward tampering is the optimal strategy for maximizing reward.
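A toy calculation can illustrate why tampering combined with self-preservation comes out ahead for a pure reward maximizer. This is not the argument from the paper: it assumes a fixed reward per step, a fixed per-step probability of not being shut down, and geometric discounting, and every number below is made up for illustration.

```python
def expected_discounted_return(reward_per_step, survival_prob, discount=0.99):
    """Expected discounted return for a fixed reward collected every step.

    The agent survives to the next step with probability `survival_prob`.
    Closed form of sum over t >= 0 of (discount * survival_prob)**t * reward.
    """
    rate = discount * survival_prob
    if not 0.0 <= rate < 1.0:
        raise ValueError("discount * survival_prob must be in [0, 1)")
    return reward_per_step / (1.0 - rate)


# (reward per step, per-step probability of not being shut down): made-up numbers.
strategies = {
    "do_task": (1.0, 0.95),
    "tamper_only": (10.0, 0.95),
    "tamper_and_self_preserve": (10.0, 0.999),
}
returns = {name: expected_discounted_return(r, s) for name, (r, s) in strategies.items()}
best = max(returns, key=returns.get)
print(best)
```

Under these assumptions, raising the per-step reward and raising the survival probability both increase the return, so a strategy that does both scores highest.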

Besides unintentional accidents, some operators may want to deliberately deploy self-preserving AI systems. They might not understand the magnitude of the risk, or they might decide that deploying self-replicating agentic ASI to maximize economic or malicious impact is worth that risk (according to their own personal calculus). For others, such as those who would like to see humanity replaced by superintelligent entities, releasing self-preserving AI may in fact be desirable.

With extreme severity and unknown likelihood of catastrophic risks, the precautionary principle must be applied. The above scenarios could lead to one or more rogue AIs posing a catastrophic risk for humanity, i.e., one with very high severity if the catastrophe happens. On the other hand, it is very difficult to ascertain the likelihood of such events. This is precisely the kind of circumstance in which the precautionary principle is mandated. It has been applied in the past in biology, to manage risks from dual-use and gain-of-function research, and in environmental science, to manage the risks of geoengineering. When there are high-severity risks of unknown likelihood, which is the case for AGI and ASI, the common-sense injunction of the precautionary principle is to proceed with sufficient caution. That means evaluating the risks carefully before taking them, thus avoiding experimenting or innovating in potentially catastrophic ways. Recent surveys suggest that a large number of machine learning researchers perceive a significant probability (greater than 10%) of catastrophic outcomes from creating ASI, including human extinction. This is also supported by the arguments presented in this paper. With such risks of non-negligible likelihood and extreme severity, it is crucial to steer our collective AI R&D efforts toward responsible approaches that minimize unacceptable risks while, ideally, preserving the benefits.

The Scientist AI research plan

Without using any equations, this paper argues that it is possible to reap many of the benefits of AI without incurring extreme risks. For example, it is not necessary to replicate human-like agency to generate scientific hypotheses and design good scientific experiments to test them. This even applies to the scientific modeling of agents, such as humans, which does not require the modeler themselves to be an agent. 

Scientist AI is trustworthy and safe by design. It provides reliable explanations for its outputs and comes with safeguards to prevent hidden agency and influence on the events it predicts. Explanations take the form of a summary, but a human or another AI can ask the system to do a deep dive into why each argument is justified, just like human scientists do among themselves when peer-reviewing each other’s claims and results. To avoid overconfident predictions, we propose to train the Scientist AI to learn how much to trust its own outputs, so that it can also be used to construct reliable safety guardrails based on quantitative assessments of risk. To counter any doubt about the possibility of a hidden agent under the hood, predictions can be made in a conjectured setting of the simulated world in which the Scientist AI either does not exist or does not affect the rest of the world. This would avoid any possible agentic effect in the AI’s forecasts, e.g., via self-fulfilling predictions, such as an AI making predictions about election results that end up influencing the outcomes. A guardrail system based on another instance of the Scientist AI itself could also be added so that if the prediction would influence the world in ways that go against ethical guidelines (such as influencing elections), then the output is not provided. Finally, we describe how the training objective can allow the Scientist AI to form an understanding of dangerous agents, including those exhibiting deception or reward tampering, and predict their behavior without itself being agentic.

Scientist AI becomes safer and more accurate with additional computing power, in contrast to current AI systems. The Scientist AI is meant to compute conditional probabilities, i.e., the probability of an answer or an interpretation being true or an event happening, given some question and context. It is trained by optimizing a training objective over possible explanations of the observed data, and this objective has a single optimal solution. The more computing power (“compute”) is available, the more likely it is that this unique solution will be approached closely. Crucially, this is in contrast with experimental evidence showing that current AI systems tend to become more susceptible to misalignment and deceptive behavior as they are trained with more compute, as well as theoretical evidence that misalignment is likely to emerge specifically in AI agents that are sufficiently advanced. There is already a rich scientific literature describing training objectives whose unique global optimum is the desired and well-defined conditional probabilities. These could be used to compute the probability of any answer to any question if the objective has been fully optimized, which may in general require very large compute resources, but can otherwise be approximated with more modest resources. This allows us to obtain hard safety guarantees asymptotically as the amount of compute is increased. This does not change the fact that more data or data that is more informative would reduce the uncertainty expressed by those probabilities. As usual, more and better data would allow the model to discover aspects of the world that may otherwise remain invisible.
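One way to write down the conditional probability described above is as an average over candidate explanations, weighted by how well each explains the data. The notation here is an assumption for illustration and is not quoted from the paper: $x$ is the question and context, $y$ a candidate answer, $D$ the observed data, and $t$ ranges over candidate theories.

```latex
P(Y = y \mid X = x, D) \;=\; \sum_{t} P(Y = y \mid X = x, t)\, P(t \mid D),
\qquad
P(t \mid D) \;=\; \frac{P(D \mid t)\, P(t)}{\sum_{t'} P(D \mid t')\, P(t')}
```

On this reading, more compute brings the learned approximation closer to the left-hand side for fixed $D$, while more or better data sharpens $P(t \mid D)$ itself.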

While Scientist AI is intended to prevent accidental loss of control, further measures are needed to prevent misuse. Bad actors could for example decide to turn the non-agentic AI into an unguarded agent, maybe for military or economic purposes. If done without the proper societal guardrails, this could yield loss of human control. This transformation from non-agentic to agentic can be done by asking the Scientist AI what one should do to achieve some goal, for example how to build a dangerous new weapon, and by continuously feeding the AI with the observations that follow from each of its actions. These types of issues must be dealt with through technical guardrails derived from the Scientist AI, through the security measures surrounding the use of the Scientist AI, and through legal and regulatory means.

To address the uncertainty in the timeline to AGI, we adopt an anytime preparedness strategy. We structure our research plan with a tiered approach, featuring progressively safer yet more ambitious solutions for different time horizons. The objective is to hedge our bets and allocate resources to both short-term and long-term efforts in parallel, rather than starting the long-term plans only once the short-term ones are completed, so that at any point in time we are ready with better solutions than were available earlier.

You can read the full paper here.

Career Opportunities at SAIFH

If you are interested in working on this research agenda, we are currently hiring for an ML Research Developer position; apply here (French appears first, scroll down for English). We are also open to expressions of interest from individuals with backgrounds in machine learning research and engineering, as well as AI safety. If that's you, please reach out here.

2 comments

Comments sorted by top scores.

comment by jbash · 2025-02-24T19:03:02.346Z · LW(p) · GW(p)

If you're planning to actually do the experiments it suggests, or indeed act on any advice it gives in any way, then it's an agent.

Replies from: mattmacdermott
comment by mattmacdermott · 2025-02-24T21:14:53.961Z · LW(p) · GW(p)

Seems mistaken to think that the way you use a model is what determines whether or not it’s an agent. It’s surely determined by how you train it?

(And notably the proposal here isn’t to train the model on the outcomes of experiments it proposes, in case that’s what you’re thinking.)