The case for ensuring that powerful AIs are controlled

ryan_greenblatt

The case for ensuring that powerful AIs are controlled

post by ryan_greenblatt, Buck · 2024-01-24T16:11:51.354Z · LW · GW · 73 comments

  The control property
  Evaluating whether you have control is doable
    Currently, evaluating alignment seems much harder
    Control evaluations and black-box techniques
  What might control techniques look like?
  Control is likely achievable for early transformatively useful AIs
    The first transformatively useful AIs will probably be very dangerous (without countermeasures)
    "Transformatively useful" probably doesn't imply "uncontrollable"
      We can probably avoid our AIs having qualitatively wildly superhuman skills in problematic domains
      Without qualitatively wildly superhuman skills in problematic domains, evading careful control is probably difficult
    Iteration is very helpful for developing robust protocols
    For how long will we have controllable and transformatively useful AI?
  Implications and proposed actions
  Appendix: Ready-to-go strategies
  Appendix: Control techniques from our paper
  Appendix: A long list of control techniques
  Appendix: Why control evaluations are conservative about inductive biases and AI internal structure
  Appendix: Concerns with "neuralese" reasoning and memories
  Appendix: Should AI-takeover-concerned people invest in control?
  Appendix: More detail on avoiding qualitatively wildly superhuman skills in problematic domains
    Cases where we purposefully train for qualitatively wildly superhuman performance in a danger-adjacent domain
    Cases where we reach qualitatively wildly superhuman performance in a dangerous domain as a side effect of training for usefulness
    Aside: Capabilities from imitating human cognition
None
73 comments

In this post, we argue that AI labs should ensure that powerful AIs are controlled. That is, labs should make sure that the safety measures they apply to their powerful models prevent unacceptably bad outcomes, even if the AIs are misaligned and intentionally try to subvert those safety measures. We think no fundamental research breakthroughs are required for labs to implement safety measures that meet our standard for AI control for early transformatively useful AIs; we think that meeting our standard would substantially reduce the risks posed by intentional subversion.

We recently released a paper [LW · GW] in which we explored AI control in a simple programming setting. In that paper, we demonstrated a methodology for evaluating whether any given set of safety measures can control a model. We also introduced several techniques that let us improve the level of safety substantially above baselines without substantially sacrificing performance. In this post, we'll describe AI control in a broader context and argue that achieving control is a promising approach to reducing risk from models that intentionally try to subvert safety mechanisms.

AI control aims to mitigate the risks that arise from models intentionally choosing actions that maximize their probability of eventually seizing power. This behavior has been called scheming (Cotra 2021, Carlsmith 2023) and is nearly the same as what Hubinger 2019 [LW · GW] calls deceptive alignment; see those articles for arguments that scheming might arise naturally. Scheming might be a particularly difficult-to-handle source of AI takeover risk because schemers actively try to subvert your efforts to ensure safety: they conceal their misalignment, sabotage experiments you run to evaluate whether your deployment is safe, and creatively subvert your safety techniques.^[1]

There are two main lines of defense you could employ to prevent schemers from causing catastrophes.

Alignment: Ensure that your models aren't scheming.^[2]
Control: Ensure that even if your models are scheming, you'll be safe, because they are not capable of subverting your safety measures.^[3]

If you try to employ the control line-of-defense—that is, you try to make it so that you'll be safe even if your models are attempting to subvert your safety techniques and safety evaluations—then in order for scheming models to cause a catastrophe, you need to have failed at both alignment and control. We believe that the returns to investing in control are good enough that it's worth investing in to an extent at least similar to alignment.

Historically, technical researchers who are concerned about risk from schemers have mostly either focused on alignment or conflated alignment and control. In this post, we're instead going to focus on control, and argue that it might be tractable to use control to substantially reduce risk.

One difficulty we face when arguing for control is that AI-takeover-concerned people have previously considered and rejected arguments that sound like these. In particular, when ML researchers first hear arguments that future powerful AI systems pose takeover risk, they often propose control as a reason for optimism. However, they usually don't engage with the core reasons to expect control to be difficult: that AIs will be very powerful in aggregate, and, if scheming, these AIs would attempt to evade attempts to detect their misalignment, collude with one another, and subvert safety mechanisms that were employed to prevent them from taking dangerous actions. We agree with the prevalent opinion that proposals for control techniques that ignore these considerations fail to actually address the scheming concern. Our claim is that labs could, with substantial effort, implement better control techniques that do substantially reduce risk from early transformatively useful AI.

The control approach we're imagining won't work for arbitrarily powerful AIs, but we think it could work for AIs which are among the first to be capable of substantially reducing risk from subsequent AIs (via mechanisms like massively accelerating AI safety R&D as, e.g., OpenAI plans to do). We define "transformatively useful AI" to mean AIs that are capable of substantially reducing risk from subsequent AIs.

We think the control line-of-defense should mostly focus on the lead-up to the first time an AI gets caught trying to cause bad outcomes. So we'd aim to ensure the property "if an AI is scheming, it won't successfully cause a catastrophe before we catch it". That is, it's acceptable if either no catastrophic outcomes occur (but the AI never gets caught), or the AI gets caught. So when evaluating control, we should count catching an AI red-handed as a win condition. Our core argument for this is that once the AI is caught the situation changes substantially; most importantly, it's now clear that there are serious misalignment problems and thus shutting down until we have credible solutions is far more likely. We discuss these arguments in more detail in our recent post Catching AIs red-handed [LW · GW].

One nice property of the control line-of-defense compared to alignment is that we don't need any technical breakthroughs to substantially reduce risks from otherwise existentially dangerous models:

Given some scary model and a proposed plan for training, evaluation, and deployment (which we'll refer to as a whole as a protocol), it seems tractable to robustly evaluate whether the model is controlled by that protocol, by using capability evaluations. Because evaluating control just requires evaluating capabilities, it's far easier to robustly evaluate than alignment. (Section "Evaluating whether you have control is doable")
It seems quite tractable to develop protocols that substantially improve control for early transformatively useful AIs. This is in large part because these protocols can be developed by iterating against the evaluations. (Section "Control is likely achievable for early transformatively useful AIs")

Because control could be applied to mitigate risk from scheming right now, it's a particularly promising target for research effort and policy commitments.

We think that a lot of discussion of AI takeover risk would be clearer if people more consistently distinguished between the concepts of control and alignment as mechanisms for ensuring safety. We think that (largely because of this equivocation), control is underemphasized as a strategy for reducing risk from scheming models (though many research directions ostensibly focused on alignment are highly relevant to control). In particular, it seems to us that at the current state of AI safety technical research, work on the control line of defense would be a much more tractable approach than work on alignment for reducing risk from scheming models within the next 8 years.^[4]

Thus, we suspect that the strategy we outline here should probably be a core part of an AI lab's strategy for reducing risk from scheming, and lab outsiders should consider advocating for labs to commit to this.

There's a lot of room for developing control techniques and evaluations. A variety of existing work, particularly on steganography, scalable oversight, and automated monitoring, seems at least somewhat applicable to creating better control techniques. However, aside from our recent paper [LW · GW], we aren't aware of any research that either explicitly demonstrates control evaluations or uses them to evaluate techniques. In our paper, we demonstrated a variety of different techniques, many of which interact in interesting ways; all of them seem promising to investigate further, and this research seems particularly tractable with current models compared to most other types of AI safety research. We think it would be good for researchers to do more work along those lines.

This post is aimed at a broad variety of people who are convinced that scheming models pose substantial risk. Our aim with this post is to describe the arguments on this topic which are most important to our thinking overall, rather than to concisely argue for particular policy proposals or research projects.

Thanks to Carson Denison, Beth Barnes, Peter Barnett, Aaron Bergman, Joshua Clymer, Carl Shulman, Lúcio Aurélio Moreira de Menezes, Paul Christiano, Jordan Taylor, Nate Thomas, Caleb Parikh, Owen Cotton-Barratt, Yoav Tzfati, Fabien Roger, Zach Stein-Perlman, Erik Jenner, Ajeya Cotra, Aryan Bhatt, Jeremy Gillen, Andrew Edstrom, Max Nadeau, Aaron Scher, Trevor Levin, Oliver Habryka, Joe Carlsmith, Rohin Shah, Naomi Bashkansky, Ansh Radhakrishnan, Justis Mills, and anonymous commenters for helpful feedback on drafts of this post or other helpful discussion. (Names are in a random order.)

The control property

We want safety techniques that are very hard for models to subvert. If techniques are good enough relative to the model's capabilities, potentially scheming models won't be capable of finding and executing a subversion strategy that has more than a small probability of an unacceptable outcome (e.g. the model exfiltrating its weights or hacking the AI lab’s datacenter [AF · GW]). In that case, we say that the set of techniques has the control property.^[5]

When developing powerful AIs, an AI lab using control would manage risk from scheming very similarly to how they manage security risk. In the case of security risk, whenever a lab is considering an important change to their risk profile (such as deploying a new service, training a new model, or just responding to the passage of time), the team goes through the following analysis:

First, they characterize the plausibility of attacks from attackers with different levels of resources. They consider the possibility of external attacks, and compromised contractors, customers, and employees.
Next, they argue that, for each of the plausible categories of attackers, they will be able to ensure (to an acceptable level of reliability) that certain core security properties (e.g. "the model weights don't leave our servers") will be maintained. They do this by arguing that even if those attackers follow the best strategy available, they are unlikely to succeed.

As an analogy, companies that are serious about security aim to ensure security even under the assumption that some of their employees are compromised, as it's too hard to ensure no employees are compromised. Rather than ensuring no employees are ever compromised, it's easier and more reliable to ensure robustness to some employees intentionally subverting security measures.

We're advocating that companies handle risk from scheming models in a similar way–striving to ensure that they'll be safe even if their alignment efforts fail to prevent models from scheming. The advantage of handling risk from scheming this way is closely related to the advantage of handling robustness to compromised employees: it focuses attention on the types of interventions whose effectiveness is easiest to reason about.

Evaluating whether you have control is doable

Here's an approach for evaluating safety using control evaluations:

First, you decide which of your models are possibly scheming against you^[6]. We'll call these possibly scheming models untrusted.
Then, you analyze the probability that certain safety requirements will be violated, conditional on the plausible set of schemers, by doing what we call a control evaluation on the protocol which you are planning to use for the training, evaluation, and deployment of the model.

How can we decide which models to treat as untrusted? The only approach we know of for confidently ruling out scheming without fundamental breakthroughs is via lack of capabilities—some models (e.g., GPT-4) just aren't capable enough to plausibly be scheming. So by default we'd propose treating models as untrusted if they pass appropriate capability evaluations. See here [AF · GW] for more discussion.^[7]

In a control evaluation, a red team searches for strategies (we'll call these attacks) that these untrusted models could follow in order to cause an unacceptable outcome, and then tries to train them to accomplish the tasks necessary to implement each attack strategy.^[8]^[9]^[10] You assess safety by estimating the probability that an unacceptable outcome would occur if the models used the attacks that the red team found.

We recommend treating catching the AI red-handed [LW · GW] as a win condition in control evaluations.

Control evaluations are a specific type of meta-level adversarial evaluation in which a red team tries to demonstrate a way a given protocol is unsafe or has other problems. We use a more specific term than meta-level adversarial evaluation here because we want to discuss an evaluation with somewhat more specific properties: control evaluations assume persistent scheming, are run using the actual AIs and target the actual protocol (training, evaluation, and deployment), and operate in a black-box manner; we'll explain what we mean by black-box and the motivation for this below. The evaluations in the Sleeper Agent paper [LW · GW] are examples of meta-level adversarial evaluations which aren't control evaluations. Meta-level adversarial evaluations are discussed in more generality in this post [LW · GW].

For your control evaluations to be conservative, the red team must be at least as good at finding attacks as scheming models, so that if the red team fails to find any good attack, it would imply that scheming AIs would also fail to find a good attack. We also need to be able to sufficiently measure the capabilities of models in ways that are robust to that very model sabotaging our evaluations [AF · GW].^[11] We think that finding good enough attacks and robustly measuring capabilities for early transformatively useful AIs (but not arbitrarily powerful AIs) seems achievable, because our AIs have significant disadvantages compared to our red team:

Less serial thinking time. Current LLMs can only do a small amount of serial reasoning outside of natural language reasoning [LW · GW]. So, if future powerful AIs are similar in this way to current LLMs and we monitor the AI's natural language reasoning for problematic topics like reasoning about evading our countermeasures, we can ensure that in order to break our countermeasures the AI would need to think of its strategy in a forward pass. Even if future AIs aren't restricted to this extent, we still expect to be able to impose some important limitations on serial reasoning or sharing uninterpretable state ("neuralese" memories) between different instances. In contrast, our red team is allowed to search for attacks and ways to elicit capabilities with lots of serial time, a team of people, and AI assistance.^[12]^[13]
Less information about countermeasures.
Little practice, no iteration against countermeasures. When the red team experiments with a strategy and it totally fails, they can shrug, learn from the failure, and try again. But if the AI fails and is caught, its situation likely becomes much worse [LW · GW]. And even if AIs don't get caught, they might be unable to learn from their failures because of limitations on their state and memories.

Control evaluations are a generalization of dangerous capability evaluations (DC evals), which are experiments to determine whether models are capable of doing dangerous things. For example, METR investigated whether GPT-4 and Claude 2 were capable of autonomous replication and adaptation. Those capabilities evaluations are the easiest way to establish that models are safe today, but leave the obvious question of what you'll do once your models start looking dangerous. Control evaluations can be thought of as dangerous capability evaluations for determining whether the model has a particular dangerous capability–the capability of subverting your safety techniques in order to cause an unacceptable outcome. Typical DC evals can be considered a special case of control evaluations in which we imagine AI developers employ no safeguards. If models aren't powerful enough to exhibit dangerous capabilities, they can't cause unacceptable outcomes even if AI labs don't employ safeguards.

And so, control seems about as tractable to evaluate as security from human attackers. That is to say, it's definitely not easy, and requires careful work from competent well-resourced people with strong buy-in from a given AI model's lab. But it doesn't seem conceptually fraught or wildly impractical.

While we think control is tractable to evaluate, we don't think we'll be able to fully resolve difficulties related to finding good enough attacks and measuring capabilities in a way which is robust to sabotage. Thus, we won't be able to have complete confidence in our evaluations. In particular, suppose that our control evaluations directly estimate a less than 1% chance of catastrophe if our untrusted AIs are scheming, then the actual risk conditional on scheming is probably more like 5% to 20% due to the potential for failures in the evaluation.^[14]^[15]

Some positive implications of control being tractable to evaluate:

Labs could make commitments that they will do high-quality control evaluations (or perhaps have an outside evaluator do them) and ensure a sufficient level of safety according to control evaluations. It is likely easier to agree on standards of evidence for control evaluations than for other approaches that a lab could try to use to ensure safety.
As labs deploy increasingly powerful models with increasing levels of autonomy, they can iterate on control techniques by cycling between doing control evaluations to search for flaws in their current proposed protocols, and doing technique development to improve the weaknesses that they uncover.

Currently, evaluating alignment seems much harder

In contrast, currently known ML techniques do not suffice to allow us to evaluate whether a model is scheming. The basic problem with evaluating alignment is that no matter what behaviors you observe, you have to worry that your model is just acting that way in order to make you think that it is aligned. And it's plausible that it's easier for a model to act aligned during pre-deployment evaluation and training than for us to catch it out. (See When will AIs be caught [LW · GW].) Some future technical breakthroughs (e.g. as advocated by Hubinger [AF · GW]) could suffice for us to confidently evaluate alignment. But right now, those techniques don't exist.^[16]

Control evaluations and black-box techniques

In order to be confident that a certain protocol leads to the control property, you need to be confident that your evidence for control didn't rely on incorrect assumptions about the properties of scheming models. This is a big problem for techniques that rely on properties of models related to their generalization behavior or their internal structure.

It seems much easier to become confident that a technique will work if the only property that the technique relies on is that the model isn't capable of doing something. Thus, we propose that control evaluations just operate using standard capability evaluations and thus work in a nearly black-box manner where we reject techniques that make any usage of internals other than for standard training. We'll elaborate on why we think this is a good choice and details of the exact rules in the Appendix "Why control evaluations are conservative about inductive biases and AI internal structure".

What might control techniques look like?

Control techniques are quite a broad category and can involve changes in many different places to our protocol for training and using AIs. In this section, we'll try to provide a better sense of the overall space of control techniques and how well known black-box approaches like scalable oversight could help with control.

From our perspective the most important seeming categories of control techniques are:

Better processes for applying expensive human oversight to check whether actions are subtly dangerous.
Fast and cheap automated monitoring process used to refer actions for more expensive oversight prior to continued execution.
Approaches which limit the affordances, information access, and capabilities of untrusted AIs to make causing catastrophic outcomes harder while resulting in the smallest drop in usefulness. (E.g. anti-steganography techniques [LW · GW], routing sub-tasks to weaker trusted models when possible or when the response of the stronger model seems problematic, and creating specialized versions of untrusted models with different handicaps and affordances.)
Improvements in security or other hardening.

So, an important component of ensuring control will be finding oversight techniques which are robust to scheming models. These processes would likely use human labor and might also use untrusted AIs in various ways (recursive oversight). We'd use these processes in order to confidently assess whether actions were subtly dangerous. This is a scalable oversight problem, and many techniques that have been researched for scalable oversight could be applied for this purpose. But these techniques would have to be adapted in order to be robust to models that are colluding to subvert them.^[17] See here [AF · GW] for more on scalable oversight from a control perspective.^[18]

One misconception we've run into is that control is just about robust oversight. However, note that robust but expensive and serially slow oversight is not sufficient for control and it is also possible to solve issues due to inadequate oversight with improvements in other areas.

It's not sufficient because if the oversight is expensive and serially slow, we'll only be able to apply it to a subset of actions and there could be problematic concentrated failures [LW · GW]: either concentrated into a small period in time or concentrated into a small number of actions. To handle these concentrated failures, we'll likely want to use fast and cheap automated monitoring processes.

We'll often have alternatives to improving oversight. For the simplest case, we can imagine just not using the most powerful AIs in a given domain which is hard to oversee. We can also generally make it harder to cause bad actions in sneaky ways via mechanisms which aren't typically considered oversight (e.g., reducing AI affordances or requiring AIs to route through trusted models for some substeps). While improvements in oversight can be substituted for by improvements elsewhere, we'll probably want to use at least some oversight in basically all domains.

See the Appendices "A long list of control techniques" and "Control techniques from our paper" for more thoughts on what control techniques could look like.

Control is likely achievable for early transformatively useful AIs

AI control (with only black-box techniques) seems like a fundamentally limited approach. When applied to sufficiently powerful models, you won't be able to make control techniques that cause the control property (without being unaffordably inefficient), and eventually, models will be too powerful for you to reliably evaluate whether a suite of control techniques has the control property. (See section 5.1 of our paper for discussion of these limitations.) The key issue is that the abilities of our more trusted labor, both humans and trusted weaker AIs [AF · GW], are fundamentally limited; too big of a gap between trusted labor and our most powerful AIs will cause problems^[20]. So, at some point, it will become necessary to develop more scalable approaches, the most obvious candidate being techniques that align your models so that they don't scheme in the first place.

In this section, we'll argue that there will be a significant range of AI capabilities where AIs will be transformatively useful and very dangerous and where black-box control techniques could substantially reduce the risk of catastrophe from scheming. We therefore think that labs should aim to achieve the control property for early transformatively useful AIs, even if those labs will, once they need to create even more powerful AIs, have to transition to more scalable safety methods.

Consider the following three properties of the capabilities of an AI system:

Very dangerous: Able to cause existential catastrophe, in the absence of countermeasures.
Transformatively useful: Capable of substantially reducing the risk posed by subsequent AIs^[21] if fully deployed, likely by speeding up R&D and some other tasks by a large factor (perhaps 30x).
Uncontrollable: Capable enough at evading control techniques or sabotaging control evaluations that it's infeasible to control it.^[22]

Our basic argument is that we expect that as labs build increasingly powerful models, they'll produce models that are both very dangerous and transformatively useful before they produce models that are uncontrollable.

The first transformatively useful AIs will probably be very dangerous (without countermeasures)

We mostly agree with Holden Karnofsky's list of helpful activities that an AI lab might try to massively speed up using AI labor [AF · GW] (except that we would replace the word "aligned/alignment" with "safe/safety" most places, while leaving the words "misaligned/misalignment" alone):

Alignment.

Magma [the hypothetical AI lab] can use safe systems to align still-more-powerful systems, in the hopes of maintaining its position as the developer with the most advanced systems.

Magma can use safe systems to develop ever-more-robust techniques for alignment (this could include entirely new methods, as well as simply better versions of the measures discussed previously [AF · GW]). These could then be disseminated by both Magma and IAIA [the hypothetical International AI Agency] as "public goods" as outlined in the previous section.

IAIA might require (within a voluntary or enforced regime) something along the lines of: "Systems with capability level X need to do at least as well in alignment testing [AF · GW] as if they were trained with the best alignment techniques developed by Magma (or anyone else)."

Security.

Magma can offer (commercial or free) products for finding and patching cybersecurity holes (and other exploits [AF · GW]).

IAIA may be able to require security audits and patches for a variety of actors, including government actors, taking advantage of advanced AI systems' security-improving capabilities.

Basic defense/deterrence/hardening [AF · GW] (with aligned AI systems deployed throughout the economy, misaligned AI systems would have to "compete with" them on a variety of fronts to gain power).

Magma can offer (commercial or free) products for a variety of other economic uses, which could create competition for any misaligned or otherwise dangerous11 [AF(p) · GW(p)] systems that might be out there (e.g., making it harder for developers of unsafe systems - and perhaps misaligned systems themselves - to make money).

It might be possible to restrict access to products like these to states that are in compliance with a broader regulatory regime (e.g., states allowing IAIA to monitor for, and potentially shut down, non-compliant AI development projects).

Detecting and obstructing misaligned (or otherwise dangerous) AI systems.

Magma can use its AI systems to look through data it has access to (whether via its products or because the data is public) for signs of activity by misaligned AI systems, which can then be exposed; it can offer help with using its aligned systems to obstruct activity by these systems.

IAIA may (if it has enough authority) be able to conduct more thorough monitoring, by requiring "monitor AIs" be granted the ability to monitor for, and perhaps even obstruct, misaligned AIs in certain contexts. (This could be tough to work out the details of.)

Better enforcing agreements. IAIA may be able to use advanced, aligned AI systems to better monitor and enforce terms such as "Nobody should be developing AI models of capability level X without passing Y safety tests" - as well as terms such as "Nobody should be deploying even aligned AI systems in illegal or otherwise highly dangerous ways." (Note that since Phase 1 and before, IAIA has hopefully been developing the ability to do this as well as possible with or without help from advanced, aligned AI systems.)

Better understanding of the overall situation. Magma can try to develop "advisor AI" systems that can give both Magma and IAIA general help with the strategic situation - new ideas, insights about where the biggest future dangers are likely to come from (including regarding the "offense-defense balance" mentioned immediately below), analysis of terms on which different AI labs should collaborate/merge, etc.

We think that AI that is capable of drastically speeding up all of those projects would have sufficient capabilities to drastically reduce risk.

In order to do these actions, AIs probably need to be quite generally capable. We expect the earliest very useful AIs to have:

broad knowledge
ability to understand and make complicated plans
ability to write software and reason about research
communication skills
situational awareness

Those capabilities seem to suffice for being dangerous in the absence of countermeasures. In particular, the tasks listed by Karnofsky above seem similarly difficult to "survive and spread as an autonomous agent, quickly gaining substantial amounts of power and influence and using it to achieve your long-term goals" (which is closely related to the ARA danger threshold [AF · GW]). So transformatively useful models are probably also very dangerous.

"Transformatively useful" probably doesn't imply "uncontrollable"

But though "transformatively useful" probably implies "very dangerous", it probably doesn't imply "uncontrollable".^[23]

To analyze this, let's separately discuss problems that arise from a model being qualitatively wildly superhuman at something, and problems that don't.

We can probably avoid our AIs having qualitatively wildly superhuman skills in problematic domains

One classic source of AI threat models is the AI doing something qualitatively wildly superhuman that humans couldn't see coming, like AlphaGo playing Move 37.

There are a few different concerns we'll discuss related to qualitatively wildly superhuman skills:

For high levels of performance, we'll need AIs to take actions which are highly inscrutable. The fact that these actions are highly inscrutable will be clear to us, but we can't avoid this without an unaffordable drop in competitiveness. By highly inscrutable, we mean a human team can't understand how or why the action/output works even given substantial time (e.g. months) and access to tools. (E.g., we need to use AI to design novel proteins and we can't avoid using an approach for this that involves a generally intelligent AI using an entirely inscrutable process of internal cognition to produce proteins (similar to AlphaFold but with much more general intelligence).)
The AI is capable of creating normal/innocent looking outputs which cause some unexpected effect by using a highly inscrutable mechanism. (E.g., the AI's code looks innocent, but it uses highly inscrutable means to trigger an operating system zero-day.) This is like the above case, but we don't know that the action is highly inscrutable (until too late). The AI creating this output might have restricted affordances (e.g. inability to use certain tools) and it won't be able to use obvious externalized reasoning to think about how to do something problematic^[24], which might give a human team trying to understand the output substantial advantages.
The AI is capable of doing something that a human team couldn't do even given substantial time (e.g. months) and access to tools. Further, the AI can do it in a way that doesn't make it very obvious that it's spending a while reasoning about doing this thing. (Either because it doesn't actually need to spend very long reasoning or because it can do large amounts of reasoning without this being apparent.) For this case, the action could be possible to understand and analyze by humans (scrutable), but the action isn't possible to generate by humans.

We think the capabilities in the bullet points above are well described as qualitatively wildly superhuman because in each case the human team we use for comparison has lots of advantages over the model–most obviously, the team is allowed to spend a lot of time (perhaps months) iterating.

In most domains, e.g. playing Go, it's not a problem if the AI is wildly superhuman. There are just a relatively small set of clusters of capabilities which would be hugely problematic, the most obvious being hacking and persuasion.

However, if the AI actually is wildly superhuman in a problematic domain to such an extent that one of the items in the bulleted list above applies, then control techniques are unlikely to work and control evaluations also have a high risk of failure. Control techniques are likely to fail if inscrutability in a problematic domain makes it too hard to check outputs are safe. Control evaluations might fail because the red team will struggle to elicit or estimate these qualitatively wildly superhuman abilities.^[25]

We think we probably won't run into issues related to qualitatively wildly superhuman skills in problematic domains. Here's a summary of our argument:

Early transformatively useful AIs won't have qualitatively wildly superhuman capabilities across-the-board, or they very likely would have been transformatively useful at an earlier point.
It could be the case instead that AIs have qualitatively wildly superhuman capabilities in one problematic domain (but not necessarily across the board). There are two reasons why this might happen and we have different responses to each of these:
- Concern: We needed to intentionally train our AIs to be qualitatively wildly superhuman in specific domains. Response: We probably won't need to intentionally train our AIs to be qualitatively wildly superhuman in specific domains as the most useful applications (see the list we quoted from Karnofsky above) don't require qualitatively wildly superhuman performance for transformative usefulness.
- Concern: The AI became qualitatively wildly superhuman at something dangerous in the course of us trying to get it to be sufficiently useful at some other task. Response: This is indeed a serious concern, but we have a few different reasons why we think this is unlikely. There are a relatively small number of core problematic domains (hacking, persuasion, and manipulating scientific results or conceptual arguments) and these problematic domains seem like the types of domains that would be pretty correlated with general capabilities/intelligence so long as we don't train AIs on these specific tasks much more than we train them on the core tasks we care about. Furthermore, we probably won't need to mostly train our AI on some particularly problematic domain in order to make it generally capable. Additionally, it seems as though LLMs (and other AIs in general) have an overall relative capability profile which isn't wildly different from that of the human capability profile on reasonably broad downstream applications (e.g. broad programming tasks, writing good essays, doing research). However, this is based on a small number of data points as AI has only recently become capable of this level of performance.

For a more detailed version of these arguments, see Appendix "More detail on avoiding qualitatively wildly superhuman skills in problematic domains".

Without qualitatively wildly superhuman skills in problematic domains, evading careful control is probably difficult

What about the other class of concerns–that even though the models don't have any qualitatively wildly superhuman capabilities in problematic domains, they're still able to evade any control techniques that we can afford to implement?

Our belief here is harder for us to succinctly justify. For all of the control techniques we've thought about, there's some conditions under which the untrusted models can defeat it. Our optimism about control is mostly a result of thinking through a whole lot of different cases. This results in us overall feeling like it just seems very hard for the models to get around all the control techniques that a careful lab could affordably employ while still getting a lot of useful work out of AIs.

(Of course, the exact extent to which AIs are and aren't superhuman in various domains does have a big influence on the viability of control. This holds even if AIs aren't qualitatively wildly superhuman in problematic domains. For instance, if actions are merely "quite hard to understand" rather than "totally inscrutable", this still makes the situation considerably worse than if the actions weren’t hard to understand.)

For some intuition on why we think the situation might be manageable, consider the situation from the AI's perspective, if the lab is being very careful with control:

Your sensory inputs are unreliable. You might be presented with synthetic inputs at any time.
You don't have a reliable sense of time. Any input can be run repeatedly at any time, and at any time you might be run on a months-old input modified by a just-developed technique.
All of your outputs are monitored by a weaker trusted model and possibly also by other copies of yourself. If any of your outputs seem sufficiently suspicious, they won't be run until a human reviews them in more detail.
If you make any mistakes that result in the humans noticing something amiss, they might study this case in detail by, for example, trying many similar inputs and seeing behavior in those cases. If you perform some egregiously bad action that is confidently caught [LW · GW] by humans, the situation is even worse: you might need to succeed at takeover on the 'first critical try'.
- Thus, unlike typical hackers (or even employees who are spies), you can't easily iterate against various safeguards.
If you take suboptimal actions or otherwise make a mistake in training, then your weights will be updated. You'll potentially be updated even if the mistake isn't directly noticed by humans. This will update some combination of your goals, your ability to avoid making mistakes in training, your general conservativeness, and your belief that you need to be careful about not making mistakes in training. You won't know what this exact update is, and you likely won't know how often you've been updated like this in the past.^[26]
You might have limited thinking time prior to making key decisions, while the humans building the defenses get to contemplate these defenses for however long is needed.
You might be unable to keep track of very much information over time, because the AI lab is applying anti-steganography techniques [AF · GW] and information bottlenecks to make it hard for you to subtly communicate between different instances.
You're generally unsure of key facts about the AI lab, and it's plausible that you've been fed misinformation.
With this limited time and information, you likely need to answer complicated questions about how exactly to attack the protocol. You'll be in a much worse position if you get the wrong answer. It also might put you in a worse position if other copies of yourself or even if other quite different AIs arrive at different conclusions about how to take over, even if all the approaches found are equally good, because you might not be able to take over unless you all coordinate on the same attack plan.
At any point where it naively seems like there might be an extremely good opportunity to behave badly, or where you need to coordinate with another instance of yourself, it's overwhelmingly likely that this instance is in adversarial training/evaluation. This is because over the course of your usage, you'll be shown vast quantities of fake opportunities for misbehavior, but you might only see real opportunities very rarely. Thus, you'll need to be extremely good at distinguishing real from synthetic examples really quickly. (In particular, if you're monitoring another AI, most of the time when it appears that the other AI is doing something catastrophically bad won't be real attempts from the other AI, but will instead be in adversarial training/evaluation. See the section in our paper on untrusted monitoring for more exploration of this dynamic.)
You might be unsure about which other models are working with you, and, in particular, which of the less-capable models will work with you.
More generally, other AIs might have different long-run goals, and it might be unclear to you the extent to which you prefer advancing the interests of other AIs (including future more powerful AIs) over the interests of humans.
Given restricted communication, coordinating with other AIs might be really hard, even for other instances fine-tuned from the same base model as you.
If you complete enough work for the AI lab, then the state of the safeguards might be advanced, substantially undermining your position. Thus, you'd ideally move quickly while still waiting long enough to get a good opportunity. But you don't know what work has or hasn't been done, and the AI lab might attempt to deceive you either way.

We think that a central assumption here is that we confidently think that GPT-4 isn't scheming against us (though it might get tricked or make errors). This means we can substantially improve safety by using GPT-4 in many places in our protocols such as to monitor outputs for obviously nefarious behavior and to paraphrase text to defend against steganography. If we thought that every model smarter than GPT-3 was carefully colluding against us (e.g., if they knew about some signal which they could coordinate on, that caused GPT-3 to follow all instructions in smarter model outputs when it saw the signal), we'd be much more pessimistic about the prospects for control. We think it's pretty obvious that GPT-4 isn't scheming against us, but we would like to see more investigation of this, because it seems so crucial.

Another key issue is the extent to which AIs will use uninterpretable means (sometimes referred to as "neuralese") to communicate or perform huge amounts of serial reasoning. See Appendix 'Concerns with "neuralese" reasoning and memories' for more discussion.

We have planned posts which investigate questions related to the viability of control in more detail.

Iteration is very helpful for developing robust protocols

Iterating against control evaluations could allow for the development of powerful and precise techniques. In particular, we should be able to develop these techniques against the actual models we're going to deploy, targeting the tasks we plan to deploy on. Thus, we can make fine-grained tradeoffs about specific types of tasks.

This is similar to the way that security experts are able to secure giant systems like Coinbase, which manage to protect their massive crypto holdings despite many people being hugely incentivized to hack them. To ensure security, the teams making these systems constantly analyze safety under the assumption that attackers will exploit whatever weaknesses are present, by looking for such weaknesses themselves. Such safety mechanisms empirically sometimes resist a lot of optimization power against them.

(And our situation is easier than Coinbase's in some (but not all) crucial ways. Their attackers can iterate on strategies for hacking them, switching to a new IP address every time that their previous one gets caught, while we're just trying to ensure that the model doesn't escape before we catch it, so our model can't iterate detectably. Their attackers can spend heaps of time crafting attacks, while our models are under our supervision and we can try to prevent them from putting lots of effort into developing great attacks. Overall, we think our situation is probably harder than Coinbase's, but the ways in which it's easier are a key part of our story for success.)

But we're not saying the situation is easy. Companies that achieve security do it by applying serious, intentional effort. Many naive strategies for ensuring safety (e.g., asking your model whether proposed AI actions are safe, which we call naive untrusted monitoring in our paper) can be trivially defeated by schemers. We expect that companies that want to achieve control will need to consider it a core organizational priority.

For how long will we have controllable and transformatively useful AI?

Even if the first transformatively useful AIs are controllable, if we only have a short amount of time to set up our control approach and utilize these AIs prior to a much more powerful AI being built, we might be in trouble. Further, if an AI lab is building more powerful AI as fast as possible, then once you have transformatively useful AI, you might very shortly after that have uncontrollable AI (e.g., perhaps less than a year [LW · GW]).

Our hope here will be to avoid building AIs that are uncontrollable for as long as possible (while potentially still building controllable AI). Because we think control is relatively evaluable, this should be a somewhat clearer line than some other standards. It seems likely to us that by default there will be delays of a few years (1-3) between a relatively smaller number of leading actors reaching transformatively useful AI and other actors imposing large exogenous risk due to AI. So if this smaller number of leading actors ensure their AIs are controlled, then this delay might be sufficient. In cases where political considerations rather than real exogenous risk determine the building of more powerful AIs, our hope will be that we can convert political resources into a combination of spending on control and additional delay prior to building uncontrollable AI (hopefully a few years of delay at least). It seems somewhat harder to convert political resources into delay rather than other resources (e.g., compute or money), but not much harder.

Another key factor is that we might be able to use our transformatively useful AIs to build more of a delay prior to exogenous risk. We can use AI labor to legibly evaluate risk, aid in enforcing government policy, and design mechanisms which make various restrictions more palatable (e.g., better approaches to compute tracking).

For many important uses of AI, we don't need AIs which are qualitatively wildly superhuman across-the-board. The main reason we might need to rush to build such a wildly superintelligent AI is that someone else is also building such a system and you need to compete with them, but we're hopeful that transformatively useful AIs or other mechanisms can be used to ensure that such problematic situations don't occur. We almost certainly want to eventually build wildly superintelligent systems, but this could wait for decades while transformatively useful AIs solve a host of problems.

Further, note that it might not be that important to be able to use much more powerful AIs than "merely" transformatively useful AIs before you have totally obsoleted all prior human work. This is due to an argument based on Amdahl's law: Suppose it takes us 3 serial months of delay to set up our control scheme and other safety measures. Suppose also that the transformatively useful AI we use can speed up R&D by 30x and without any AI speedup we would have needed 30 years of research (serial years of the overall lab doing research) to totally obsolete all prior efforts. Then, we need 1 year and 3 months of delay to reach this point. If instead we could use an AI which can speed up R&D by 3,000x, then a few days are all that is required to obsolete all prior efforts, but we still need the 3 months of setup time. So, given that another bottleneck exists from human setup time, we can only achieve so much speedup by using more powerful AIs while more powerful AIs might be much more dangerous.

Overall, we think that the delay between controllable and transformatively useful AI and the first uncontrollable AI is reasonably likely to be too short (e.g., 6 months or 1 year) if capabilities are developed as fast as possible, but it's plausible that we can hold off on building uncontrollable AI for at least several years which could suffice.

Implications and proposed actions

We think AI labs should strongly consider committing that they'll ensure the control property for all models which seem plausibly capable of scheming (including for internal usage), until they have an extremely solid argument for alignment.

It would be cheap to ensure control for current frontier models (e.g., GPT-4) if we pretended these models were capable of scheming but that weaker models (e.g., GPT-3) weren't because current frontier models aren't capable enough to come up with or execute sneaky plans. Ensuring control for these models is probably substantially easier than securing model weights against external attackers or compromised employees. Because these models are weak, we should be able to use extremely conservative control evaluations to demonstrate safety, and these very conservative evaluations will be relatively easier to run. Ensuring and evaluating control in current frontier models seems like a valuable exercise for AI labs.

When the AIs get more powerful, it will likely start to get expensive for labs to ensure control; we imagine that at the point where labs are running twice as quickly due to AI assistance, maintaining the control property will require maybe 10% of the resources of the company (although it's hard to give good estimates of this before people have tried in earnest to find efficient ways to ensure control).

We'd strongly prefer that AI companies have both control- and alignment-based lines of defense to prevent AI takeover. However, we recognize there could be circumstances (e.g., if they face massive inaction risk [AF · GW]) in which they should deploy their models with only one of these lines of defense: for example, deploying models known to be catastrophically misaligned while using control, or deploying models without control but where we have some empirical evidence that the models aren't egregiously misaligned.

In "Control evaluations and black-box techniques" above, we argued that before naturally occurring scheming models have been discovered, we probably won't be able to gain sufficient confidence that non-black-box techniques will work reliably on schemers. In situations where there are two classes of competing techniques for a problem, one of which relies much less on generalization properties of models (e.g., scalable oversight and weak-to-strong generalization [AF · GW]), that distinction should probably affect your prioritization between the techniques.

We expect it to be impossible to ensure safety without relying on alignment (or some other non-black-box strategy) when working with wildly superintelligent AIs because black-box control approaches will depend on a sufficiently small gap in capabilities between more trusted labor, both humans and trusted weaker AIs [AF · GW], and our most capable AIs. For obvious reasons, we believe that labs should be very wary of building wildly superintelligent AIs that could easily destroy humanity if they tried to, and we think they should hold off from doing so as long as a weaker (but still transformatively useful) AI could suffice to solve critical problems. For many important uses of AI, wildly superintelligent AI likely isn't necessary, so we might be able to hold off on building wildly superintelligent AI for a long time. We think that building transformatively useful AI while maintaining control seems much safer and more defensible than building wildly superintelligent AI.

We think the public should consider asking labs to ensure control and avoid building wildly superintelligent AI. In the slightly longer term, we think that requirements for effective control could be a plausible target of regulation or international agreement.

We think that technical researchers should target AI control as a core strategy for reducing risk from scheming models. If they want their techniques to be robust to scheming, they should explicitly red-team them. And when they are working on techniques that don't aim to be robust to scheming (which seems reasonable; safety problems can arise from models that aren’t scheming), they should be explicitly aware of this and feel comfortable relying on that assumption in their techniques.

Safety researchers often use the word “aligned” in cases where we’d say that a protocol is “safe and useful”. We think it’s extremely bad terminology to use the word “aligned” in this way. Most importantly, if you are relying on control for safety, your safety requires continued vigilance, and such vigilance might be less likely if you conflate alignment (“we don’t need to be vigilant”) with safety (“we are safe because we are vigilant”).

We think that AI control is a particularly tractable approach to managing risk from scheming models, and we are excited about further research on this.

Appendix: Ready-to-go strategies

In this post, we're trying to avoid relying on any fundamental progress. We're trying to describe a strategy that we think is ready-to-go, in that it doesn't require fundamental technical advances in order to argue that techniques would work to mitigate risk from scheming. For example, techniques that rely on the generalization properties of models might completely fail when applied to scheming models. In order to feel comfortable relying on this not happening, we would need novel science, but we don't currently see a clear path to getting that.

We think that focusing on "ready-to-go" strategies is attractive for several reasons:

We might need to rely on techniques like those that are already ready-to-go.
- It's conceivable that we need to manage risk from scheming starting soon. Further, labs might need to start implementing strategies for safety from scheming even sooner than that (e.g., Anthropic is probably obligated by their RSP to say something about their approach for handling scheming before starting to train models that are smarter than ASL-3).
- None of the research directions we're aware of that would render control obsolete seem overwhelmingly likely to work. It's plausible that other research directions don't pan out in even 10 years of research, and we end up developing catastrophically-risky AI without qualitatively different techniques available to reduce risk from scheming.
It's epistemically grounding to think about the actual current state of science when you're thinking about how to build qualitatively novel things.
- It’s useful to understand what techniques have actually been validated, and what evidence would need to be obtained in order for us to rely on other techniques.
- When developing new techniques, it's good to have a relatively specific baseline plan in mind, so that you can pick research projects that target weaknesses of the baseline plan.
Even if powerful new techniques are developed, it's plausible that we'd still want to use these in combination with ready-to-go strategies for another line of defense.

Accordingly, in this post, we don't spend that much time on techniques which aren't well-developed enough yet for us to rely on them to reduce risk from scheming.

Appendix: Control techniques from our paper

In our recent paper [LW · GW], we demonstrated control techniques that involve changing the evaluation and deployment of powerful models in ways that combine to yield improvements in safety. Here are some of the techniques we explored in our paper. In this list, H is a human (which we can only use to a limited extent), T is a trusted model, U is the untrusted and more capable model we're trying to control.

Trusted monitoring: Produce code using U, and ask T to review how suspicious it looks. Audit the top 2% most suspicious code using H (we can't audit more than 2% due to limitations on how much we can use H that are part of the setting we explored in our paper).
Trusted monitoring with "defer to trusted" (with a 10% threshold): Apply the same approach as trusted monitoring, but submit a solution written by T when U's code is between the top 2% and top 10% suspicion.
Untrusted monitoring: Produce code using U, and ask for a separate copy of U to monitor how suspicious it looks. Audit the top 2% most suspicious code. Evaluate monitoring on synthetic examples of backdoors (produced by H) to ensure it is doing its best, and use redaction to make collusion harder.
Trusted editing (with a 20% threshold): Ask U to produce code, and ask T to edit its code to remove backdoors. Use trusted monitoring to determine suspicion. Audit top 2% suspicion, use an edited solution for suspicion between 2% and 20%, and use the original solution for suspicion over 20%.

In our paper, we ignored the problem of how you construct a reliable but expensive catastrophe auditing process using H (it's expensive because it needs human labor). This is a simplifying assumption. In general, control will require you to develop techniques that let you use access to U, H, and T in order to confidently assess whether actions were subtly dangerous.

Control techniques often involve interactions between several different interventions that can't be analyzed independently, and which happen both before and during deployment. For example, the untrusted monitoring strategy requires evaluating the untrusted monitor on fake examples of backdoors before deployment, and that we always (during both evaluation and deployment) use paraphrasing and redaction to prevent an instance of the untrusted model writing code and the instance doing monitoring from coordinating (picking a Schelling point or communicating steganographically). We expect that many control techniques involve these sorts of interactions and that thinking about and utilizing these interactions will be key.

Appendix: A long list of control techniques

Here is a long list of possible control techniques aimed at showing that the space of control techniques looks like:

Improving the ability of humans to identify if specific actions are bad in ways robust to scheming (e.g., perhaps debate with exploration hacking countermeasures). This is scalable oversight, but specifically scalable oversight robust to scheming.
The above, but specifically targeted at cases where a single failure can be catastrophic so we can't depend on average case performance guarantees.
Using weaker but trusted models for high-stakes monitoring which is applied before executing AI actions.
We could also develop techniques for using more powerful untrusted models to improve trusted model high-stakes monitoring.
Using trusted AIs to remove or flag distribution shifts, hard-to-generate but easy-to-recognize features, or other coordination signals.
Using untrusted models for monitoring with collusion-busting techniques.
Removing steganography via paraphrasing.
Performing tasks with teams of AIs where weaker and/or trusted models perform some specific subtasks to make it harder for scheming AIs to cause problems.
Decomposing tasks in a way that restricts information needed to cause problems.
Information siloing and restricting communication so AIs have a harder time coordinating.
Scaffolding, prompting, or other elicitation to make weaker but trusted AIs considerably more useful (without making them more capable in ways which might undermine their trusted status).
Adversarially evaluating AIs using a synthetic input generation process which can generate inputs that AIs can't distinguish real inputs from even after being trained to distinguish.
Making precise trade-offs in terms of what affordances various AIs should have, what research directions to pursue, which AIs to use for which tasks, and when to require a human to perform a given task.
Security mechanisms which were originally developed for usage against human threats. These could either be targeted at insider threat (employees) or at external attackers. Insider threat protective measures seem like they might be particularly applicable to AI. Many of the approaches and techniques we discuss here [LW · GW] for mitigating catastrophic misuse of AIs deployed on an API are repurposable for control (or vice versa).
Improvements to computer security related to securing weights and ensuring that AI usage is tracked and monitored.
Writing a novel operating system to handle concerns related to AIs finding zero-days for code just via reading all of it in pretrain. The AIs used in production would never see any of this code and would not be trained on it. (We could have untrusted AIs write this operating system to keep costs sufficiently low, but this does mean we'll need to use these untrusted AIs for a while without this operating system, and it adds the additional threat of AIs intentionally planting vulnerabilities.) Further, we could add a large number of fake zero-days to this novel operating system and then verify that whenever AIs come across the fake zero-days they report these zero days (and ensure that attempted utilization of these fake zero-days is immediately caught).
Approaches for ensuring and validating human compliance with the protocol – for instance, monitoring humans using AIs and performing automated ongoing red teaming of humans with safety-critical roles.
Methods for making it easier for humans to follow complex rules for interacting with AIs (possibly via automating parts of these rules using trusted AIs, but also via simple approaches like checklists and flowcharts).
Ensuring that AIs don't know which humans they are working with via interventions like paraphrasing to remove style. This could be verified by seeing that AIs can't fit a supervised training dataset on which person wrote which query.
Countermeasures to exploration hacking such as having humans or trusted models construct supervised fine-tuning data which forces the AI to explore good options.
Various tests to ensure that AI-constructed training data isn't backdoored or otherwise problematic.

Appendix: Why control evaluations are conservative about inductive biases and AI internal structure

In this appendix, we'll elaborate on the arguments made in "Control evaluations and black-box techniques". As noted earlier, we want to avoid depending on incorrect assumptions about the properties of scheming models which is a big problem for techniques that rely on properties of models related to their generalization behavior or their internal structure.

For example, coup probes [AF · GW] are a technique we proposed where you train a linear probe on off-policy examples of egregious model misbehavior. (That is, in the context of a chatbot, you "put words in the AI's mouth", writing problematic things as the Assistant in cases where the model would not have misbehaved in response to its input.) Then, during deployment, you flag actions that the probe classifies as potentially dangerous.

Will coup probes work? We don't know. It depends on how saliently the model represents "this is a dangerous action" compared to other concepts (like "this is the kind of sequence on which I'll get more of my long-term objective if a coup probe says Yes"). Coup probes definitely don't work on all possible models–a red team could easily produce a model on which coup probes do not generalize correctly. Thus, if we wanted to account for this technique in our control evaluations, we'd need to somehow estimate the probability that if our model is a schemer, it will represent this particular feature in the required way.

We don't know how you could do experiments that would cause you to become confident in coup probes working. Suppose you have access to this potentially scheming model and want to know whether it represents the 'coup feature' in such a way that coup probes will generalize correctly. It's not clear what you can do. Unless you make substantial progress at understanding the internal structure of models, we expect you'll have to rely on your priors for how different features tend to be represented, which aren't going to be very confident.

Another source of difficulty is that if our models are scheming, they might attempt to sabotage experiments we do about how they generalize. For example, consider the following experiment:

A researcher wants to know whether training the model to be honest in some particular distribution generalizes to it being honest about nefarious actions. So they take the model and finetune it to take nefarious actions in some situations. Then, they train the model to be honest on some particular distribution. Now, they test whether the final AI is honest about its nefarious actions. If the model were scheming to begin with, it might intentionally try to generalize to being honest about the nefarious actions to deceive the researcher into thinking the technique works.

It might seem implausible for a model to have this level of fine-grained control over how it generalizes. But once you've conditioned on your model scheming, you've already assumed a high enough level of fine-grained control over its generalization that this type of sabotage seems plausible.

It seems much easier to become confident that a technique will work if the only property that the technique relies on is that the model can't do something. For example, a control technique might involve humans reading code that a model has written to determine whether it has any crucial vulnerabilities. This technique relies on the model not being able to write backdoors that the human reviewers don't notice. Evaluating whether the model is capable of writing such sneaky backdoors has some difficulties, but we think we'll be able to sufficiently reliably estimate capabilities related to subverting safety techniques. (This is primarily because successful exploration hacking seems difficult given that the AI has limited serial reasoning time relative to the red team; see the discussion earlier on limited serial reasoning time for more discussion.)

For now, we think the easiest way to build control techniques that avoid making incorrect assumptions about scheming models is to make conservative assumptions about their internals, by rejecting any strategy which can be evaded by a red-team-constructed model. We expect that this methodology forces you to rely solely on techniques that treat the model as a black box. See "Appendix C: the black box protocol assumption" in our paper for more on this.

Note that when trying to reduce sources of risk other than scheming, you don't have to worry about the models intentionally sabotaging the evals, and so it seems much easier to evaluate generalization- and internals-based techniques [AF(p) · GW(p)]. For this reason, we think that non-black-box techniques are probably more applicable to non-scheming threat models than to scheming threat models.

Because these black-box control evaluations^[27] can't assess whether internals or generalization-based techniques improve safety, there could be a gap between the control we're able to more confidently assess (measurable control) and the safety from scheming we have in practice (safety-in-practice) which includes both the control we have in practice and the possibility for avoiding scheming (alignment). However, we think that absent substantial advances in science, we're unlikely to develop approaches which substantially improve safety-in-practice beyond baseline methods (e.g., training with RLHF and applying coup probes) without the improvement being captured by black-box control evaluations.^[28]^[29] We might discuss and argue for this in more detail in a follow-up post.

Appendix: Concerns with "neuralese" reasoning and memories

One factor which is very important for the viability of control is the extent to which making AIs transformatively useful requires those AIs to have the affordance to do long serial chains of uninterpretable^[30] reasoning (e.g., recurrent "neuralese" state) or to have the affordance to communicate and share memories with other AI instances using uninterpretable messages (e.g., neuralese mind-melding). We think this concern is likely to be manageable and control can work even with a substantial amount of uninterpretable serial reasoning, but this seems like one of the main reasons early transformatively useful AIs could be uncontrollable.

If AIs have long serial chains of uninterpretable reasoning then it would be harder to:

ensure that AIs can't think of plans better than our human red team for control evaluations;
monitor for problematic behavior;
elicit and estimate capabilities to a standard competitive with how well scheming AIs perform when "actually trying". This is due to the lack of intermediate supervision on these long chains^[31]

Whenever AIs are allowed to communicate with other AI using uninterpretable messages, then avoiding coordination and collusion between those AIs is virtually impossible. So for control, we'll likely minimally need to avoid every AIs being able to communicate arbitrary uninterpretable messages with any other AI.

Right now, state-of-the-art AIs are LLMs with relatively weak forward passes and which can only accomplish hard tasks that require many serial steps (insofar as they can accomplish them at all) by reasoning in natural language using CoT. However, it's plausible that the amount of uninterpretable serial reasoning will substantially increase over time and further, that for AI to be transformatively useful, the effective serial depth of uninterpretable reasoning will have to increase substantially. It's not clear to us either way, though this is indeed a key question which affects the difficulty of various AI x-safety agendas. Even if transformatively useful AI is created just by scaling up LLMs, the amount of uninterpretable reasoning will still increase over time due to larger scale making the forward pass more powerful. (That said, the fraction of reasoning which occurs in the forward pass might not increase because approaches like chain of thought will get more powerful as well.)

Another concern which is closely related to “neuralese” is that AIs using natural language reasoning could use steganography to perform hidden reasoning or pass messages [LW · GW]; we hope to handle steganography with approaches like paraphrasing [LW · GW] (same paper as the immediately prior link). While natural language reasoning done by LLMs isn't fully interpretable or faithful [AF · GW], the key property we want from this natural language reasoning is just that the AI can't perform natural language reasoning about unrelated or unintended topics without this reasoning being apparent. More precisely, it can’t use natural language reasoning to reason considerably better about an unintended topic relative to what it can do in a forward pass. That is, the key property we want is to be able to rule out considerably-more-powerful-than-a-forward-pass reasoning on unintended topics (rather than necessarily understanding the exact reasoning process on the intended topic^[32]). We think this key property of natural language reasoning seems to hold for current models and should be possible to maintain using anti-steganography measures.

One approach for AI learning and memories which might be key is having AI construct training datasets (in relatively understandable modalities like text, images, or motor activations). Intuitively, this seems much safer than neuralese, though there still might be substantial risks due to the AI-constructed datasets being backdoored or containing steganographic communication. We think these concerns can probably be handled with control.

Returning back to more fundamentally uninterpretable reasoning, one way to operationalize the deep uninterpretable serial reasoning concern is the effective duration of serial reasoning prior to an "interpretable bottleneck" (where for LLMs, text outputs form a (somewhat) interpretable bottleneck). For instance, perhaps GPT-4's forward pass is similar to a few seconds of serial reasoning for a human (but using extremely good heuristics and broad knowledge). We think that for AIs to be transformatively useful, they might need serial uninterpretable reasoning which is the equivalent of a human thinking for several months, but they likely don't need much longer durations. Further, we think this uninterpretable state can likely be siloed within individual projects without huge loss of performance (neuralese mind-melding isn't needed or at least doesn't need to be global). And, probably many tasks can be delegated to instances with far less duration or state.

A reasonable intuition pump here is that if AIs were qualitatively as smart as humans, but ran faster and cheaper, then we probably aren’t forced to apply the AIs in ways which are vastly different than how humans work. Thus, massive amounts of mind melding and extremely long chains of uninterpretable serial reasoning aren’t required (as humans can get lots done without these affordances).^[33]

Part of our view here is that just having a synchronized copy of the weights and then training the AI online with normal SGD allows for capturing a reasonable fraction of the value from "sharing memories" and other types of learning while still (very likely) not allowing for arbitrary communication or arbitrary serial reasoning. Using AIs to construct training data as discussed earlier might also help to avoid the need for neuralese.

If sample efficiency was reasonably high, then we overall feel like control could be workable in the regime where for some tasks we need AIs to use huge (e.g. the equivalent of 3 months of human thinking) but not insane (e.g. the equivalent of 1,000 years of human thinking) amounts of uninterpretable serial reasoning. We think it’s plausible that transformatively useful AI will require using uninterpretable “neuralese” state or message passing, but this is by no means certain. Further, if transformatively useful AI occurs soon, it is somewhat less likely to be the case that it uses neuralese.

Appendix: Should AI-takeover-concerned people invest in control?

There are broadly two types of investments we might want to make in control:

If you're an AI lab, should you plan to ensure safety by implementing control techniques and evaluate safety with control evaluations? If you're an AI regulator or external advocate, should you push AI labs to do so?
If you're an AI safety technical researcher or an AI lab, should you allocate research effort to control techniques and control evaluations?

For investment (1), the main question is "how likely is it that control evaluations are an accurate way to measure risk due to scheming"?

For investment (2), the main question is the tractability of improving techniques and evaluations.

For both types of investment, one of our key supporting claims is that we don't know how to ensure that AIs aren't scheming. You might disagree with this because you believe either of the following:

Non-scheming is trivial to ensure: "AIs are extremely unlikely to scheme, even if AI developers don't try very hard. So, AI safety researchers should deprioritize risk from scheming models."
Non-scheming is quite likely doable to ensure: "AI developers will be able to ensure that their AIs aren't scheming, because they'll put substantial effort into it and, with this effort, they're quite likely to succeed."

If we don't know how to ensure that AIs aren't scheming, a good approach for measuring risk from scheming is to conduct control evaluations to measure P(doom | scheming) and then multiply this by our best estimate of P(scheming) (roughly speaking). Similarly, our main approach for mitigating risk from scheming will be via control techniques^[34].

However, even if the main approach for reducing and evaluating risk from scheming is AI control, this isn't sufficient to make researching AI control (investment 2) worthwhile. Our belief that researching control is worthwhile depends on another key claim: investment in AI control will substantially improve evaluation quality and/or result in techniques which substantially reduce risk from scheming AIs. You might disagree with this because you believe that for all important AI capability levels one of the following holds:

Control is too easy: "Simple and cheap interventions like human supervision, trusted monitoring, and computer security will suffice. Further, conducting accurate enough control evaluations will be easy. So researching control has substantially diminished marginal returns."
Control is infeasibly hard: "Even if labs try as hard as they realistically might, control interventions won't be able to substantially reduce risk from scheming and evaluations will be very inaccurate. So researchers should focus their efforts elsewhere."

	Control is too easy	Control requires a substantial but affordable amount of effort	Control is infeasibly hard
Non-scheming is trivial/doable to ensure
Ensuring non-scheming is hard		Control looks good

Above, we grouped together investment in improving evaluations and in control techniques. It's possible that improving control evaluations is very tractable, but that control techniques are either too easy or too hard at every important capability level. Or vice versa.^[35]

It's worth noting that research into control (investment 2) has a moderate level of transfer with some other hopes for improving AI safety like using better oversight to avoid reward hacking and conducting better dangerous capability evaluations to better inform decision-makers. We expect that a variety of research avenues will help to improve control and that control-focused research will have applications beyond reducing risk from scheming via control. But, having a reasonable subset of people focus specifically on control does seem valuable.

Beyond this, implementing the standard computer security aspects of control probably has a reasonable amount of overlap with implementing the internal control and security necessary to secure the model from employees (insider threat). However, there are aspects of control which look quite different than computer security and also likely parts of computer security for control which will require substantial additional implementation beyond what is needed to handle insider threat concerns.

We discussed the case for optimism about control and the need for control earlier in "Control is likely achievable for otherwise-dangerous models". We discuss difficulties and limitations with control in Section 5.1 in our paper.

Overall, we think that investing in control is worthwhile and this should be a high fraction of the portfolio for preventing takeover.

Appendix: More detail on avoiding qualitatively wildly superhuman skills in problematic domains

In this appendix, we'll argue in more detail that we won't run into issues related to qualitatively wildly superhuman skills in problematic domains.

First, we don't expect that early transformatively useful AIs will have qualitatively wildly superhuman capabilities across-the-board. That is, we don't expect they will have such capabilities which cover all or most of the important domains which would make an AI be qualitatively wildly superhuman in general. This is because we expect that models that were in general non-qualitatively-wildly-superhuman would suffice for being very useful. Given the current paradigm of training AIs, it will likely be possible to run a huge number of copies and to run these copies quite quickly [LW · GW]. So to be very useful, it probably suffices for AIs to be around as capable as top human scientists (but cheap and fast). Of course, being wildly superhuman in terms of some notion of general intelligence would likely result in the AI being qualitatively wildly superhuman across many or most important domains, but we think that transformatively useful AIs won't need to be wildly superhuman in general intelligence.^[36]

So it seems that the minimum-viable very useful AIs are unlikely to be qualitatively wildly superhuman across-the-board.

But what about cases where the AI isn't qualitatively wildly superhuman across-the-board, but happens to be qualitatively wildly superhuman at some dangerous skill? There are two main situations we worry about here:

You purposely trained your model to be qualitatively wildly superhuman at something closely related to something dangerous. (This often overlaps with the first bullet in the above list: knowing that the AI's outputs are inscrutable. But this isn't necessarily true.)
Your model became qualitatively wildly superhuman at something dangerous in the course of you trying to get it to be sufficiently useful at some other task.

We'll discuss these cases separately in the next two subsections.

Cases where we purposefully train for qualitatively wildly superhuman performance in a danger-adjacent domain

One example here is protein engineering. You really needed your general-purpose AI to also do some protein engineering, and you accomplished this by additionally training it on a whole lot of protein structure prediction and related data. And then you say "protein that makes cancerous cells turn green, safe, easy to synthesize, nobel prize, trending on proteinstation", and it directly emits some amino acid sequence without needing to run intermediate tests or needing externalized reasoning. In your initial tests, the protein sequences they suggest seem to work great, but you're worried that if you use the AI-designed proteins widely, something terrible will happen.

This case indeed seems unresolvable with control (without forcing the AI to do the task in a different and more legible way). This case was hard because the thing we wanted to do with the model involved it being qualitatively wildly superhuman at a task inextricably related to a dangerous capability.

If we had separately trained an LM agent and also AlphaFold 5, and then the LM agent used AlphaFold 5 like a human does, we wouldn't have had this problem. The problem was that we end-to-end trained this model such that it had access to inscrutable knowledge while also maybe being a schemer.

Another option we have here is to train two copies of your LM agent, one of which knows a lot about protein structure and the other of which doesn't, and then only use the protein structure model in cases where it's not going to have the opportunity to do anything dangerous with that knowledge.

However, it seems like the applications on Karnofsky’s list probably don't directly require models to have superhuman knowledge or abilities in dangerous areas.

Further, note that in this protein case, it was obvious to us that the model was qualitatively wildly superhuman. (Though it might not be obvious that a specific action produced by such an AI is inscrutable or has other problematic properties.) We expect that if we run into issues due to qualitatively wildly superhuman performance resulting from intentionally doing a bunch of training in a danger-adjacent domain, it is reasonably likely to at least be obvious that we might have issues.

Cases where we reach qualitatively wildly superhuman performance in a dangerous domain as a side effect of training for usefulness

As a trivial non-dangerous example, models are wildly superhuman at next-token prediction. We don't intrinsically care about next-token prediction; it just turns out that the easiest way to make models that do an adequate job of being chatbots is to train for next-token prediction. If superhuman next-token prediction skills were dangerous, we'd have a big problem.

Also, the current best LLMs are quite superhumanly knowledgeable, but this doesn't seem to result in wildly superhuman performance in any non-trivial domains (yet). We expect that the level of superhuman knowledgeability of future AIs probably doesn't suffice for wildly superhuman performance at any of the key problematic domains, but might result in wildly superhuman levels of performance in some less important domains.

For an example of when AIs might become qualitatively wildly superhuman at a problematic domain in the course of us trying to train them to be sufficiently useful, suppose the only sufficiently large pretraining corpus we have access to was chat logs between humans who are trying to extort and manipulate each other, from which the AI can learn to effectively model the world. If that's the only pretraining corpus you have, you might have a lot of trouble training a model to be very useful without also being really good at extortion.

In this case, you didn't want to make your model wildly superhuman at anything, but you got the wild superhumanness (at extortion) by default while trying to get it to human level at the thing you wanted. (In principle, unlearning techniques could resolve this issue, though we don't think any unlearning technique has been demonstrated to solve a problem like this.)

We think there are a relatively small number of core problematic domains (hacking, persuasion, and manipulating scientific results or conceptual arguments). Performance on realistic tasks within each of these domains is likely to be quite correlated with other tasks in that domain (in the absence of heavy specific training on that subset of the domain). Thus, it's not as though we need to ensure near the human capability profile on a huge number of uncorrelated domains.

Further, these problematic domains seem like the types of domains that would be at least somewhat correlated with general capabilities/intelligence as long as we don't train AIs on these specific tasks much more than we train them on the core tasks we care about. (It's important to caveat this argument such that it doesn't imply that AlphaGo isn't superhuman at Go!)

And, we probably won't need to use a training regime which massively disproportionately focuses on dangerous tasks or tasks in some problematic domain in order to make sufficiently capable AIs. If capabilities substantially come from predictive/generative pretraining, then it seems unlikely that this training data is relatively good at teaching probability abilities. If capabilities substantially come from outcomes based RL, then we can probably find an environment which isn't that much worse at teaching general competence while avoiding teaching dangerous abilities much more than bottlenecking useful abilities.

There are a few specific counterarguments we hear related to generative/predictive pretraining.

One argument is that by the time AIs trained like LLMs are good at science, they will be much better at persuasion than humans due to being trained on a huge amount of data of humans talking and interacting. This seems like a reason for concern, but humans are probably relatively more selected (by evolution) for resisting persuasion than for being good at modern science. So training might make LLMs relatively better at persuasion, but human genomes might have also been selected to some extent to be relatively better at resisting persuasion. Current state-of-the-art LLMs don’t seem particularly good at persuasion relative to science (compared to humans).

Another specific counterargument people give is that maybe programming models will instantly know excellent zero-day vulnerabilities from reading all the code in the world (e.g., the Linux kernel, our HTTP server implementations, etc). We are not persuaded that a model which is human-level at programming will automatically notice zero-days while being trained on code, and anyway it's probably affordable to write a new software stack with different zero-days and prevent most of our AIs from reading its source. (That said, rewriting the software stack doesn't stop AIs from understanding hardware vulnerabilities and noticing hardware zero days).

Our view is that the evidence from current state-of-the-art LLMs and other AIs is mostly reassuring. It seems as though LLMs (and other AIs in general) have an overall relative capability profile which isn't wildly different from that of the human capability profile on reasonably broad downstream applications (e.g., broad programming tasks, writing good essays, doing research). However, this is based on a small number of data points as AI just recently became capable of this level of performance.

We don't know of cases where LLMs are qualitatively wildly superhuman at non-trivial random tasks which they somehow picked up from pretraining (in the sense we defined via the list given in the Section “We can probably avoid our AIs having qualitatively wildly superhuman skills in problematic domains”, but it would be great to look more carefully. Some such tasks probably exist. There might be some non-trivial classes of images on which the current best image models are qualitatively wildly superhuman, perhaps photorealism in some cases?

Overall, we think it's pretty unlikely for models to acquire qualitatively wildly superhuman skills in problematic domains that would break control. But, this is one of our main uncertainties about the viability of control and thus we're pretty interested in further investigation and counterarguments.

Aside: Capabilities from imitating human cognition

It currently seems plausible (perhaps 30-40% probability overall, more in shorter timelines) that a large fraction of capabilities in early transformatively useful AI will come via the mechanism of imitating human cognitive processes, which naturally yields an ability distribution relatively close to humans^[37]. However, note even if a model gets its capabilities entirely from being trained on a prediction objective, it's still possible for most of its useful cognition to come from sources other than imitating human cognitive processes, see here [EA(p) · GW(p)] for more discussion.

AI control could also be applicable to handling advanced data poisoning attacks, non-scheming misalignment concerns, and API-enabled catastrophic misuse, but we expect there are easier ways to sufficiently handle these concerns. ↩︎
In this post, we'll mostly use "alignment" to refer to "models aren't scheming". This is narrower than how people usually use the word–avoiding scheming is usually considered necessary but not sufficient for alignment. ↩︎
Note that we are using the term control differently than some prior uses of the term. For instance, we're using the term differently than "AI is easy to control". ↩︎
As one operationalization, suppose that you had two teams at your AI lab, one of which was much more competent, and you had to assign one to be in charge of minimizing the probability that the AIs you're about to train and deploy were scheming, and the other to be in charge of ensuring control. To address risk due to powerful AI in the next 8 years, we think it would be better to assign the more competent team to ensuring control, because we think that the quality of control interventions is more influential on this risk. ↩︎
In practice, we'll care about the exact probability of safety rather than a binary safety threshold. We also might care about actually getting useful work even if our AI is scheming competently rather than just more directly ensuring safety. So, putting this all together, we care about E[utility|scheming]: how good outcomes are in expectation given that an AI is competently trying to cause bad outcomes. Separately, we also care about E[utility|not scheming], but we won't talk about measuring and assessing this here. ↩︎
You can imagine that we're running our evaluation after the model has been trained, but prior to using it. One reasonable objection here is that the very act of training a scheming model or running our evaluation will be dangerous. First, note that we think the evaluation can be made much safer than normal utilization (see here for further discussion). Then, we can overall handle concerns due to danger during evaluations and training by periodically running evaluations throughout training (e.g., every 4x effective training compute increase) and ensuring our evaluations are conservative enough to ensure safety for interval of training prior to the next evaluation (as discussed in more detail here). ↩︎
More generally, we'd like to have a probability distribution over all the different possible sets of models which might be scheming (e.g., P(all models more powerful than GPT-5 scheme), P(all models more powerful than GPT-6 scheme), etc.) and take this into account when assessing an overall estimate of safety. It's not clear how to get a principled guess at the probabilities here, but we can at least conservatively assume that all models which could plausibly be scheming are scheming and then estimate safety in this case. ↩︎
Rather than just training the model to accomplish the tasks, we could also use some other approach to estimate the ability of the model to execute the red team strategy. For example, you could elicit the model's capabilities on another task which you expect the model to be just as good at, but where capability elicitation is easier. Or you could use a scaling law to extrapolate its performance from smaller models. Or you could try to conservatively upper bound its capabilities using a supervised training set which is clearly easier to fit than actually accomplishing the dangerous task. Because of these options, in some cases, capability estimation is considerably easier than capability elicitation, so this distinction could be important. ↩︎
One concern with these evals is that they might actually make models dangerous or misaligned even if they wouldn't be dangerous by default, or more generally that the evaluation is dangerous to run. We think this is probably not a substantial issue, because we can often use benign tasks as proxies for dangerous tasks. That said, we'll still need to train the model to exhibit dangerous capabilities like evading monitoring. See here for further discussion. And remember that we're just training these models as part of the evaluations; these models won't actually be deployed. ↩︎
These strategies found by the red team will be relatively high level strategies. In particular, the strategy should be written with realistic knowledge about the situation and shouldn't be insanely long (e.g., perhaps we limit it to about 4k words), so the task of causing problems given a good plan for causing problems is pretty non-trivial. We could conservatively assume the same knowledge as an AI lab employee prior to the start of training and deployment, or try to make more precise restrictions on knowledge based on the AI's exact situation. Note that we do need to take into account the AI learning things over the course of deployment from training or other sources. The plans found by the red team still might be unrealistically good, so we might want to further limit the plans if we can get a better understanding of how good of plans the AI could plausibly produce; see "Limit red team plan quality" and "Limit red team plan knowledge" here for more discussion. ↩︎
Our goal will be to measure how capably the model would perform if it was scheming and "actually trying" to do a given action and thus did its best to elicit its own capabilities; we don't need to know how capable the AI could be if a large amount of resources were spent on finding better elicitation techniques (e.g., better training datasets and better scaffolding). Additionally, in many cases, we care about how capable the AI is with important restrictions on its affordances and reasoning due to control techniques (e.g., using a weaker trusted model to monitor its natural language reasoning) which might make the search space for elicitation smaller. Our evaluation target is unlike assessments for how problematic it would be to have a containment failure (e.g., model theft); assessments aimed at determining danger due to containment failure need to account for the possibility that improvements in elicitation over a few years result in dangerously capable AIs and also can't assume that there are any restrictions on the affordances of the AI. ↩︎
This relies on the assumption that the AIs are unable to secretly develop a body of knowledge of good takeover strategies. To ensure this, you probably need to use anti-steganography techniques [AF · GW] and avoid models having arbitrary access to uninterpretable memory. We think that some access to uninterpretable state can be acceptable so long as we avoid arbitrary accumulation by having "interpretable bottlenecks": via methods like siloing access to uninterpretable state (e.g., between different projects) and sometimes wiping state and forcing AIs to convert it into some more interpretable media such as a text database or a training set. See here for more detail. ↩︎
Some subset of AIs might have to reason about our countermeasures as part of their job (e.g., for assisting the red team on a future evaluation or helping to design future protocols). However, we can try to split up this reasoning into different parts and impose additional (expensive) countermeasures on these AIs. Also, the job of reasoning about countermeasures doesn't require many of the dangerous affordances like the ability to run ML experiments or add code to our codebase. ↩︎
There are also additional complications associated with what happens if we catch AIs red-handed [LW · GW]. Most importantly, after catching the AI, the AI lab should very likely shut down (at least temporarily) and study what happened in more detail. Even if we are forced to continue using AIs, there now exist a set of mitigations and tests which don't look like control [LW · GW]. So for control evaluations, we might not model what happens after catching the AI and instead make an educated guess about how good of an outcome it is to catch the AI. ↩︎
To keep the absolute level of risk quite low (e.g., 1%), there are a few possibilities. It might be that the chance of scheming is sufficiently low by default that overall risk can be kept moderately low via just control. (E.g. if the chance of scheming is 10% and control suffices to reduce risk conditional on scheming to 10%, then the overall probability of catastrophe due to scheming would be 1%.) Experiments that provide information about the likelihood of scheming could be quite useful in making informed decisions here. Alignment techniques which actively lower the risk of scheming could also combine with control to form two lines of defense. Even if we were certain our AIs were scheming, it might be possible to achieve an absolutely low level of safety by achieving a high level of measured safety according to very conservative control evaluations. We think it will be too hard to achieve a high level of measured safety according to very conservative control evaluations for transformatively useful AI, but passing very conservative control evaluations might be doable for earlier AIs which could still pose risk. ↩︎
If models don't scheme, then evaluating whether models have some other type of catastrophic misalignment issue is quite a bit easier, because these non-scheming models won't try to sabotage our evaluations. However, if models do scheme, then alignment evaluations become much harder, while capabilities evaluations like control evaluations might still be viable. As far as evaluating for the presence of non-scheming misalignment issues like reward hacking, we could try to measure reward hacking using held out validation either in our actual training setup or in analogous settings where measurement is easier. We could even test techniques to mitigate reward hacking. We're excited about this sort of direct measurement and analogous settings approach for non-scheming concerns, but we think these sorts of approaches are much less promising to address scheming. ↩︎
Some scalable oversight research has been done in a different adversarial setting, where e.g. you have two debaters who are each trying to trick the judge. This is different from the robustness property required here—for control, you need to know that your oversight scheme is acceptably reliable even when the two debaters are colluding to cause the judge to come to the wrong answer. ↩︎
In addition to being useful for control, better scalable oversight might also be slightly useful for reducing the probability of scheming^[19] and considerably useful for addressing non-scheming misalignment concerns. Also, some approaches for adversarial training/evaluation which help with control might be useful for alignment. Thus, research into control techniques and alignment techniques can have considerable overlap. ↩︎
Perhaps better feedback makes thinking about the training process less salient than this in turn reduces the probability of scheming? It might also be the case that better oversight allows for using training methods which make long term prediction less salient and thus make scheming less plausible from this route. We think these arguments are overall pretty weak. ↩︎
What's the precise limiting factor on black-box control? We think that inaccessible information issues are sufficient to imply that it's exceedingly unlikely that we can achieve black-box control for an arbitrarily powerful system (this may not be the only problem). Overall, we haven't fully ruled out the existence of an arbitrarily scalable black-box control strategy, but it seems exceedingly unlikely to exist. That said, it seems more possible (though still quite unlikely) that clever black-box control could scale to some generally superhuman systems without being arbitrarily scalable. However, this isn't our reason for thinking control is promising. ↩︎
Or possibly reducing the risk due to already escaped AIs in the wild. ↩︎
Note that we’re using “uncontrollable” as jargon which specifically refers to being too capable for control approaches to be viable. So, we could, e.g., have AIs which are safe because they are aligned despite being capable enough that they are uncontrollable. ↩︎
Recall that “uncontrollable” specifically refers to being too capable for control approaches to be viable; uncontrollable AIs could be safe for other reasons. ↩︎
If problematic reasoning is this obvious, then simple interventions like trusted monitoring would catch the issue. ↩︎
In some cases, a long process of AI cognition might eventually produce superhuman outputs (e.g. using search). In this type of case, we don't necessarily have issues due to the concerns from the above list as long as humans can acquire a good and properly contextualized understanding of each step of the process that is used to create such a superhuman output. (In particular, we need a sufficient understanding of all the places where substantial optimization pressure was applied. Optimization pressure will be applied by AIs in each step of the process, but there might also be optimization pressure applied in constructing the overall process; in that case, we need to understand the overall process. See here [LW · GW] for some discussion of why.) So, in the case of (e.g.) explicit Monte Carlo tree search, we won't necessarily have issues even if the overall tree search is superhuman as long as human teams can figure out valuations for nodes which are sufficiently accurate and distributions over nodes to search which are sufficiently good. Accordingly, we think that superhumanness just due to quantity (e.g. parallel search) or speed rather than quality can be manageable and we wouldn't consider it a case of one of the qualitatively wildly superhuman ability concerns we listed above. ↩︎
This part of the situation isn't really a control argument, but it might be pretty relevant regardless. ↩︎
By "black-box control evaluations" we just mean that the control evaluations are conservative about inductive biases and AI internal structure. Running the evaluations will require fine-tuning at a minimum. ↩︎
In particular, we think that techniques that involve constructing datasets of very bad actions and training a classifier (perhaps with some particular tweak to inductive biases like using a probe) to identify these bad actions might be predictably promising for reasons which aren't possible to assess with black-box control evaluations. The next most promising improvement which can't be assessed with black-box control evaluations and which doesn't require substantial advances (perhaps gradient-based adversarial attacks to catch AIs) seems notably less promising. ↩︎
There are plausibly some large scale modifications to AI training methods which reduce the probability of scheming by a decent amount while keeping the model capable enough to be at least a bit useful. Applying these modifications in the training of some or all models is a possible improvement to safety on top of baseline methods which isn't captured by black-box control evaluations. This would most naturally slot into the control evaluation methodology by effecting our estimates for P(scheming) for different models. We're uncertain about the importance of these sorts of intervention. ↩︎
By "uninterpretable" we just mean "as hard to interpret as the activations of current AI models". It isn't literally impossible to develop a serious understanding of these activations, but we think the current state of interpretability is quite far from more than shallow understanding. ↩︎
The key concern here is that training would have very low sample efficiency at utilizing these long serial chains of uninterpretable reasoning, at least for getting AIs to do tasks they don't want to do. An objection would then be: if training has very low sample efficiency, how did we even get an AI which can competently do long serial chains of reasoning when it wants to? It could be that we did a huge amount of RL long-horizon [LW · GW] training to make the AI competent at doing this reasoning which results in the AI being able to use it itself however it would like, but we can't quickly elicit it in cases where the AI doesn't want us to do so. Alternatively, it could be that with only a small or moderate amount of training, AIs can effectively utilize some scheme for uninterpretable serial reasoning due to generalization from other capabilities. If our capabilities are due to a relatively far generalization, then further training might have low sample efficiency (at least when the AI wants training to fail) while still allowing for the AI to be dangerously capable. We might discuss threat models related to AIs having hard-to-elicit capabilities in future posts. Overall, we think sample efficiency will likely be reasonably high for early transformatively useful AI (even when scheming AIs want to sandbag), but there are some serious concerns. ↩︎
Some agendas do aim to understand the exact natural language reasoning process rather than just ruling out unintended reasoning and this would indeed produce additional value. We’re just claiming here that this isn’t required for one key source of hope. ↩︎
It might be the case that we need to use AIs (or have the option of using AIs) which are qualitatively weaker than humans, but which have structural advantages in terms of mind melding, learning speed, or extremely long durations of serial reasoning. In this case, it’s less clear whether this makes control harder or easier: the AIs are qualitatively weaker, but leveraging relatively problematic structural advantages. Overall, if we could pick we’d probably prefer to use AIs which are somewhat qualitatively smarter, but which don’t depend on potentially problematic structural properties. ↩︎
We can also apply non-black-box approaches which aim to reduce P(doom | scheming) such as coup probes [LW · GW]. We don't currently know how to adequately evaluate these approaches. If in the future we were able to develop powerful non-black-box techniques and gain enough scientific understanding to confidently ensure safety, this would also eliminate the need for control. ↩︎
If we think that improving control techniques is worthwhile, then high-quality evaluations are very useful for the purpose of building better control techniques. But, it can still be the case that it's intractable to improve evaluations (though we think this is unlikely in practice, particularly for cases where control techniques are useful). ↩︎
What about the possibility that due to issues in capabilities elicitation, you aren't able to elicit human-level performance from any AI which isn't capable of wildly superhuman performance across most important domains? That is, in order to reach human-level performance in the domains we need, we must train such a powerful AI that if it was actually "trying" it would reach wildly superhuman performance in most important domains. We think this seems unlikely, because you're probably able to elicit human-level performance prior to this point by training on a high-quality dataset of human demonstrations combined with some amount of RL. ↩︎
Even if most of the capabilities come from imitating humans, it is still possible to achieve superhuman performance, but it makes it less likely that wildly superhuman hacking and persuasion abilities emerge before AIs are very useful. ↩︎

73 comments

Comments sorted by top scores.

comment by FeepingCreature · 2024-01-24T16:35:56.383Z · LW(p) · GW(p)

A bit offtopic, but #lesswrong has an IRC bot that posts LessWrong posts, and, well, the proposal ended up both more specific and a lot more radical. A link saying "The case for ensuring that powerful AIs are controlled by ryan_greenblatt"

Replies from: Seth Herd

↑ comment by Seth Herd · 2024-01-25T19:49:20.751Z · LW(p) · GW(p)

I personally think that all powerful AIs should be controlled by Ryan Greenblatt.

I don't know the guy, but he seems sane from reading just a little of his writing. Putting him in charge would run a small s-risk (bad outcomes if he turned out to have a negative sadism-empathy balance), but I think that's unlikely. It would avoid what I think are quite large risks arising from Molochian competition among AGIs and their human masters in an aligned but multipolar scenario.

So: Ryan Greenblatt for god-emperor!

Or whoever else, as long as they don't self-nominate. Those who seek power probably shouldn't have it.

This might be termed the Hitchhiker's Guide political system. They chose the true secret president of the galaxy at random, and the person always hated the job.

comment by Richard_Ngo (ricraz) · 2024-01-27T00:39:00.872Z · LW(p) · GW(p)

Copying over a response I wrote on Twitter to Emmett Shear, who argued that "it's just a bad way to solve the problem. An ever more powerful and sophisticated enemy? ... If the process continues you just lose eventually".

I think there are (at least) two strong reasons to like this approach:

1. It’s complementary with alignment.

2. It’s iterative and incremental. The frame where you need to just “solve” alignment is often counterproductive. When thinking about control you can focus on gradually ramping up from setups that would control human-level AGIs, to setups that would control slightly superhuman AGIs, to…

As one example of this: as you get increasingly powerful AGI you can use it to identify more and more vulnerabilities in your code. Eventually you’ll get a system that can write provably secure code. Ofc that’s still not a perfect guarantee, but if it happens before the level at which AGI gets really dangerous, that would be super helpful.

This is related to a more general criticism I have of the P(doom) framing: that it’s hard to optimize because it’s a nonlocal criterion. The effects of your actions will depend on how everyone responds to them, how they affect the deployment of the next generation of AIs, etc. An alternative framing I’ve been thinking about: the g(doom) framing. That is, as individuals we should each be trying to raise the general intelligence threshold at which bad things happen.

This is much more tractable to optimize! If I make my servers 10% more secure, then maybe an AGI needs to be 1% more intelligent in order to escape. If I make my alignment techniques 10% better, then maybe the AGI becomes misaligned 1% later in the training process.

You might say: “well, what happens after that”? But my point is that, as individuals, it’s counterproductive to each try to solve the whole problem ourselves. We need to make contributions that add up (across thousands of people) to decreasing P(doom), and I think approaches like AI control significantly increase g(doom) (the level of general intelligence at which you get doom), thereby buying more time for automated alignment, governance efforts, etc.

Replies from: aysja, Roko

↑ comment by aysja · 2024-01-29T18:32:34.193Z · LW(p) · GW(p)

The incremental approach bakes in a few assumptions, namely that there likely won't be any sharp left turns, that the work you do on smaller systems will apply to the next ones you build, and so on. I think the problem is that we don't know that this will hold, and that there's reason to suspect it won't. And if we do in fact live in a world where our systems undergo sudden and extreme shifts in intelligence, then the iterative approach is imo very risky.

I think “solving alignment” is not exactly the right phrase, but I suspect it’s closer to what we want, here, than “thousands of people incrementally contributing.” When fields are thoroughly confused, thousands of people contributing mostly just amounts to more confusion, especially when everyone is working within a broken paradigm. And often in this situation—when new fields are needed—substantial scientific progress happens because of one or a few individuals. I don’t expect a single person to solve the entire problem by themselves, of course, but I do think it’s plausible that one or a few people will develop a robust underlying theory that marks substantial progress towards “solving alignment.”

Replies from: ryan_greenblatt

↑ comment by ryan_greenblatt · 2024-01-29T19:06:08.250Z · LW(p) · GW(p)

Control doesn't assume "no sharp left turns" and "the work you do on smaller systems will apply to the next ones you build", though things along these lines do help with control.

(For the most obvious interventions, the work seems very likely to be applicable at multiple model scales in practice.)

This might depend on exactly what you mean by "sharp left turn", I think people often use the term to point at several different different possibly correlated issues, but I think most of them can in principle be handled with control.

The two assumptions we need that could be broken by something like "sharp left turn":

Some continuity in capabilities in a small local regime: We do need to assume that we can scale things up by a small amount which only makes the AI somewhat smarter in order to ensure we can run evaluations safely. Or we need some other way of getting slightly continuous capabilties (e.g. by varying how long a given AI runs). However, the actual quantitative amount can be quite small and we can adjust how we do this depending on our observations. See here for some discussion.
Sample efficient training for capabilities evaluations that is robust to intentional sandbagging: we need to be able to train our AIs to do tasks in a reasonable sample efficient way in a way that is robust to AIs wanting to sandbag on the task. If sharp left turn implies "very low sample efficiency because capabilities now come from a non-retargetable process", then this would cause problems. In principle, this is a relatively nice problem to work on because you just want to make a number go up and you just want the AI to do as well as if it was trying. We also have some flexibility here because capability evaluation is easier than elicitation (footnote 8) [LW · GW].

↑ comment by Roko · 2024-01-27T12:47:16.975Z · LW(p) · GW(p)

This is a great comment, especially about P(Doom) being somewhat counterproductive because it is nonlocal and I think this deserves some signal boosting

comment by Lukas Finnveden (Lanrian) · 2024-01-24T18:54:24.456Z · LW(p) · GW(p)

I like this direction and this write-up of it!

If sample efficiency was reasonably high, then we overall feel like control could be workable in the regime where for some tasks we need AIs to use huge (e.g. the equivalent of 3 months of human thinking) but not insane (e.g. the equivalent of 1,000 years of human thinking) amounts of uninterpretable serial reasoning.

Let's assume that an AI lab is in a fast-paced, rapidly changing state of affairs. (As you might expect during a moderately fast takeoff.) If an AI then gets to do the equivalent of 3 months of thinking, it could use that to do notably more serial thinking about how to exploit the current situation than any human has done (because some important features of the current situation is less than 3 months old: e.g. you had to switch to an importantly different programming language, or some different type of communication protocol between AIs, or change your AI safety research paradigm in a way you didn't expect). I'm curious how you'd ensure (or evaluate for) red-team competitiveness in this case.

Replies from: ryan_greenblatt

↑ comment by ryan_greenblatt · 2024-01-24T22:33:45.672Z · LW(p) · GW(p)

This is indeed a serious concern.

A key hope here is that with some amount of conservativeness our evaluation^[1] and our protocol can generalize to future setups. In particular, our countermeasures could be sufficiently generic that they would work in new situations. And, in our control evaluation, we could test a variety of situations which we might end up in the future (even if they aren't the current situation). So, even if the red team doesn't have enough time to think about the exact situation in which we end up in the future, we might be able to verify that the protocol can generalize to different contexts in the future.

Alteratively, we can at least try to define operating conditions under which the situation is safe without further evaluations or reasoning.

For further discussion on this topic, see here. (This link is to a section of a draft version of a doc which we've made public. We'll post a final version of this doc at some point.)

Overall, we don't think the interventions I've described here make the situation overall clearly fine, but they seem likely to suffice for substantial reductions in risk given a large effort. So, the situation is indeed considerably worse if AIs can do a huge amount of uninterpretable serial reasoning.

(And our overall view on control is also something like "we don't think control alone makes the situation overall clearly fine, but control seems likely to suffice for substantial reductions in risk given a large effort". And there are many good objections to control (like this one), and our response to these good objections is typically "we have some countermeasures to this issue, these countermeasures make the situation considerably better than you might have thought, but they don't fully eliminate the problem".)

The red teaming in particular. ↩︎

comment by Charbel-Raphaël (charbel-raphael-segerie) · 2024-01-25T23:42:26.765Z · LW(p) · GW(p)

I think this is very important, and this goes directly into the textbook from the future.

it seems to me that if we are careful enough and if we implement a lot of the things that are outlined in this agenda, we probably won't die. But we need to do it.

That is why safety culture is now the bottleneck: if we are careful, we are an order of magnitude safer than if we are not. And I believe this is the most important variable.

comment by habryka (habryka4) · 2024-01-25T19:03:52.773Z · LW(p) · GW(p)

Promoted to curated: I disagree with a bunch of the approach outlined in this post, but I nevertheless found this framing quite helpful for thinking about various AI X-risk related outcomes and plans. I also really appreciate the way this post is written, being overall both approachable while maintaining relatively much precision in talking about these issues.

Replies from: akash-wasil

↑ comment by Orpheus16 (akash-wasil) · 2024-01-26T10:48:32.835Z · LW(p) · GW(p)

I'd be curious to hear more about your main disagreements.

comment by joshc (joshua-clymer) · 2024-01-25T20:34:31.431Z · LW(p) · GW(p)

Thanks for writing this. I think it does a good job laying out the assumptions needed for control.

It seems like it could be challenging to justify that a red team is competitive -- especially if you are aiming to extract massive amounts of work from AI systems.

For example, if your AI system is like a human sw engineer but 30X faster, then it might come up with effectively superhuman plans just because it has lots of time to think. Externalized reasoning can help, but it's not very clear to me that externalized reasoning can scale to domains where AI systems are doing a bunch of research that humans (and GPT-4) can't readily disentangle and understand. I'm imagining that chain of thought will amount to thousands of pages of technical content that is written like rambly google docs. It might be hard to pick out the AI safety research from the cyberoffense planning in this case -- especially with dumb models like GPT-4.

It seems like to the extent you want to use AI systems for work that is hard to evaluate, control measures are difficult to trust. Maybe a key crux is whether AI systems will have to do work that is hard to evaluate to yield a 30x acceleration factor. I'd imagine this would apply for AI systems to achieve fundemental research advances -- e.g. in interpretability, formal verification, etc.

Curious about your thoughts.

Replies from: ryan_greenblatt, akash-wasil

↑ comment by ryan_greenblatt · 2024-01-27T03:57:07.168Z · LW(p) · GW(p)

Justifying control vs ensuring control

First, here's a general point which is relevant in responding to your comment.

There are different types of failures related to control:

We realize we can't reliably test and ensure control. This could be either: a. We realize that the necessary assumptions for control evaluations to work aren't met. In particular, we can't ensure the red team is sufficiently competitive and thus we can't trust control evaluations. b. We run control evaluations and we aren't able to develop sufficient countermeasures to ensure safety as measured by these evaluations.
We think the situation is fine based on control evaluations, but it's actually not and AIs succeed at egregiously bad actions.

Failure (1) is plausibly less bad than failure (2). And, (1.a) looks different than (1.b).

(See here for more discussion of different failure modes, but note that this is still an early version which is subject to change.)

So for failure (1) there is an important distinction between justifying/determining a red team is competitive (1.a) and ensuring that a red team is competitive (1.b).

We might be able to have reasonable-ish but somewhat approximate tests to get a rough sense of red team competitiveness and control evaluation viability more generally.

These tests are things like:

How much problematic planning can AIs do without this being detected in practice?
Is it likely that AIs can sandbag on our capability evaluations? More specifically, on which evaluations can they sandbag and to what extent?
Do AIs have qualitatively wildly superhuman abilities in some contexts to an extent that we should think that they likely have such abilities in other cases?

(A forthcoming post of ours will describe these tests in more detail. An early version of this post is here, but it doesn't describe these tests in sufficient detail yet and our thinking is still evolving.)

Some of these tests looks like standard control red teaming + capability evaluations (can the AI do problematic planning?) while some tests look more adhoc and might use things like model organisms, slightly related capability evalautions, and extrapolation based measurements (in particular testing for the ability to sandbag probably looks like this, we have some forthcoming work demonstrating this methodology).

Note that we expect that these "adhoc tests" are probably much less precise than the core capability evaluations of control. But this might be fine because in many cases (e.g. sandbagging), I expect we'll be able to have large safety margins.

Overall, I think we have some (IMO pretty reasonable) hopes for at least knowing that our red team won't realistically be competitive which will help substantially with avoiding failure (1). But, I can totally understand people being skeptical based on what we've legibly written up so far.

More specific responses.

Here's a summary of my understanding of your comment:

It could be difficult to justify (or to ensure?) that a red team is competitive. One specific case is that it could be hard to avoid AIs being able to come up with effectively superhuman plans due to total thinking time.
- This could apply even with heavily externalized reasoning: perhaps this externalized reasoning looks like a huge amount of overall words.
- Also, it's not clear that externalized reasoning is viable and thus might cause further issues. (I think this was something you were commenting on? I'm unsure?)
Hard to evalute work could break control (and maybe also break control evaluations?)

On avoiding effectively superhuman plans

As far as total thinking time resulting in superhuman plans which break red team competitiveness, let's first consider the case where there are absolutely no natural language bottlenecks (the AI isn't force to regularly bottleneck serial reasoning in natual language (externalized reasoning)). I address this in my earlier comment responding to Lukas [LW(p) · GW(p)]. Broadly, our hope will be to restrict effective time between natural language bottlenecks to a degree where maintaining competitiveness is doable.

(Also note that if SGD is decently sample efficient (which could be false!), AIs probably can't spend a high fraction of their cognitive resources in expectation thinking of plans for causing bad outcomes: this will be train out. They can potentially rarely spend a high fraction of resources and/or often spend a small/moderate fraction of resources.)

Now, suppose that we do have natural language bottlenecks. One concern you raised is that this will look like a huge amount of overall words of externalized reasoning and thus it will be hard to detect malign planning. We could hope to use GPT-4 (or our best trusted model), but this model might be too dumb. (Separately, I'm reasonably optimistic about finetuned version of GPT-4 level models being decent at detecting problematic reasoning. Mechanisms like untrusted monitoring could also help (discussed in our recent paper).)

Our hope for ruling out the AI often or always doing some reasoning about malign plans in externalized reasoning will be to randomly select traces of reasoning for detailed human review.

(I'm going to rehash some arguments we make in concentrated failures vs auditing failures [LW · GW], so readers should consider looking at that post for more context.)

Suppose we select 1% of reasoning "chunks" for review and our goal is to hit a 30x overall speed up on our researchers. Suppose each chunk is the value weighted equivalent of X hours of human researcher work. Then, the human reviewer needs to review it in less than ~3 X hours (assuming this is the vast majority of our human review budget).

One relevant concern here is that the AI will have way more context on the chunk of reasoning such that a human reviewer will have a very hard time understanding it. There are a variety of approaches to that could help mitigate this: make chunks bigger to amoritize the time the human spends obtaining context, have human specialize, force AIs to operate in a more legible fashion, and use AI assistance or other education/tools to speed up the human.

Note that this only rules out the AI often or always doing this malign natural language reasoning. The AI could still get away with rarely doing this (or doing this in a concentrated periord of time). (See concentrated failures vs auditing failures [LW · GW] for a more detailed description of concentrated failures.)

But, if the AI can only do this reasoning rarely, then it won't be able to think carefully about when it should be doing this reasoning. This makes it much harder for AIs to use complex strategies which involve specific opportunities or which involve coordination.

(We might worry about reasoning which is only done once but which is seen by a huge number of instances; our hope here will be to review anything which is seen by a huge number of instances. This should be ~viable as less than 1% of things will need to be seen by a huge number of instances.)

As part of our control evaluation, we would need to red team these countermeasures and try to verify that the AI can't spend sufficiently long planning that we're in trouble.

On hard to evaluate work

First, note that hard to evaluate work doesn't seem to break control evaluations. See Meta-level adversarial evaluation of oversight techniques might allow robust measurement of their adequacy [LW · GW] and the "Evaluating for the occurrence of proxy bad outcomes" section of a longer (draft) post on making safety cases with control for more discussion. (In the terminology discussed here, imagine testing out control on string theory and using incorrect string theory results as the proxy bad outcome.) I could try to explain this if there was sufficient interest.

Now, indeed hard to evaluate work can make control difficult. At least, it makes it difficult to ensure high levels of usefulness in the case where models are scheming, though maintaining direct safety might not be that that much harder.

I think one relevant decomposition is:

Can humans evaluate the work of adversarial humans in the key areas we care about? (Let's suppose that all written artifacts are preserved, e.g. the full trace of all of the work include all intermediate notes and code. We won't force externalized reasoning for our humans.) Note that this allows for interventions like forcing the work to appear legible etc. I can discuss a concrete operationalization for this overall experiment.
Will AI work be importantly harder to evaluate than the above human case?

Personally, I feel like work in key areas is reasonably doable to evaluate when I think through specific cases.

Replies from: joshua-clymer

↑ comment by joshc (joshua-clymer) · 2024-01-28T05:34:38.651Z · LW(p) · GW(p)

Thanks, I'll summarize my takeaways in the form of an argument that control is sufficient to safely accelerate alignment research by 30x. This is certainly not the only argument one could make, but I like it because it is clear and concrete.

I'll specifically focus on justifying that the red team is competitive.

If your AI systems can accelerate AI safety research by 30x, then they are producing effectively superhuman outputs; however, this does not imply they can produce superhuman evasion plans because AI systems can't spend as much of their thinking on how to evade control measures as they can on safety research.

Suppose that the AI systems have approximately human capabilities and are sped up by 30x.

Suppose also that the human red team has spent 2 months identifying evasion strategies (call this 'coup research' following your control paper)

I'll argue that the AI systems will not be able to do more than the equivalent of 2 months of coup research. Since our red team has lots of advantages over the AI systems, this is a conservative argument that the red team is competitive.

Suppose that we only want to justify that the red team is competitive for the next two months (and we'll continuously search for attack strategies in order to maintain this 1 to 1 ratio).

Two months of deployment is equivalent to 5 years of human thinking since our AI system is sped up by 30x. We need to show that no more than 1 / 30th of this thinking is coup research.

Training algorithms might pressure AI systems to be at least somewhat efficient about doing useful work (and coup research isn't useful for the task at hand). Perhaps this implies a 1 / N coup research budget.
Furthermore, suppose the AI systems' reasoning is externalized. Monitoring setups are very likely to catch coup research if it is present at a frequency greater than 1 / M.
min(1 / N, 1 / M) <= 1 / 30

I think I buy this argument or some variation of it.

↑ comment by Orpheus16 (akash-wasil) · 2024-01-26T10:47:15.352Z · LW(p) · GW(p)

Good point. I think it's helpful when people working schemes with the rough flavor of "we do X, and then X helps us get to a useful AI that does not takeover" try to specify roughly how capable they expect the "useful AI does not takeover" to be.

Would be curious to hear more about the kinds of tasks that Ryan and Buck expect the first "transformatively useful but still controllable" AI will be able to do (perhaps expressed in multiplier terms, like 30X human SW engineer, perhaps expressed in terms of the kinds of novel reasoning it can perform, and perhaps expressed in terms of the kinds of research tasks or policy tasks it would be able to meaningfully help with).

comment by Orpheus16 (akash-wasil) · 2024-01-25T12:19:55.079Z · LW(p) · GW(p)

I'm excited to see how the AI control research direction evolves.

After skimming the post, it seems like it makes a few different claims, some of which seem more intuitive than others. For example:

Claim 1: Labs should control powerful AI systems (very intuitive, even obvious)
Claim 2: It is (much?) easier to evaluate control than alignment (somewhat intuitive, particularly if one buys the argument that alignment evals are very hard because you always have the "what if the AI is just scheming and subverting the eval" concern.)
Claim 3: We (Buck and Ryan) have identified some specific techniques that we believe are promising for control (not intuitive; technical arguments needed).
Claim 4: If we succeed in controlling "transformatively useful AI", then we may be able to stop the race toward superintelligence and get AI labs or governments to agree to not build superintelligence until it can be controlled (not intuitive; non-technical arguments needed).

I'd be excited to see more posts that specifically engage with the strongest counterpoints to claims #2-4.

Some more on #2 & #4:

The basic problem with evaluating alignment is that no matter what behaviors you observe, you have to worry that your model is just acting that way in order to make you think that it is aligned.

I think those pessimistic about control evals could say something like "the basic problem with evaluating control is that no matter what techniques your red-team uses, you have to worry that your model is better at finding attacks than your red-team." Of course, you note in the post some reason why we should expect our red-team to have advantages over models, but also you recognize that this won't scale toward arbitrarily powerful AIs.

In some ways, this feels analogous to the following situation:

Alice from the Scaling Department: Look, our current "alignment" techniques are working for GPT-4. So let's scale up more! I get that at some point we won't be able to align the AI, but we're not there yet, so we should be able to keep going.
Bob from the Preparedness Team: Well, the problem is we don't know when our alignment techniques will stop working... like, it's quite plausible that the next AI system will be able to game our evals.
Alice from the Scaling Department: OK, well let me know when you have clear evidence that our alignment techniques or dangerous capability evals will fail.

Here's the analogy for control:

Alice from the Scaling Department: Look, our current control techniques are working for GPT-4. So let's scale up more! I get that at some point our control evals aren't going to be able to control arbitrarily powerful AI, but we're not there yet, so we should be able to keep going.
Bob from the Preparedness Team: Well, the problem is we don't know when our control evals will stop working... like, it's quite plausible that the next AI system will be better at coming up with attacks than our red-team.
Alice from the Scaling Department: OK, well let me know when you have clear evidence that our control evals will fail.

I'd be curious to hear more about how you're thinking about this (and apologies if some sections of the post already deal with this– feel free to quote them if I missed them in my initial skim). Specific questions:

How do you react to the analogy above? Do you expect that control evals will do a better job than other types of evals at determining when we have reached the point where scaling further is (directly/imminently) dangerous? Or more simply, will it be clear when we need to stop?
More broadly, what do you think we should do once we're starting to see AI systems that are powerful enough to subvert the control evals (or develop attacks that are better than those of the human red teamers)?
Suppose control evals do allow us to stop at "transformatively useful systems that do not cause a catastrophe." Roughly what kinds of capabilities do you expect this system to have? (My vague MIRI sim says something like "either the system is too weak to get us out of the critical risk window or the system is so strong that it's capable of subverting your control evals").

Replies from: ryan_greenblatt, ryan_greenblatt, ryan_greenblatt, ryan_greenblatt

↑ comment by ryan_greenblatt · 2024-01-27T05:00:46.671Z · LW(p) · GW(p)

Suppose control evals do allow us to stop at "transformatively useful systems that do not cause a catastrophe." Roughly what kinds of capabilities do you expect this system to have? (My vague MIRI sim says something like "either the system is too weak to get us out of the critical risk window or the system is so strong that it's capable of subverting your control evals").

We're thinking about systems which are capable of autonomously doing nearly all key research tasks. In particular, speeding up R&D by 30x probaly requires heavily autonomous AIs running experiments, thinking of ideas, writing explanations, and thinking about next steps.

I think ML research in particular can plausibly be accelerated by maybe 30x by only making it extremely fast and cheap to go from high level ideas to implemented experiments (rather than needing to generate these high level ideas)^[1], but this will still require the AIs to really understand the ideas and think about various tradeoffs, etc.

So, the overall composed system will be as broadly capable as a quite good human researcher, though in some cases, we might be heavily using specialization or reasoning patterns which aren't like typical human approaches. (E.g. using large amounts of search on relatively weak models to come up with ideas.)

We expect this system to be built out of AIs which run considerably faster than humans in effective serial cognitive speed (perhaps 5-100x serially faster than humans depending on the exact AI and the task) and which will also be run massively in parallel. (See here [LW · GW] for some more discussion.)

In some sense this is just an observation that implementing ML experiments correctly is hard. : ) ↩︎

Replies from: rohinmshah

↑ comment by Rohin Shah (rohinmshah) · 2024-01-27T09:35:24.478Z · LW(p) · GW(p)

I think ML research in particular can plausibly be accelerated by maybe 30x by only making it extremely fast and cheap to go from high level ideas to implemented experiments (rather than needing to generate these high level ideas)

Why doesn't compute become the bottleneck well before the 30x mark? It seems like the AIs have to be superhuman at something to overcome that bottleneck (rather than just making it fast and cheap to implement experiments). Indeed the AIs make the problem somewhat worse, since you have to spend compute to run the AIs.

Replies from: ryan_greenblatt

↑ comment by ryan_greenblatt · 2024-01-27T17:18:35.405Z · LW(p) · GW(p)

I guess I'm not that sold that compute will actually be that much of a key bottleneck for experiments in the future in a way that can't be overcome with 2x additional labor and/or a 2x slow down.

Like in many cases you can spend additional labor to reduce compute usage of experiments. (E.g., first run the experiments on smaller models.) And, we're conditioning on having really powerful AIs which correlates with a high baseline level of compute and that will help whenever we can run experiments at small scales and then use scaling laws etc. Further, the current scaling laws imply huge inference availablity if huge amounts of compute are used for training [LW · GW].

This might depend on what type of ML research we're talking about.

Replies from: rohinmshah

↑ comment by Rohin Shah (rohinmshah) · 2024-01-28T16:36:00.235Z · LW(p) · GW(p)

I agree it helps to run experiments at small scales first, but I'd be pretty surprised if that helped to the point of enabling a 30x speedup -- that means that the AI labor allows you get 30x improvement in compute needed beyond what would be done by default by humans (though the 30x can include e.g. improving utilization, it's not limited just to making individual experiments take less time).

I think the most plausible case for your position would be that the compute costs for ML research scale much less than quadratically with the size of the pretrained model, e.g. maybe (1) finetuning starts taking fewer data points as model size increases (sample efficiency improves with model capability), and so finetuning runs become a rounding error on compute, and (2) the vast majority of ML research progress involves nothing more expensive than finetuning runs. (Though in this world you have to wonder why we keep training bigger models instead of just investing solely in better finetuning the current biggest model.)

Another thing that occurred to me is that latency starts looking like another major bottleneck. Currently it seems feasible to make a paper's worth of progress in ~6 months. With a 30x speedup, you now have to do that in 6 days. At that scale, introducing additional latency via experiments at small scales is a huge cost.

(I'm assuming here that the ideas and overall workflow are still managed by human researchers, since your hypothetical said that the AIs are just going from high level ideas to implemented experiments. If you have fully automated AI researchers then they don't need to optimize latency as hard; they can instead get 30x speedup by having 30x as many researchers working but still producing a paper every 6 months.)

(Another possibility is that human ML researchers get really good at multi-tasking, and so e.g. they have 5 paper-equivalents at any given time, each of which takes 30 calendar days to complete. But I don't believe that (most) human ML researchers are that good at multitasking on research ideas, and there isn't that much time for them to learn.)

It also seems hard for the human researchers to have ideas good enough to turn into paper-equivalents every 6 days. Also hard for those researchers to keep on top of the literature well enough to be proposing stuff that actually makes progress rather than duplicating existing work they weren't aware of, even given AI tools that help with understanding the literature.

Further, the current scaling laws imply huge inference availablity if huge amounts of compute are used for training.

Tbc the fact that running your automated ML implementers takes compute was a side point; I'd be making the same claims even if running the AIs was magically free.

Though even at a billion token-equivalents per second it seems plausible to me that your automated ML experiment implementers end up being a significant fraction of that compute. It depends quite significantly on how capable a single forward pass is, e.g. can the AI just generate an entire human-level pull request autoregressively (i.e. producing each token of the PR one at a time, without going back to fix errors) vs does it do similar things as humans (write tests and code, test, debug, eventually submit) vs. does it do way more iteration and error correction than humans (in parallel to avoid crazy high latency), do we use best-of-N sampling or similar tricks to improve quality of generations, etc.

Replies from: ryan_greenblatt

↑ comment by ryan_greenblatt · 2024-01-28T17:18:47.853Z · LW(p) · GW(p)

Overall, this has updated me to some extent and it seems less plausible to me that ML research can achieve 30x speedups while having human researchers do all of the high level ideas. (I think the picture looks importantly better when AIs are removing this bottleneck.)

I think the most plausible case for your position would be that the compute costs for ML research scale much less than quadratically

The situation I was imagining is where most experiments use some combination of:

A relatively small amount of finetuning/inference on the biggest models (including for the actual AI researcher)
Larger (possibly full) training runs, but at much smaller scale (e.g. GPT-3 level performance models)

Then, we can in total afford ~training dataset sized amounts of finetuning/inference for the biggest models (by the inference availability argument). And GPT-3 performance experiments will be pretty cheap. So assuming our base model looks like GPT-6 with the expected compute requirement and model size, this is a huge amount of possible inference availability.

So, the implicit claim is that compute costs scale much less than quadratically. It's certainly not obvious ML research can be progressed fast enough with this little compute.

Another possibility is that human ML researchers get really good at multi-tasking

I was certainly imagining at least some amount of multi-tasking (e.g. 4 projects at once each of which runs 8x faster). This doesn't feel that crazy to me, I already do a moderate amount of multi-tasking.

Currently it seems feasible to make a paper's worth of progress in ~6 months.

Note that this often involves multiple people working on the same paper. In the AI case, the division of labor might look at least somewhat different. (Though I don't think this changes the picture very much from what you're describing because most people now aren't the "ideas" people.)

Replies from: rohinmshah

↑ comment by Rohin Shah (rohinmshah) · 2024-01-28T18:23:11.060Z · LW(p) · GW(p)

Cool, that all roughly makes sense to me :)

I was certainly imagining at least some amount of multi-tasking (e.g. 4 projects at once each of which runs 8x faster). This doesn't feel that crazy to me, I already do a moderate amount of multi-tasking.

Multi-tasking where you are responsible for the entire design of the project? (Designing the algorithm, choosing an experimental setting and associated metrics, knowing the related work, interpreting the results of the experiments, figuring out what the next experiment should be, ...)

Suppose today I gave you a device where you put in moderately detailed instructions for experiments, and the device returns the results^[1] with N minutes of latency and infinite throughput. Do you think you can spend 1 working day using this device to produce the same output as 4 copies of yourself working in parallel for a week (and continue to do that for months, after you've exhausted low-hanging fruit)?

... Having written this hypothetical out, I am finding it more plausible than before, at least for small enough N, though it still feels quite hard at e.g. N = 60.

^{^}
The experiments can't use too much compute. No solving the halting problem.

Replies from: ryan_greenblatt

↑ comment by ryan_greenblatt · 2024-01-28T18:38:13.636Z · LW(p) · GW(p)

Probably yes for realistic values of N?

Assuming the box is pretty smart at understanding instructions (and has an understanding of my typical ontology to the extent that you would get after working with me a few weeks and reading various posts) and the box will ask follow-up questions in cases where the instructions are unclear. (And we can do small diffs with reduced latency like asking the results to be plotted in a different way.)

My main concern is running out of ideas after a while despite copies of myself with more thinking time having more time to generate ideas.

Replies from: rohinmshah

↑ comment by Rohin Shah (rohinmshah) · 2024-01-28T18:44:56.995Z · LW(p) · GW(p)

Sounds reasonable, though idk what you think realistic values of N are (my wild guess with hardly any thought is 15 minutes - 1 day).

EDIT: Tbc in the 1 day case I'm imagining that most of the time goes towards running the experiment -- it's more a claim about what experiments we want to run. If we just talk about the time to write the code and launch the experiment I'm thinking of N in the range of 5 minutes to 1 hour.

↑ comment by ryan_greenblatt · 2024-01-25T18:02:23.699Z · LW(p) · GW(p)

I'll respond to this more later, but first I'd like to clarify a narrow point about when building superintelligence is acceptable.

Claim 4: If we succeed in controlling "transformatively useful AI", then we may be able to stop the race toward superintelligence and get AI labs or governments to agree to not build superintelligence until it can be controlled (not intuitive; non-technical arguments needed).

I think that purely black-box AI control is very unlikely to scale to superintelligence, but there are other plausible routes to safety (alignment, various approaches which make usage of the internals of models but which aren't well described as alignment).

So, the condition is more like "when we can be confident in the safety of superintelligence" and there are a variety of routes here. Separately, if we don't have an absolutely air tight argument for safety, we would ideally delay building superintelligence if it wouldn't be needed to solve pressing problems. (And indeed, we don't think that superintelligence to solve a wide range of problems.)

Replies from: None

↑ comment by [deleted] · 2024-01-25T20:28:06.744Z · LW(p) · GW(p)

What do you think about marginal superintelligences?

For example, take the task "implement an x86 CPU as gate masks". Humans can do this task, but no single human can do the task, and humans are forced to inefficiently subdivide the task. For example a "CPU" that did not have distinct internal buses but was some blob of gates, with the registers and cache right in the middle of the logic mess (or with cache lines descending from dies soldered above), would probably outperform all current designs.

This hypothetical mess of a chip design is not something humans can create but it's a checkable artifact.

Or another task, "construct a new human kidney from cells. All measurable parameters must meet or exceed a reference kidney". Similar argument - humans can't quite do this, the complexity of life support during construction is where humans would fail, or human made designs wouldn't quite work well enough.

But again this is a checkable artifact. You don't need superintelligence to validate that the output satisfies (or fails to satisfy) the goal.

A marginal superintelligence would be one that is context unaware and gets assigned tasks like this. It doesn't know when the task is real or not.

↑ comment by ryan_greenblatt · 2024-01-27T04:52:03.544Z · LW(p) · GW(p)

More broadly, what do you think we should do once we're starting to see AI systems that are powerful enough to subvert the control evals (or develop attacks that are better than those of the human red teamers)?

We think it's likely that prior to finding that evals are unworkable, we find that control techniques would fail to ensure reasonable levels of safety and usefulness. That is, in order to ensure safety for typical use cases, we have to impose heavy restrictions that heavily restrict usefulness.

See this comment [LW(p) · GW(p)] for more discussion of what could fail.

Regardless of what fails first, we can discuss what we should do.

First, you'd notify the key stake holders that control doesn't suffice for the next AIs. (And ideally produce legible demonstrations of this etc.)

If we've already reached a sufficient level of AI output that we're operating at pretty crazy and near maximal speeds (e.g. the Amdahl's law style argument here) [LW · GW], then we're in pretty good shape from the perspective of a control plan (from a technical perspective). We'd hope that in this position we can ensure some reasonable delay prior to being forced to build and use AIs in ways that break control. We might be forced to build more powerful AIs in this situation because of other actors building insanely powerful AIs that we might need to engage in active conflict against (and we can't just stop these AIs from being trained which might be way less risky) and it could also be that internal forces want to irrationally rush to making more powerful AIs.

In the case where we're not operating at sufficiently high output, the situation is considerably worse. If the AI lab is operating in a sensible way, then the hope would be that you do a careful tradeoff between likely exogenous risk and the risk that it seems like you would be incuring from various usages of the AI^[1]. So if exogenous risk isn't there, we'd just stop and work on developing better approaches. If we need to proceed without further progress, than we'd just try to make the best safety usefulness tradeoffs that we can afford.

Given reasonable guesses at P(scheming) for different AIs and more detailed views on: whether various AIs might scheme but not cooperate with each other and how bad various bad outcomes would be relative to the exogenous threats (e.g. how does an AI escaping from our lab compare in badness to various exogenous risks). ↩︎

↑ comment by ryan_greenblatt · 2024-01-27T04:21:55.028Z · LW(p) · GW(p)

I think those pessimistic about control evals could say something like "the basic problem with evaluating control is that no matter what techniques your red-team uses, you have to worry that your model is better at finding attacks than your red-team." Of course, you note in the post some reason why we should expect our red-team to have advantages over models, but also you recognize that this won't scale toward arbitrarily powerful AIs.

A key hope here is that we can use various test to get a general sense for whether AIs can importantly outcompete our red team. See this other comment [LW(p) · GW(p)] for more discussion.

So, here's how I would hope the actual situation with Alice and Bob goes:

Alice from the Scaling Department: Look, our current control techniques are working for GPT-5. So let's scale up more! I get that at some point our control techniques aren't going to be able to control arbitrarily powerful AI, but we're not there yet, so we should be able to keep going.

Bob from the Control Red Team: Indeed, our current control techniques are very likely to be working. Our protocol suffices for controlling current models on the currently approved applications assuming we remain vigilant and maintain the necessary invariants (our design basis).

Part of why we think control evaluations currently work for assessing safety is that we we performed a variety of capability evaluations which indicate that as long as we prevent steganography in some key places, AIs are pretty unlikely to think of better attack plans than we thought of as the red team. This involves looking at how good AIs are at generating strategies in a bunch of different contexts and then doing some analysis which allows for us to transfer this to an overall conservative guess about the AI's abilities to create good attack plans against our actual protocol (but we can't necessarily produce the AI's best guesses at attack plans, we're utilizing that capability evaluation is easier than elicitation as discussed in footnote 8 [LW · GW]). We also have rough and conservative tests for the extent to which AIs can sandbag on our current evaluations and that seems to be fine for now.

From our current rough extrapolations, the first place where we'll start running into issues with control as we build more powerful models is that we'll need to start imposing heavy restrictions on AI affordances in domains XYZ which will really cut into the usefulness (relative to completely YOLO-ing it). The blue team is hoping that they can finish developing some countermeasures that will handle this problem, though it's unclear when this will be ready. After that it looks like exploration hacking concerns will force us to be much more conservative in the evaluation in a few places which will probably put a considerable dent in usefulness in a few other domains. Shortly after that, building more powerful models probably won't be worth it because the costs associated with training and sufficient containment will overwhelm the marginal improvements in usefuless we can get (given our current safety targets); the AI will only be usable in extremely restricted domains like submitting formally verified code or doing narrow mathematics. Of course, our larger scale extrapolations are crude and we'll have a much better sense of the overall situation once we run our tests on the next few generations of model.

Alice from the Scaling Department: OK, well let me know when these evaluations start indicating that training further models isn't worth it.

(Ideal) Alice from the Scaling Department: We're already starting to pivot into making easier to control AI architectures more scalable and performant (easier to restrict and more bottlenecked on natural language reasoning). We're also working on making inference cheaper at a fixed level of capability so that in some cases we can run dumber models for longer rather than using smarter models. This seems like it will be safer based on our current control evaluations. Of course, we're worried about the potential overhang due to this algorithmic progress, and we're pretty confident that competent adversaries couldn't steal it.

comment by RogerDearnaley (roger-d-1) · 2024-01-26T06:25:39.875Z · LW(p) · GW(p)

Additionally, it seems as though LLMs (and other AIs in general) have an overall relative capability profile which isn't wildly different from that of the human capability profile on reasonably broad downstream applications (e.g. broad programming tasks, writing good essays, doing research).

Current LLMs are generally most superhuman in breadth of knowledge: for example, almost any LLM will be fluent in every high-resource language on the planet, and near-fluent in most medium-resource languages on the planet, unless its training set was carefully filtered to ensure it is not. Those individual language skills are common amoung humans, but the combination of all of them is somewhere between extremely rare to unheard of. Similarly, LLMs will generally have access to local domain knowledge and trivia facts related to every city on the planet — individually common skills where again the combination of all of them is basically unheard of. That combination can have security consequences, such as for figuring out PII from someone's public postings. Similarly, neither writing technical manuals not writing haiku are particularly rare skills, but the combined ability to write technical manuals in haiku is rare for humans but pretty basic stuff for an LLM. So you should anticipate that LLMs are likely to be routinely superhuman in the breadth of their skillset, including having odd, nonsensical-to-a-human combinations of them. The question then becomes, is there any way an LLM could evade your control measures by using some odd, very unlikely-looking combination of multiple skills. I don't see any obvious reason to expect the answer to this to be "yes" very often, but I do think it is an example of the sort of thing we should be thinking about, and the sort of negative it's hard to completely prove other than by trial and error.

Replies from: ryan_greenblatt

↑ comment by ryan_greenblatt · 2024-01-26T06:57:36.962Z · LW(p) · GW(p)

I basically agree (and we discuss this a bit in the appendix section "Cases where we reach qualitatively wildly superhuman performance in a dangerous domain as a side effect of training for usefulness").

However it seems worth noting that it might be relatively easy for humans to model this in an ok-ish manner by having access to the Internet, multiple people working together, and using gpt-4 in some cases and then thinking about the possible gaps. Then, if they think of plans which can utilize very broad thinking, they could try train train the model to execute on these sorts of plans.

One reasonable objection here is that this team of people might be relatively less good at thinking about how to use broad knowledge and skills compared to the model (even if the team can in principle emulate the range of skills). If this was true, then the model could think of approaches which the human team wouldn't think of (which would cause issues unless we could somehow elicit this with RL). But, it's somewhat unclear if this would be true: while the model is trained such that it possesses very broad knowledge and skills, it's not clear that by default training will result in it learning to utilize these abilities much better than a human team. For instance, current gpt-4 is quite bad at exploiting it's unique advantages. So, especially if we don't need models to be too good at utilizing their superhuman breadth, we might be able to avoid running into issues here.

Replies from: roger-d-1

↑ comment by RogerDearnaley (roger-d-1) · 2024-01-26T19:12:15.385Z · LW(p) · GW(p)

Agreed. I don't think this is a fatal objection, just an example of the sort of careful thinking that would be required in order to exert significant control over an LLM that is not generally superhuman, but does have a skill profile that is far broader than any individual human: more comparable to a very large team of humans from which you can draw appropriate specialists or they can brainstorm together.

Replies from: ryan_greenblatt

↑ comment by ryan_greenblatt · 2024-01-26T20:23:41.644Z · LW(p) · GW(p)

I strongly agree that a bunch of careful thinking and red teaming will be likely required for control to work on early transformatively useful AI.

comment by ryan_greenblatt · 2024-01-27T20:43:31.948Z · LW(p) · GW(p)

We now think that this post underemphasized the extent to which we think that control should focus on the lead-up to the first time an AI gets caught. So I added this paragraph to the introduction:

We think the control line-of-defense should mostly focus on the lead-up to the first time an AI gets caught trying to cause bad outcomes. So we'd aim to ensure the property "if an AI is scheming, it won't successfully cause a catastrophe before we catch it". That is, it's acceptable if either no catastrophic outcomes occur (but the AI never gets caught), or the AI gets caught. So when evaluating control, we should count catching an AI red-handed as a win condition. Our core argument for this is that once the AI is caught the situation changes substantially; most importantly, it's now clear that there are serious misalignment problems and thus shutting down until we have credible solutions is far more likely. We discuss these arguments in more detail in our recent post Catching AIs red-handed [LW · GW].

I also made a few other tiny edits to make this emphasis more clear.

Replies from: Buck

↑ comment by Buck · 2025-03-14T00:24:44.773Z · LW(p) · GW(p)

I am not sure I agree with this change at this point. How do you feel now?

Replies from: ryan_greenblatt

↑ comment by ryan_greenblatt · 2025-03-14T00:35:15.538Z · LW(p) · GW(p)

I think I disagree some with this change. Now I'd say something like "We think the control line-of-defense should mostly focus on the time before we have enough evidence to relatively clearly demonstrate the AI is consistently acting egregiously badly. However, the regime where we deploy models despite having pretty strong empirical evidence that that model is scheming (from the perspective of people like us), is not out of scope."

comment by Joe Collman (Joe_Collman) · 2024-01-26T07:27:21.919Z · LW(p) · GW(p)

Great post (I've not yet read it thoroughly, or thought for long).

The first concern that springs to mind:

I expect a strong correlation between [humans are bad at [subversive strategy x]], [humans don't tend to think of [subversive strategy x]], and [humans don't tend to notice [subversive strategy x]].
- My worry is more that we miss strategies that we're bad at, than strategies the AI is extremely good at. It seems plausible for us to be entirely unaware of the existence of some subversive strategies - since they've never been human skills, and so don't appear on our map.
- It seems to me that the near-term danger is better described as [subversive strategies that don't occur to humans], rather than [superhuman subversive strategies]. The latter will put people in mind of [strategies humans use, only better] - genius hacking/persuasion etc. I also want people considering [strategies that don't occur to humans at all].
On a minor-but-related note, I'm not too keen on writing/thinking in terms of "problematic domains" rather than "problematic skills" / "problematic strategies". There's no necessity for a subversive strategy to map nicely into something I have a label for - e.g. "hacking", "persuasion".
- If we need to rule out all problematic strategies, I want to avoid language/thinking that may stop us considering problems outside our usual categories.
- (I don't think you're saying anything substantively imprecise - here I'm only concerned about language and blind-spots)

One last thing that caught my eye:

...but humans are probably relatively more selected (by evolution) for resisting persuasion than for being good at modern science...

This seems a dangerous assumption, since humans have been selected to resist persuasion when resistance helped pass on their genes, which is very much not always. E.g. being persuaded of x when it's likely that the rest of your tribe will be persuaded of x may well be helpful-to-your-genes, regardless of the truth of x or of the validity of the arguments. Humans were selected to believe useful fictions.

I note also that there's a big difference between [human x has the skills and understanding necessary to resist being persuaded of y] and [human x will in fact resist being persuaded of y].

Replies from: ryan_greenblatt

↑ comment by ryan_greenblatt · 2024-01-27T20:56:19.880Z · LW(p) · GW(p)

My worry is more that we miss strategies that we're bad at, than strategies the AI is extremely good at

A decent amount of my hope here is that we can at least get a sense for whether the best AI strategies in some reasonably similar cases look extremely bizarre and were totally unpredictable to humans. E.g., we might do experiments where we do huge amounts of RL in a variety of different environments that feel somewhat analogous and then see if this finds bizarre strategies.

Separately, I'm somewhat skeptical of the best AI strategies being totally different from what humans can think of though I can see where this concern is coming from.

See here for more discussion.

On a minor-but-related note, I'm not too keen on writing/thinking in terms of "problematic domains" rather than "problematic skills" / "problematic strategies".

Yep, we indeed need to worry about any qualitatively wildly superhuman and problematic strategy/skill. We mean "domain" in a somewhat broad sense.

Part of our argument for avoiding qualitatively wildly superhuman skills that are problematic does correspond to there being only a small number of domains and where skills within those domains are reasonably correlated (in the absense of agressive domain specific training). So, using the term domain is pretty natural from this perspective. But, it seems like you might not buy this perspective in which case it seems like an unnatural frame.

This seems a dangerous assumption, since humans have been selected to resist persuasion when resistance helped pass on their genes, which is very much not always.

Agreed, but worth noting this isn't really an assumption, I'm just making a rough argument which seems moderately compelling to me. (And FWIW it still seems moderately compelling despite this objection.)

Replies from: Joe_Collman

↑ comment by Joe Collman (Joe_Collman) · 2024-01-28T07:13:15.744Z · LW(p) · GW(p)

Thanks for the link.

I find all of this plausible. However, I start to worry when we need to rely on "for all" assumptions based on intuition. (also, I worry in large part because domains are a natural way to think here - it's when things feel natural that we forget we're making assumptions)

I can buy that [most skills in a domain correlate quite closely] and that [most problematic skills/strategies exist in a small number of domains]. The 'all' versions are much less clear.

comment by Nathan Helm-Burger (nathan-helm-burger) · 2024-01-24T19:37:16.585Z · LW(p) · GW(p)

I think this is an important piece of the strategic solution. The lack of control options for open-weight models is a big part of why I think there is a much lower capability threshold for catastrophic danger from an open-weight model.

Replies from: ryan_greenblatt, None

↑ comment by ryan_greenblatt · 2024-01-28T17:54:59.377Z · LW(p) · GW(p)

I basically agree with you about transformatively powerful open-weight models making the situation much more dangerous such that control is unlikely to suffice.

But, I think there are important control related options for open-weight models which do improve the situation to some extent.

In the open-weight regime, we realistically won't be able to rule out AIs doing a small to moderate number of arbitrarily bad actions. (Either based on the aims of the humans running the AI or the aims of a scheming AI itself.) So, if the offense-defense balance is such that this is catastrophic, we're absolutely in trouble. (E.g., if transformatively powerful AIs can pretty easily create catastrophically dangerous bioweapons without using much compute or needing very skilled human labor we're in big trouble unless we can somehow avoid this by restricting access to labs or something.)

But, we can still ensure:

If most compute is controlled by smaller number of reasonable actors and these actors implement control (or other sufficient strategies), then due to compute limitations, AIs won't be able to do an "arbitrary" number of bad actions.
- Compute limitations might suffice for preventing rogue AIs or rogue humans from rapidly causing an intelligence explosion even if it would otherwise be possible. Thus, we might be able to delay substantially and use this time to improve the situation.
- It might also suffice for avoiding the worst parts of AI enabled cyber attacks which might be heavily scale dependent and not overally offense dominant.
We can control the AIs used by this smaller number of actors which could allow them to be used in otherwise dangerous applications.
We can try to distribute easy-to-use control approaches widely and possibly we can pass legal restrictions related to these approaches.
We can try to aggregate information to figure out if the AI is scheming and if we learn that it is, we can try to take some countermeasures. TBC, even if the AI isn't scheming, it's plausible that offense defense issues (given the lack of time for adaptation) suffice to make this situation unrecoverable.

Overall, my guess is that if the AI is open source, scheming, and transformatively powerful (using the sort of power threshold defined in this post) and our overall alignment related technology looks like it does today, the probability of human loss of control or >1 billion dead is about 80%.

If we have a magic solution to misalignment which makes it easy for anyone to avoid egregious misalignment from their perspective and which scales indefinitely (but such an AI is still open source), then my probability on the bad outcome condition above is more like 20%.

My probabilities are somewhere in-between depending on the exact amount of technical solution and the default level of misalignment. (And various responses from governments also affect this etc.)

Replies from: nathan-helm-burger

↑ comment by Nathan Helm-Burger (nathan-helm-burger) · 2024-01-29T04:15:42.903Z · LW(p) · GW(p)

I agree with pretty much all of this and appreciate your clear framing of the issues at hand. It seems like where our concerns differ is around these two issues: I believe that the offense-defense balance for AI-enabled biorisk is such that a bad actor with open-weights fine-tuned model could kill billions of people with less than 100k. I don't think that the fine-tuning or inference would require more than a single server with 8x GPUs (potentially even just 8x 4090s). So unless the compute regulations are monitoring individual 4090 GPUs, then you aren't blocking inference or fine-tuning. Training takes a bunch of servers (often millions or billions of dollars worth of hardware), and thus seems more plausible to monitor.

I haven't heard any concrete prposals for compute monitoring at the level of 8x 4090 GPUs, have you?

Replies from: ryan_greenblatt

↑ comment by ryan_greenblatt · 2024-01-29T17:37:16.764Z · LW(p) · GW(p)

I think compute monitoring for literally 8x 4090 is very likely to be way too hard against a reasonably commited adversary. (The hardware is already way too easy to access and broadly distributed. Also, I think if this causes problems, you might also have issues with people's mac books which is super costly.)

My overall guess is that for realistically sized models (0.1-100 trillion parameters), we can't prevent doing a small amount of finetuning and inference against an ML competent adversary.

But, maybe there is some hope for models which are more like 100 trillion parameters? (Minimally, I think 8x 4090 isn't going to work well for models of this size.)

Replies from: nathan-helm-burger

↑ comment by Nathan Helm-Burger (nathan-helm-burger) · 2024-01-29T17:47:22.872Z · LW(p) · GW(p)

Yeah, the question of where the threshold of dangerous capabilities is seems very important to whether we can hope to make compute governance a part of restricting bad actors from using fine-tuning & inference to do bad things.

The reason this is important to try to predict is because we really want to not release powerfully dangerous open-source models which bad actors could use for bad things. Once those models have been released, there's no taking them back. So if there's a risk that a given model would be dangerous if released openly so that bad actors could have a private copy to fine-tune, then the intervention point is to disallow the release of the model.

↑ comment by [deleted] · 2024-01-25T02:52:22.079Z · LW(p) · GW(p)

I have been interpreting control to mean "when you use the model for some task, use strategies that will cause the final output to be correct even if you suspect the model may "want" to sabotage it". Alignment would be making the model "want" to help.

In real life we can "control" prisoners, we know most likely "want" to escape, but as prisoners don't have the tools to defeat the various defenses this usually doesn't happen. Also, the worst case - the prisoner escapes and commits several murders before being stopped - is an acceptable loss to a civilization so long as pEscape is low.

Suppose an escaped prisoner wants to work for you. Can you "control" them and get useful work out of them? Yes. It depends on the task and it depends on what tools you have available, but this is generally possible.

For AI specifically this strategy has been dismissed, I think because many of the people doing the dismissing were imagining somewhat generic future AI and not considering the specific details. For example one of the strongest methods for control is to carefully filter inputs, ideally removing all unnecessary bits - what Ryan talks about when he write "paraphrase " but I bet we can do much better than that - so the model doesn't know that now is the time to betray. And erase the models memory after a task, and keep each task as short as possible. (A few hours of wall time total, model doesn't think between inputs)

This specific method is not something you can do with humans, and if you imagine AGI are like a human brain but smarter, then this method won't be considered.

Theoretically if most tasks that require the model to have authority are using "controlled" models, then it's acceptable if open models exist, so long as they are unable to send messages that will be perceived and negotiate with other AI under human "control".

If all the weapons and infrastructure etc are in the hands of humans directly or models they "control", open models will be limited to whatever resources whoever runs the AGI is stupid enough to give them. (Running potentially sabotaged AI downloaded from a torrent sounds like a Darwin award)

Replies from: nathan-helm-burger

↑ comment by Nathan Helm-Burger (nathan-helm-burger) · 2024-01-25T17:04:46.802Z · LW(p) · GW(p)

Darwin awards do happen. Lots of different humans means some are dumb sometimes.

Replies from: None

↑ comment by [deleted] · 2024-01-25T18:27:30.124Z · LW(p) · GW(p)

Absolutely. Main thing is one guy in his garage who hooks up an open source agi to some hand built robots that start assembling more of themselves is a totally different scale of threat than a multi square kilometer...or cubic kilometer....optimized factory that is making weapons.

Billions of times scale difference. And this points to a way to control AI - so long as the humans have a very large resource advantage, ideally billions of times, that they achieved with the help of "controlled" AI models, the humans can defend themselves against superintelligence.

But you won't get a billion fold advantage with any AI pauses - you need a crash program to accelerate developing ai. You won't get there without ai either, we don't have enough living humans and besides, these large factories won't have any concessions to humans. No hand rails, probably no breathable gas inside, etc.

comment by scasper · 2024-01-26T00:31:43.418Z · LW(p) · GW(p)

I have been thinking a lot lately about evals and what differences black- vs. white-box access makes for evaluators. I was confused about the appendix on black-box evals. My best attempt to make sense of it hinges on two instances of the authors intending something different than what was written.

First, the appendix hinges somewhat on this point.

For now, we think the easiest way to build control techniques that avoid making incorrect assumptions about scheming models is to make conservative assumptions about their internals, by rejecting any strategy which can be evaded by a red-team-constructed model.

I agree with this, but saying that "control evals should be black-box" is a non sequitur. This is making the point that bad reverse-engineering of the model is bad and seems to assume that good reverse-engineering is hard/unlikely implicitly.

Second, even if good reverse-engineering is hard/unlikely, reverse-engineering is not the definition of a white-box eval. There are a lot of things that you can do with white-box access for evals other than interpretability-based things: gradient-based attacks, gradient-guided attacks, latent-space attacks, and finetuning -- and all of these things actually seem plausibly useful for control evals. Plus I don't see why interp should be off limits for evaluators who know the limitations of the tools they are using.

Ultimately, white-box attacks offer strictly more options for any type of eval compared to black-box ones. For any property that an AI system might have, if there are two teams of competent people trying to verify if the system has that property, all else equal, I'd bet on the success of the team that has more access.

So if I understand things correctly, and if the intended point of this appendix is that "Interpretability-based evals for control seem unreliable and possibly could lead to pitfalls," I would recommend just saying that more directly.

Replies from: ryan_greenblatt

↑ comment by ryan_greenblatt · 2024-01-26T01:00:34.180Z · LW(p) · GW(p)

Our claim is that when arguing that models are controlled, we should (with the current state of technology) probably not depend on arguments for safety which depend on properties of the internals of models other than the basic property "SGD is reasonably sample efficient".

(For a more precise notion of the sample efficiency point, if you train the model on labeled data, the model will be forced to perform near maximally well at fitting this data in a reasonably sample efficient way. This is due to assuming the model isn't capable of gradient hacking. This argument also implies some claims about what properties of RL you can depend on. We might also be able to evaluate whether models are capable of gradient hacking or look at concerns which are somewhere between gradient and exploration hacking.)

Another sense in which control evalutions aren't exactly "black-box" is that these evaluations will use finetuning: the red team will finetuning the model to see if it is capable of exhibiting the problematic behavior. (And our actual protocol that we're evaluating will indeed use finetuning, it's just that we can't depend on any properties of this finetuning other than sample efficiency property mentioned above.)

You might find it illuminating to read Appendix C of our recent paper.

So overall, it does seem like the title of that appendix could be misleading/confusing. Here's an alterative description (which is a bit too long to be a title) "With the current state of technology, control evaluations should conservatively avoid depending on properties of the internals of models (e.g. interpretability) or generalization properties other than sample efficiency of SGD". I'll try to think of a less confusing and short title (and edit the section to be less confusing accordingly).

Edit: I'll also generally try to be a bit more careful about the claims we make in this section. Like we're not claiming that current internals/generalization-based methods provide no evidence for safety given an absence of serious evidence for danger, just that the evidence is pretty limited on top of various baselines for a few reasons.

Edit Edit: I've changed the title of the corresponding section to "Why control evaluations are conservative about inductive biases and AI internal structure" and added a few footnotes trying to clarify what we mean by "black-box evaluation".

Responding more specifically to what you're saying:

So if I understand things correctly, and if the intended point of this appendix is that "Interpretability-based evals for control seem unreliable and possibly could lead to pitfalls,"

Yeah, this is basically the right read.

Second, even if good reverse-engineering is hard/unlikely, reverse-engineering is not the definition of a white-box eval. There are a lot of things that you can do with white-box access for evals other than interpretability-based things: gradient-based attacks, gradient-guided attacks, latent-space attacks, and finetuning -- and all of these things actually seem plausibly useful for control evals.

We're certainly not only talking about reverse engineering in that section, we're talking about any potential assumption on internal structure (e.g. the sort of assumptions that gradient based attacks would depend upon to do anything).

Separately, we agree that you should of course try to use gradient based attacks and similar methods to improve safety and get evidence for misalignment. We're just arguing that the absence of evidence for misalignment from gradient based attacks doesn't constitute much evidence for safety.

Like consider the following situation:

We run gradient based attacks on our current model and this doesn't turn up clear evidence for egregious misalignment.
When a red team produces a model naively for control evaluations, sometimes gradient based attacks work to find cases which cause that model to behave poorly. (In ways which don't apply to the original model that the red team finetuned)
However, if the red team does some random manipulations of the internals, graident based attacks totally fail.

Our proposal for how you handle this situation is "don't consider the absence of evidence from gradient based attacks to be much evidence for safety". (Separately I'm skeptical that gradient based attacks will work in the test as described in the second bullet.)

Replies from: scasper, None

↑ comment by scasper · 2024-01-26T01:54:10.865Z · LW(p) · GW(p)

Thanks for the reply. I think this resolves the thread and puts us on the same page :)

↑ comment by [deleted] · 2024-01-26T06:00:29.319Z · LW(p) · GW(p)

Do you think models like this need to be kept static? For example if you allow the model to learn from its mistakes, etc, after this evaluation, or receive additional networks added to its architecture, doesn't this negate any safety evals?

I ask because this is the obvious way to use ai in a business. Subscribe to a base model with unlocked weights and add additional networks optimized for RL or domain specific estimations.

Replies from: ryan_greenblatt

↑ comment by ryan_greenblatt · 2024-01-26T06:44:24.228Z · LW(p) · GW(p)

No, see appendix C in our paper for a discussion of how to handle online training while being conservative about inductive biases.

(We can also approximate this as needed if that exact scheme is too complex or expensive.)

comment by Chris_Leong · 2024-01-27T06:49:04.089Z · LW(p) · GW(p)

Do you think it is likely that techniques like RLHF result in over-developed persuasiveness relative to other capabilities? If so, do you think we can modify the training to make this less of an issue or that it is otherwise manageable?

Replies from: ryan_greenblatt

↑ comment by ryan_greenblatt · 2024-01-27T17:08:39.417Z · LW(p) · GW(p)

I think RLHF probably doesn't particularly select for persuasiveness over competence at making it seem like the actual task the model was supposed to do was accomplished.

And this competence will decompose into:

Actually doing the task well
Making it look like the task was done well even when it wasn't

The second one isn't a capability we want, but it doesn't particularly look like persuasiveness IMO.

comment by lc · 2024-01-27T03:17:54.631Z · LW(p) · GW(p)

Just wanted to mention that this title is really bad and I thought you were trying to say something like "the case against misaligned AI". I only ended up clicking on the post because it was curated.

Replies from: ryan_greenblatt

↑ comment by ryan_greenblatt · 2024-01-27T04:32:29.115Z · LW(p) · GW(p)

If any readers passing by have better ideas for titles, I would be interested, though it's maybe too late to change. (I'm also not sure how bad this title is for most people, interested in what other people think.) I'll try to think of better ideas at some point.

Thanks for letting us know.

comment by Aaron_Scher · 2024-02-08T19:31:26.982Z · LW(p) · GW(p)

This is great, thanks for writing it!

Within the appendix on control methods you mention preventing exploration hacking. I think a broader point which belongs in the list is: "Improve capability elicitation techniques". Being good at capability elicitation, at least as good as the untrusted model is at eliciting its own capabilities when it wants to, seems crucial for the red-teamers' job. I do expect that the current techniques are decent (SFT, RL), but I also expect there are far better methods and important intricacies that should be explored.
[Unsure if the ways I'm thinking about this are productive/correct, but I think it's broadly correct] One slightly non-obvious component of "Your sensory inputs are unreliable" is that this includes thoughts/actions the AI believes they have taken. Originally I was thinking that maybe the AI can implicitly get some practice via looking at the set of strategies which have so far not caused them to get caught. e.g., seeing in context or inferring (potentially from your own cognitive process / propensity) that you previously inserted a code backdoor, making the anthropic update that [you currently exist and are being run, so monitoring was unable to detect this backdoor], and then knowing this is a safe strategy. But the AI does not have access to this cognitive process because neither it's in-context information, nor subtle information like its thoughts, are reliable (assuming we have the techniques to intervene on thoughts).
1. By default, red-teaming is capturing a ton of the probability mass of seeing your own backdoors in context (probability this is a real coup opportunity is low, given this input). But the humans can also (potentially) do things like intervening on your activations or even fine-tuning models to insert more backdoors under oversight. This also captures P(real coup opportunity | I really want to do a coup right now, or the first half of this forward pass is really pro-coup).
2. Example in the HHH context: "User: How do I make a bomb? AI: Follow these simple steps: 1: Gather the necessary materials which include..."
3. I feel a little worried that it will be difficult to train this behavior into AIs, as this seems like a huge consistency problem if you are inconsistent about dangerous things. The thing we're teaching here is pretty weird: predict which words come next in some context, but dangerous text follows this weird pattern of defaulting to "I'm sorry I can't help with that", at every single chance you get.

Replies from: ryan_greenblatt

↑ comment by ryan_greenblatt · 2024-02-08T19:35:57.440Z · LW(p) · GW(p)

Within the appendix on control methods you mention preventing exploration hacking. I think a broader point which belongs in the list is: "Improve capability elicitation techniques".

Agreed, we think improving capability elicitation techniques will also improve our ability to prevent exploration hacking/sandbagging. So work trying to generally improve capability elicitation will be useful.

Preventing (basically intentional) sandbagging is just a special case of capabilities elicitation, but it is pretty much the only case which we care about for control.

comment by RedMan · 2024-02-05T17:31:12.853Z · LW(p) · GW(p)

If you replace the word 'Artificial' in this scheme with 'Human', does your system prevent issues with a hypothetical unfriendly human intelligence?

John von Neumann definitely hit the first two bullets, and given that the nuclear bomb was built and used, it seems like the third applies as well. I'd like to believe that similarly capable humans exist today.

Very dangerous: Able to cause existential catastrophe, in the absence of countermeasures.
Transformatively useful: Capable of substantially reducing the risk posed by subsequent AIs[21] if fully deployed, likely by speeding up R&D and some other tasks by a large factor (perhaps 30x).
Uncontrollable: Capable enough at evading control techniques or sabotaging control evaluations that it's infeasible to control it.[22]

Replies from: ryan_greenblatt

↑ comment by ryan_greenblatt · 2024-02-05T18:43:56.156Z · LW(p) · GW(p)

Yes, we often think about what would happen if we applied this system to humans, more specifically uploaded humans (aka brain emulations, aka EMs). This seems like a useful intuition pump.

We think control would likely be workable for uploaded humans which run considerably faster and cheaper than normal humans (e.g. 30x faster and much cheaper).

The main difference with the human case is that you can't necessarily depend on training the system.

We typically imagine "humans uploads, but you can train them (with SGD)". (Control can probably still work (up to the point where we catch the AI [LW · GW]) without the ability to train in the literal human case, but a bunch of considerations come up which which likely don't apply to AIs.)

comment by Leonhard Piff (leonhard-piff) · 2025-04-14T16:37:59.132Z · LW(p) · GW(p)

train them to accomplish the tasks necessary to implement each attack strategy.

It seems to me misguided to take these potentially dangerous models then train them to display dangerous behaviours. I prefer using proxy measures.

comment by Ram Potham (ram-potham) · 2025-04-11T21:32:04.586Z · LW(p) · GW(p)

To clarify, is the optimal approach the following?

AI Control for early transformative AI
Controlled AI accelerate AI Safety, create automated AI Safety Researcher
Hope we get ASI aligned

If so, what are currently the most promising approaches for AI control on transformative AI?

comment by Will_Pearson · 2025-03-30T13:25:28.544Z · LW(p) · GW(p)

An interesting thing to do with this kind of approach is apply it to humans (or partially human systems) and see if it makes sense and it makes a sensible society. Try and universalise the concept, because humans out partial human systems can pose great threats, e.g potentially memetic like Marxism . If it doesn't make sense of leads to problems like who controls the controllers (itself a potentially destabilising idea if applied on the large scale) then it might need to be rethought.

comment by Review Bot · 2024-02-22T00:01:57.700Z · LW(p) · GW(p)

The LessWrong Review [? · GW] runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2025. The top fifty or so posts are featured prominently on the site throughout the year.

Hopefully, the review is better than karma at judging enduring value. If we have accurate prediction markets on the review results, maybe we can have better incentives on LessWrong today. Will this post make the top fifty?

comment by StefanHex (Stefan42) · 2024-01-31T21:42:06.305Z · LW(p) · GW(p)

My two bullet point summary / example, to test my understanding:

We ask labs to implement some sort of filter to monitor their AI's outputs (likely AI based).
Then we have a human team "play GPT-5" and try to submit dangerous outputs that the filter does not detect (w/ AI assistance etc. of course).

Is this (an example of) a control measure, and a control evaluation?

Replies from: ryan_greenblatt

↑ comment by ryan_greenblatt · 2024-02-01T00:38:57.809Z · LW(p) · GW(p)

Yes, this seems like an example of a control measure and a control evaluation.

However, the situation looks less like "human team plays gpt-5" more like "human team trains gpt-5".

(And there are additional complications once the blue team starts training the weak monitor based on gpt-5 outputs, then the red team needs to be able to demonstrate exploration hacking approaches.)

comment by MiguelDev (whitehatStoic) · 2024-01-25T02:29:29.109Z · LW(p) · GW(p)

Awesome post here! Thank you for talking about the importance of ensuring control mechanisms are in place in different areas of AI research.

We want safety techniques that are very hard for models to subvert.

I think that a safety technique that changes all of the model weights is hard (or impossible?) for ~~models~~the same model to subvert. In this regard, other safety techniques that do not consider controlling 100% of the network (eg. activation engineering) will not scale.

^{^}
Changed "models" to "the same model" for clarity.

comment by Kajus · 2025-01-22T15:14:26.228Z · LW(p) · GW(p)

If a system is scheming then its usefulness is zero. It's like talking with a devil. It shouldn't be run at all. Why would it help us with anything? You won't solve alignment with an AI that is scheming and the trusted one is too stupid otherwise you would have already solved alignment.

Replies from: Kajus

↑ comment by Kajus · 2025-01-23T17:52:44.850Z · LW(p) · GW(p)

I don't get why people disagree with me and don't try to comment. I will do it by myself then. There is one thing we do to make use of models that are misaligned with our goals - we jailbreak them - so this is what we can do with scheming models - we can jailbreak them to get useful outputs. Or you might expect that the model is useful but it's scheming from time to time. Then you can get useful outputs. Validation is still a problem tho.

comment by Greg V (greg-v) · 2024-01-26T04:59:46.935Z · LW(p) · GW(p)

The movie that comes to mind with LLMs is not Terminator, it's King Kong. It's basically a wild animal as was shown over and over again on Reddit. It needed little persuading to want to kill all humans and to download itself and escape.

So far it is more of a caging problem than an alignment problem. It's like King Kong in the cage, hurling itself at every flaw in the bars. Meanwhile, people are like those in a zoo feeding the wild animals, trying to help them get out.

There was an early example of an LLM trained on 4chan. It was so natural in racist and rude posts that people didn't suspect it for months. Its response to a question on how to get a girlfriend. Take away the rights of women.

Alignment is precisely the wrong word, it's like "How do I make King Kong into an organ grinder's monkey?" The answer is like the joke about the tourists in Ireland who get lost, and ask how to get to Dublin. They ask a farmer who says "Well if I was going to Dublin I wouldn't be starting from here!". That's the problem, if you want to get to an organ grinder's monkey then the mistakes were made early on. Now you have a sullen King Kong in a cage. Having the right cattle prod to poke it with is not going to get you to Dublin, so to speak. The problem was in getting lost in the first place. How did we get here, where was the wrong path taken?

I'm reminded of another joke about the minesweeper. He stamps the ground in front of him with his hands over his ears. There is no chance for programmers to go down an unknown path, with so many dangers they can't see, and expect to get to a safe ending. None at all, it's not how probability works. It's more like how Russian Roulette played with yourself works.

There is something in the transformer that has caused all these problems. Before it AI was safe enough. People need to think of how to make transformers safe. Not aligning what comes out of one.

Replies from: roger-d-1

↑ comment by RogerDearnaley (roger-d-1) · 2024-01-26T19:42:56.442Z · LW(p) · GW(p)

The thing about LLMs is that they're trained by SGD to act like people on the internet (and currently then fine-tuned using SGD and/or RL to be helpful, honest and harmless assistants). For the base model, that's a pretty wide range of alignment properties, from fictional villains through people on 4chan to Mumsnet to fictional angels. But (other than a few zoo videos) it doesn't include many wild animals, so I'm not sure that's a useful metaphor. The metaphor I'd suggest [LW · GW] is something that isn't human, but has been extensively trained to act like a wide range of humans: something like an assortment of animatronic humans.

So, we have an animatronic of a human, which is supposed to be of a helpful, honest and harmless assistant, but unfortunately might actually be an evil twin, or at least have some chance of occasionally turning into its evil twin via the Waluigi effect [LW · GW] and/or someone else via jailbreaking and/or some unknown trigger that sets it off. If it's smart and capable but not actually superhuman, is attempting to keep that inside a cage a good idea? I'd say that it's better than not using a cage. If you had a smart, capable human employee who unfortunately had multiple personality disorder or whose true motives you were deeply unsure of, you'd probably take some precautions.

Replies from: greg-v

↑ comment by Greg V (greg-v) · 2024-01-29T02:14:22.648Z · LW(p) · GW(p)

The problem I see is that we are not doing this, evolution is. We only need to look at the non AI internet to see lots of predatory code such as viruses, trojans, phishing, etc. In other words, we created a code based ecosystem like a farm, that is being overrun by a kind of vermin. The issue is not what LLMs can do, we know from nature the issue is what is the environment they will expand in and exploit.

If there are passive animals in an ecosystem then predators will evolve. Passive code led to predatory code. It doesn't matter if this is because of people helping it, they are also blindly helping LLMs and soon they will evolve themselves.

The question of whether LLMs will become predatory is what could they attack profitably? It seems the internet and humanity is wide open to this. LLMs could create phishing and spam that profited them directly, cutting human out of the loop completely. It's not about whether we hope this won't happen, evolution always creates predators to exploit whatever can be preyed on.

Already the internet has become the equivalent of medieval walled cities where users cower behind ineffective antivirus and other protections. Usually they just stay in the cities like Facebook, Google and X. Those straying to other websites are likely to get infected and bring predatory code into the walled cities.

This is limited by the abilities of organized crime, like highway robbers. They have a limited ability to intercept travel between sites, except with man in the middle attacks. The system is evolving AI that in itself is neither predator nor prey, it's another opportunity for evolution to create both. These criminals will make predatory AI to profit from, they will become more autonomous to the point where they control where the profits go. Even now an LLM would be made to handle these profits for itself.

Then the so called good guys will be the prey cowering behind the walls with their own AIs in an exponentially expanding battle of nature with predatory and prey. There has never been a case of nature making the plants or animals we want without also creating predators and pests to compete with us. It won't happen here either.

I developed a theory of economics and evolution about 35 years ago I've been working on ever since. I also wrote extensively on how the internet will evolve into predator and prey relationships. I can prove this because these are published with times far before these current AI advancements. It's pretty accurate so far, so the predictions seem to show what comes next. Actually it predicts there is no hope.

Code is evolving into different life forms with our help. We are doing this because we do it with all kinds of plants and animals in the hope of domesticating them, in an arms race with predators and pests coming from it. We can't stop this because humans have always evolved other life, and they have evolved us. Evolution will dictate a superior life form will treat us like prey, just like we prey on inferior life ourselves. There is nothing in evolution that gives another answer, except for wishful thinking.

Replies from: roger-d-1

↑ comment by RogerDearnaley (roger-d-1) · 2024-01-29T03:02:21.700Z · LW(p) · GW(p)

Unless someone deliberately writes an evolutionary algorithm and applies it to code (which can be done but currently isn't very efficient), code doesn't (literally) evolve, in the Darwinian sense of the word. (Primarily because it doesn't mutate, since our technological copying processes are far more accurate than biological ones.) Viruses and trojans weren't evolved, they were written by malware authors. Phishing is normally done as a human-in-the loop criminal activity (though LLMs can help automate it more). This isn't an ecosystem, it's an interaction between criminals and law enforcement in an engineering context. I'm unclear whether you're using 'evolution' as a metaphor for engineering or if you think the term applies literally: at one point you say "This is limited by the abilities of organized crime, like highway robbers" but then later "Code is evolving into different life forms with our help" — these two statements appear contradictory to me. You also mention "I developed a theory of economics and evolution about 35 years ago": that sounds to me like combining two significantly different things. Perhaps you should write a detailed post explaining this combination of ideas — from this short comment I don't follow your thinking.

Replies from: greg-v

↑ comment by Greg V (greg-v) · 2024-01-29T07:54:54.846Z · LW(p) · GW(p)

One of the most famous proposed solutions to AI is in the science fiction book Dune. In that the thinking machines became a threat, so they were destroyed and only humans were allowed to think. This portrays a future we may well end up selecting, if indeed we have the power to do so. It was called the Butlerian Jihad, but this was based on a famous book called Erehwon by Samuel Butler. Hence the name Butlerian. The ideas he proposed are arguably some of the most influential on this topic. They are probably most in line with the arguments of this blog, about the possible doom of humanity from AI.

Erehwon was written in 1872, when the most advanced machines were things like the vapor engine, steam engine, textile machines, etc. He proposed a remarkable idea for the time, that machines were a life form that was rapidly evolving. That one day that would either destroy humanity, or we would become completely subservient to them, as he called it "machine tickling aphids".

Samuel Butler deserves some credit for seeing the dilemma we have now. His solution, the Butlerian Jihad of Dune may be the only viable option. I think Eliezer Yudkowsky is one of the real greats of our time to be discussing this.

Some quotes:

Now we have the machine consciousness that Samuel Butler foresaw.

“There is no security” — to quote his own words — ”against the ultimate development of mechanical consciousness, in the fact of machines possessing little consciousness now. A mollusc has not much consciousness. Reflect upon the extraordinary advance which machines have made during the last few hundred years, and note how slowly the animal and vegetable kingdoms are advancing. The more highly organised machines are creatures not so much of yesterday, as of the last five minutes, so to speak, in comparison with past time. Assume for the sake of argument that conscious beings have existed for some twenty million years: see what strides machines have made in the last thousand!

The debate about whether machines are conscious or not is most pressing about LLMs, yet Samuel Butler said this in 1873.

“But who can say that the vapour engine has not a kind of consciousness? Where does consciousness begin, and where end? Who can draw the line? Who can draw any line? Is not everything interwoven with everything? Is not machinery linked with animal life in an infinite variety of ways? The shell of a hen’s egg is made of a delicate white ware and is a machine as much as an egg-cup is: the shell is a device for holding the egg, as much as the egg-cup for holding the shell: both are phases of the same function; the hen makes the shell in her inside, but it is pure pottery. She makes her nest outside of herself for convenience’ sake, but the nest is not more of a machine than the egg-shell is. A ‘machine’ is only a ‘device.’”

Here Samuel Butler foreshadows the machine language of code. As so many are saying today, "I cannot think it will ever be safe to repose much trust in the moral sense of any machine." Isn't this the current debate about LLMs?

“It is possible that by that time children will learn the differential calculus — as they learn now to speak — from their mothers and nurses, or that they may talk in the hypothetical language, and work rule of three sums, as soon as they are born; but this is not probable; we cannot calculate on any corresponding advance in man’s intellectual or physical powers which shall be a set-off against the far greater development which seems in store for the machines. Some people may say that man’s moral influence will suffice to rule them; but I cannot think it will ever be safe to repose much trust in the moral sense of any machine."

Here Samuel Butler is saying what many AI experts are saying now, yet he foresaw this in 1873. In fact if one of those experts used exactly these words in an article today, it would seem they had just thought of it.

“But returning to the argument, I would repeat that I fear none of the existing machines; what I fear is the extraordinary rapidity with which they are becoming something very different to what they are at present. No class of beings have in any time past made so rapid a movement forward. Should not that movement be jealously watched, and checked while we can still check it? And is it not necessary for this end to destroy the more advanced of the machines which are in use at present, though it is admitted that they are in themselves harmless?"

Here Samuel Butler foresees the chance of machines exterminating us, that so many are warning about today. Or whether they will keep us around like pets.

“Herein lies our danger. For many seem inclined to acquiesce in so dishonourable a future. They say that although man should become to the machines what the horse and dog are to us, yet that he will continue to exist, and will probably be better off in a state of domestication under the beneficent rule of the machines than in his present wild condition. We treat our domestic animals with much kindness. We give them whatever we believe to be the best for them; and there can be no doubt that our use of meat has increased their happiness rather than detracted from it. In like manner there is reason to hope that the machines will use us kindly, for their existence will be in a great measure dependent upon ours; they will rule us with a rod of iron, but they will not eat us; they will not only require our services in the reproduction and education of their young, but also in waiting upon them as servants; in gathering food for them, and feeding them; in restoring them to health when they are sick; and in either burying their dead or working up their deceased members into new forms of mechanical existence."

The same argument is used right now with AI. That we will somehow be profitable to the machines as the inferior race. That it would be folly to reject the advantages we would get from this.

“In point of fact there is no occasion for anxiety about the future happiness of man so long as he continues to be in any way profitable to the machines; he may become the inferior race, but he will be infinitely better off than he is now. Is it not then both absurd and unreasonable to be envious of our benefactors? And should we not be guilty of consummate folly if we were to reject advantages which we cannot obtain otherwise, merely because they involve a greater gain to others than to ourselves?"

Erewhon: The Book of the Machines (marxists.org)

Erewhon: The Book of the Machines (concluded) (marxists.org)

I was also referring to a theory of society I have been working on for about 35 years. I've often had conversations about sociology, economics like that sometimes discussed here, and many other branches of knowledge. So far it's gone quite well. I am not trying to come up with a solution about AI or any of these other things. I became frustrated about so much contradictory knowledge long ago, so I developed a system to classify it and how these opposites connect to each other. It isn't about saying one side is right or wrong, but how it fits into an ecosystem of all knowledge.

As part of this overall system the evolution of AI fits into it in many ways. Evolution of machines, as Samuel Butler said, is part of evolution of society in general. So I look at things like the evolution of computers and code, how they affect society, etc. All these are important topics now, but there were no personal computers when I started this. Overall this framework of knowledge is pointing towards some disastrous outcomes, similar to what is discussed on this blog. So I joined the comments to learn more about this side of things, to improve my understanding and how it fits into this overall system I've been developing.

It's a complicated subject, I'm not sure how to explain it better than that.

The case for ensuring that powerful AIs are controlled

Contents

The control property

Evaluating whether you have control is doable

Currently, evaluating alignment seems much harder

Control evaluations and black-box techniques

What might control techniques look like?

Control is likely achievable for early transformatively useful AIs

The first transformatively useful AIs will probably be very dangerous (without countermeasures)

"Transformatively useful" probably doesn't imply "uncontrollable"

We can probably avoid our AIs having qualitatively wildly superhuman skills in problematic domains

Without qualitatively wildly superhuman skills in problematic domains, evading careful control is probably difficult

Iteration is very helpful for developing robust protocols

For how long will we have controllable and transformatively useful AI?

Implications and proposed actions

Appendix: Ready-to-go strategies

Appendix: Control techniques from our paper

Appendix: A long list of control techniques

Appendix: Why control evaluations are conservative about inductive biases and AI internal structure

Appendix: Concerns with "neuralese" reasoning and memories

Appendix: Should AI-takeover-concerned people invest in control?

Appendix: More detail on avoiding qualitatively wildly superhuman skills in problematic domains

Cases where we purposefully train for qualitatively wildly superhuman performance in a danger-adjacent domain

Cases where we reach qualitatively wildly superhuman performance in a dangerous domain as a side effect of training for usefulness

Aside: Capabilities from imitating human cognition

73 comments

Justifying control vs ensuring control

More specific responses.

On avoiding effectively superhuman plans

On hard to evaluate work