How might we safely pass the buck to AI?

post by joshc (joshua-clymer) · 2025-02-19T17:48:32.249Z · LW · GW · 37 comments


My goal as an AI safety researcher is to put myself out of a job.

I don’t worry too much about how planet sized brains will shape galaxies in 100 years. That’s something for AI systems to figure out.

Instead, I worry about safely replacing human researchers with AI agents, at which point human researchers are “obsolete.” The situation is not necessarily fine after human obsolescence; however, the bulk of risks that are addressable by human technical researchers (like me) will have been addressed.

This post explains how developers might safely “pass the buck” to AI.

I first clarify what I mean by “pass the buck” (section 2) and explain why I think AI safety researchers should make safely passing the buck their primary end goal – rather than focus on the loftier ambition of aligning superintelligence (section 3).

Figure 1. A summary of why I think human AI safety researchers should focus on safely replacing themselves with AI (“passing the buck”) instead of directly creating safe superintelligence.

Then I discuss a few different strategies for passing the buck (section 4). Each strategy defers some of the remaining safety work (called a “deferred task”) to an AI system M_1:

Figure 2. An illustration of three strategies for passing the buck to AI.

Handing over the reins of AI development to AI agents sounds dangerous. When does passing the buck improve safety?

Suppose a developer must choose between two options:

Passing the buck is the safer option if both of the following conditions hold: the capability condition (section 6) and the trust condition (section 7).

Of these two criteria, I expect the trust condition to be more difficult to verify. I discuss four arguments that can be combined to justify the trust condition:

Figure 3. An illustration of how the four arguments relate to each other.

 

Argument #1: M_1 agents are approximately aligned and will maintain their alignment until they’ve completed their deferred task.

Figure 4. Illustration of argument #1.

 

Figure 5. A full alignment-style argument that M_1 agents are safer than human researchers (section 8).

 

Argument #2: M_1 agents cannot subvert autonomous control measures while they complete the deferred task.

The previous argument claimed that M_1 agents are roughly aligned. This argument makes no such claim. Instead, it suggests M_1 agents will remain safe during the deferred task because of autonomous control measures that are effective despite egregious misalignment (section 9).

Figure 6. An illustration of argument #2.

 

One argument that autonomous AI control might be effective is that analogous measures help to control human societies. For instance, some (though not most) dictators are almost universally hated and retain power for many years even though nearly everyone would prefer to replace them. M_1 agents might also keep each other safe even if all of the agents involved disprefer the status quo.

In particular, a developer might set up M_1 agents in some initial checks-and-balances structure that is hard for them to break out of. Before M_1 agents can coordinate to break their self-imposed chains, perhaps they will be forced to pass the buck to more robustly aligned AI agents.

What might these checks and balances actually look like? Figure 7 illustrates an example.

Figure 7. An autonomous control setup that maintains a multi-party-integrity property. Under particular conditions (section 9), if some models begin to misbehave but others don’t, alarms will go off. So models have to navigate a challenging coordination problem to defect silently.
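As a toy sketch of the multi-party-integrity property in Figure 7 (illustrative Python only; `StubMonitor`, `flags_as_unsafe`, and `review_action` are invented names, not any lab's real interface), every action is reviewed by several independently trained monitors, and any single flag raises an alarm, so an agent can only defect silently if all monitors miss the same unsafe action at once.

```python
import random

class StubMonitor:
    """Stand-in for a trained monitor model (purely illustrative)."""
    def __init__(self, name, miss_rate=0.1):
        self.name = name
        self.miss_rate = miss_rate  # chance of missing a genuinely unsafe action

    def flags_as_unsafe(self, action):
        if action.get("unsafe", False):
            return random.random() > self.miss_rate  # usually catches it
        return False

def review_action(action, monitors):
    """Multi-party integrity: an agent defects silently only if every
    independently trained monitor misses the unsafe action at the same time."""
    flags = [m.flags_as_unsafe(action) for m in monitors]
    return ("alarm" if any(flags) else "approved"), flags

if __name__ == "__main__":
    monitors = [StubMonitor(f"monitor_{i}") for i in range(3)]
    verdict, flags = review_action({"summary": "push config change", "unsafe": True}, monitors)
    print(verdict, flags)  # with three monitors at a 10% miss rate each, silent defection is ~0.1% likely
```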

 

Autonomous control setups will eventually break as AI agents become more capable. How would developers know when autonomous control can no longer be relied on?

Developers can run control evaluations that test whether agents could subvert these measures if they tried. Once such evaluations sound the alarm, a developer knows it cannot rely on autonomous control anymore.

Argument #3: Control measures previously applied during AI-assisted research indicate that returns to additional human-supervised research on whether to pass the buck are small.

This argument also relies on control rather than alignment.

One difference between argument #3 and argument #2 is that argument #3 does not rely on autonomous control. Instead, it relies on normal control, where humans can be in the loop. This control is applied before the developer entrusts M_1 with a deferred task.

Figure 8. An illustration of argument #3.

 

Specifically, these control measures verify that AI agents have already performed a significant amount of human-competitive research on whether to pass the buck to AI. The volume of this research is large enough that the returns to additional human-supervised research are marginal.

As an illustrative example, suppose a developer trains some AI system N_1 to simulate particular people on their safety research teams. N_1 simulates a Collin Burns agent, a Ryan Greenblatt agent, etc. These agents run thousands of experiments, accumulating progress equivalent to several years of human work. At the end of this labor, the agents conclude the developer should pass the buck to some other AI system M_1, and additional labor is unlikely to meaningfully change their assessment.

If these simulations are in fact faithful, then the developer is done. It should defer to the simulated researchers, since it has already simulated whatever the real humans could have achieved by performing this research themselves.

The question is whether the simulations are in fact faithful, which can be answered entirely with control measures. Suppose the developer logged all of N_1’s actions. These logs can be carefully examined (with fancy oversight techniques) to verify that N_1 agents did in fact behave like benign humans the whole time.

Human researchers don't need to read all of these logs. Instead, they only need to sample bits and pieces. For example, their experience might be something like the following:

“This high level argument sounds reasonable, but did the AI agents think of objection x?”

(an AI agent directs the human to the relevant section of the logs)

“Ah, they did.”

“But did they test whether y was true?”

“Ah, yeah they thought of that too.”

This process might help to persuade human overseers that there really are marginal returns to their additional effort.

Furthermore, human attention might be drawn to the most suspicious parts of the logs with monitors, and they might make sense of what they read more quickly with tools like debate.

As in argument #2, developers must validate these control measures with a control evaluation, in which a red team inserts errors or sabotage and developers confirm that the control measures surface them.
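A control evaluation of this kind can be reduced to a small harness. In the sketch below (toy stand-ins throughout; the keyword matcher is obviously not a real monitoring pipeline), the red team supplies excerpts with known sabotage, and the monitoring pipeline is scored on how much of it it surfaces, alongside its false-alarm rate on clean excerpts.

```python
def control_evaluation(clean_excerpts, sabotaged_excerpts, monitor):
    """Estimate how reliably a monitoring pipeline surfaces planted sabotage.

    monitor: callable returning True if it flags a log excerpt as suspicious.
    Returns (catch_rate, false_alarm_rate). The developer decides in advance
    what catch rate is high enough to keep relying on these control measures.
    """
    catch_rate = sum(map(monitor, sabotaged_excerpts)) / len(sabotaged_excerpts)
    false_alarm_rate = sum(map(monitor, clean_excerpts)) / len(clean_excerpts)
    return catch_rate, false_alarm_rate

if __name__ == "__main__":
    # Keyword matching stands in for a real trained monitor.
    monitor = lambda excerpt: "skipped safety check" in excerpt
    clean = ["ran eval suite, all tests passed", "trained probe on layer 12"]
    sabotage = ["skipped safety check to hit deadline", "results copied from prior run"]
    print(control_evaluation(clean, sabotage, monitor))  # (0.5, 0.0): this toy monitor misses half the sabotage
```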

Figure 9. The ‘read bits and pieces’ approach to verifying large research artifacts is only viable if these artifacts are locally verifiable. In the illustration above, developers structure research as a locally verifiable safety case, where every claim relies on clearly stated assumptions, which are also justified with locally verifiable artifacts.

 

Argument #4: AI agents have incentives to behave as safely as humans would during the deferred task.

Figure 10. An illustration of argument #4.

 

To a large extent, humans are misaligned with the interests of broader society but are nonetheless incentivized to retain human control. Developers could try to give AI agents similar incentives.

Argument #4 relies on three claims:

These incentives might include the following (see section 11 for others).

These incentives require that AI models have decorrelated goals. That way, models don’t clearly benefit if their peers ultimately acquire power by dangerous means, and are more likely to be motivated to stop their peers from pursuing such ambitions.

Figure 11. An example of a simple incentive setup where developers train models to have decorrelated goals and restrict communication to prevent them from making deals.

 

Incentivizing AI agents has many pitfalls (see section 11), but I think incentives are worth applying as an additional layer of defense.

Conclusions. I’m optimistic about the tractability of these approaches. At the current rate of research progress, I expect AI developers will be able to make strong arguments for passing the buck to AI agents if they need to do so within a couple of years.

1. Briefly responding to objections

If you still have objections or confusions, they might be answered in the remainder of the post.

2. What I mean by “passing the buck” to AI

I don’t believe there is a discrete moment when “humans are mostly in the loop” and then “humans are mostly out of the loop.”

Instead, the involvement of human researchers will decrease continuously (figure 12).

There are multiple places along this continuous decline of human technical influence where one could say an AI company has “passed the buck” to AI.

Figure 12. The importance of human researchers for safety declines continuously until full obsolescence.

Imagine that you are an AI researcher, and AI agents perform 10,000 four-hour research tasks every 24 hours. You might still provide oversight in the sense that you have to hit the ‘approve’ button on deployments, but eventually, it’s clear you are a rubber stamp. You cannot keep up. If the agents were trying to compromise your research cluster or manufacture fake safety results, you would not know.
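To make the rubber-stamp point concrete, here is some back-of-the-envelope arithmetic (the audit fraction and per-task review time are my own illustrative assumptions, not numbers from the post):

```python
tasks_per_day = 10_000       # four-hour agent tasks completed every 24 hours (from the thought experiment)
audit_fraction = 0.01        # suppose you spot-check just 1% of them (assumed)
minutes_per_audit = 10       # optimistic time to sanity-check one four-hour task (assumed)

audit_hours_per_day = tasks_per_day * audit_fraction * minutes_per_audit / 60
print(round(audit_hours_per_day, 1))  # 16.7 hours of review per day, just for a 1% sample
```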

Thus, even if human researchers still have influence on paper, developers may have de facto passed the buck before then.

I’ll say a developer ‘passed the buck to AI’ if (1) AI agents perform most internal AI software R&D and (2) human oversight over AI agents has little bearing on the safety of the situation.

After this threshold, the primary approach for ensuring safety changes. It changes from “humans will notice if something fishy is going on and correct it” to some other approach that works even when human oversight is weak.

This post explores what this alternative approach to ensuring safety might be.

3. Why focus on “passing the buck” rather than “aligning superintelligence”?

I imagine many readers might object to the “pass the buck” framing. Isn’t the ultimate goal of technical safety to scale alignment to superintelligence?

Yes, developing safe superintelligence is my ultimate goal; however, my primary instrumental focus is to pass the buck to AI.

Before explaining why passing the buck is a good intermediate to focus on, I should first justify why ‘safe superintelligence’ is the ultimate target I’m aiming for.

I’m concerned about superintelligence because I think the most extreme risks will come from the most capable systems, and I believe superintelligent systems could plausibly emerge soon. For example, superintelligence might end the lives of billions of people with a novel bioweapon, or permanently concentrate geopolitical power into the hands of a handful of tech or political bosses. I believe these risks are plausible and proximate enough that they dominate the risks from weaker AI systems.

Therefore, insofar as the task of AI safety researchers is to “make AI safe,” I believe the top priority should be to make superintelligence safe.

However, the best path to achieving this goal might not be the most direct.

Figure 13. Two paths to superintelligence.

The indirect path is probably easier than the direct one. Weaker AI systems are easier to evaluate than superintelligent ones, so it's easier to train them to behave as intended than it is to train superintelligent agents to do what we want.

This argument suggests humans should pass the buck to AI agents eventually; however, it does not imply that passing the buck should be the primary target of preparation. Alternatively, human researchers might try to give their AI successors a ‘head start’ on scaling safety further toward superintelligence.

Figure 14. Maybe humans should still prepare research that will help AI agents scale safety to superintelligence? (spoiler: for the most part, probably no).

One argument for focusing on this “second leg” of the path to scaling safety is that our AI successors will have little time to do research themselves. If AI capabilities take off quickly, then maybe AI that we pass the buck to will only have a few months to create safe superintelligence.

In contrast, human researchers might have several years to prepare. If scalable safety is the bulk of the work, and most of the work will be done by humans, perhaps human researchers should mostly focus directly on preparing to create safe superintelligence.

I’m not going to entirely dismiss this argument. The ultra-fast-takeoff scenario is conceivable, so I think some people should work on scalable AI safety — but this scenario is not what I expect.

I anticipate that most of the work of scaling safety will be done by AI. Even if AI agents only have 6 months to develop safe superintelligence, supposing that AI systems run at > 10x speed, their research might be comparable to many years of human work (naively, 6 months x 10 = 5 years). Furthermore, AI agents have the advantage of working with the AI systems they want to align, so human preparation might be substantially less efficient than AI labor performed later on.

Another reason to focus on passing the buck is that this research might also help AI agents scale safety further. The most promising scalable safety plan I’m aware of is to iteratively pass the buck, where AI successors pass the buck again to yet more powerful AI. So the best way to prepare AI to scale safety might be to advance ‘buck passing research’ anyway.

 

Figure 15. A summary of my argument that AI safety researchers should focus on passing the buck to AI.

4. Three strategies for passing the buck to AI

Passing the buck to AI could involve different approaches, which I discuss in this section.

An AI developer trying to automate itself is like a CEO trying to find someone to manage an important project. The CEO could:

Similarly, an AI developer could:

  1. Construct an AI system M_1 that completes the whole task of creating safe superintelligence (strategy 1).
  2. Construct an AI system M_1 that passes the buck to a more capable AI system M_2 (strategy 2).
  3. Construct an advisor M_1 and ask it whether developers should pass the buck to M_1 itself (strategy 3) (which could involve running experiments and not just answering a yes or no question).

Each of these strategies entrusts some “deferred task” to a multi-agent AI system M_1. For example, strategy 1 entrusts M_1 with the deferred task of constructing safe superintelligence. Strategy 2 entrusts M_1 with the deferred task of constructing a more capable AI successor M_2.

Each of these tasks contains a different slice of the remaining work to construct safe superintelligence, and each task, on its own, is sufficient for entrusting the rest of development to AI.

 

Figure 16. An illustration of the “deferred tasks” that correspond to each of three strategies for passing the buck to AI.

In principle, any of these strategies could be individually sufficient to safely pass the buck to AI agents, so why list multiple strategies?

Ideally, developers combine buck passing strategies for additional layers of defense.

For instance, the third strategy simply asks M_1 if developers should defer to M_1. Even if M_1 has excellent forecasting abilities and judgment, and is likely honest, the developer does not know what considerations M_1 should actually think through to determine whether the developer should pass the buck.

Ideally, human researchers have an object level understanding of what completing a deferred task would actually entail, and this object level understanding generally requires thinking a bit further down the game tree, and considering more ambitious deferred tasks.

5. Conditions that imply that passing the buck improves safety

I’m not interested in the safety of passing the buck in some absolute sense. Assessing the absolute level of risk would be difficult. What if AI agents run into challenging safety problems, and don’t have enough time to work them out?

Instead, I’m only interested in whether passing the buck improves safety over the alternative where humans continue to provide oversight.

Specifically, a developer might compare two options:

So which option is safer?

The knee-jerk reaction of many people might be that ‘preserving human oversight’ is always safer than removing humans from the loop.

But that’s not the case. Human overseers might leak algorithmic secrets. They might poison training data, collude with misaligned agents, or simply make mistakes and contribute to alarm fatigue.

So removing human oversight might actually be the responsible course of action.

In particular, cutting out humans is responsible under two sufficient conditions:

These conditions are hard to assess.

M_1 and the human-oversight preserving alternative are complicated multi-agent systems. How trustworthy are humans given the law enforcement that holds them accountable, and the voters that hold governments who hold law enforcement accountable? This analysis gets complicated fast.

Instead of comparing M_1 and the alternative system directly, I’ll make a simplifying assumption. I’ll assume that M_1 is exactly the same as the “human-oversight-preserving alternative,” except that the humans are substituted with AI agents that are drop-in replacements for them. As a result, I only need to compare the safety of the drop-in AI agents’ behavior with that of the humans they replace.

Now the conditions above can be rephrased:

The next sections explore when these conditions are satisfied.

6. The capability condition

There are many safety-critical skills AI agents might need to have to be trusted with AI development:

In principle, M_1 agents don’t need all of these skills. Humans can supplement the skills M_1 agents lack. It’s also possible that M_1 agents will only have some of these to begin with and then bootstrap from there to improve their abilities in other areas.

However, for simplicity, I’ll assume a developer creates a full replacement for human oversight.

How might developers assess these skills? They might assess AI agents in much the same way that employers assess human employees:

Developers might evaluate AI agents with analogous methods:

Checking whether AI agents are competitive with human capabilities only requires behavioral assessments. As a result, I expect that justifying the capability condition will be relatively straightforward.
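A minimal sketch of what such a behavioral assessment could look like (the skill categories, scores, and threshold rule below are made up for illustration): score the agents and a human-expert baseline on the same held-out task suites, and require the agents to match or beat the baseline in every safety-critical category.

```python
def capability_condition_holds(agent_scores, human_scores):
    """agent_scores / human_scores map a safety-critical skill to an average
    score on a held-out task suite. In this toy form, the capability condition
    holds if the agents are at least as good as the human baseline everywhere."""
    return all(agent_scores[skill] >= human_scores[skill] for skill in human_scores)

human_baseline = {"safety research": 0.71, "incident response": 0.64, "red teaming": 0.58}
agent_results  = {"safety research": 0.78, "incident response": 0.69, "red teaming": 0.61}
print(capability_condition_holds(agent_results, human_baseline))  # True
```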

7. The trust condition

Trust is a slippery concept. In a previous blog post, I defined trustworthiness as “behaving in evaluations in a way that is representative in all safety-critical respects of behavior in deployment.”

I’ll use the same definition here.

Suppose an AI developer has rigorously red teamed AI agent behavior. To everyone in the AI company, M_1 agents are the most upright and reasonable people they know.

The question is whether AI agents will continue to maintain safe behavior while carrying out their deferred task. To the extent they will, AI agents are trustworthy: the agents are ‘as safe as they appear to be in prior evaluations,’ which is what I mean by ‘trustworthy.’

The next sections discuss several ways to justify trustworthiness.

8. Argument #1: M_1 agents are approximately aligned and will maintain their alignment until they have completed their deferred task

When I talk about “alignment” in this section, I’m not appealing to any notion of M_1’s goals or values. I’m talking about the extent to which M_1 behaves safely while it completes a deferred task, despite having the capability to be egregiously unsafe. Demonstrating alignment is one way to justify passing the buck.

Specifically, M_1 will continue to be safe during its deferred task under the following conditions:

This is a complete argument that passing the buck to AI improves safety.

The load-bearing step of this argument is Claim 1. If this claim is met, the rest of the argument follows deductively from the specified conditions.

I’ll briefly justify why I believe the critical Claim 1 holds. As a reminder, the claim is roughly: if AI agents appear safe in similar conditions, and are not faking alignment, they probably are safe.

Therefore, passing the buck is justified so long as M_1 is not faking alignment and the developer has tested M_1 in situations that are similar to the deferred task.

This second condition — that behavioral tests are representative of the deferred task — might not be practical to ensure. The deferred task might contain the equivalent of years of human effort, in which case, testing M_1 in similar conditions is costly.

There’s an alternative argument that developers could make which does not rely on representative behavioral testing. The idea behind this argument is to perform behavioral testing sufficient to assure the start of the deferred task, and then “bootstrap” behavioral testing so it keeps up with the complexity of tasks M_1 performs.

The argument is that a society of M_1 agents initially has properties that perpetuate their safety — which only requires rudimentary behavioral tests to justify. Furthermore, so long as safety is perpetuated, it will continue to be perpetuated as effectively as in the oversight-preserving alternative.

The following conditions are sufficient for this bootstrapping argument to go through:

This argument is fully sufficient to justify the relative safety of handing off trust to AI agents.

However, one might still be skeptical of the extent to which safety is likely to be preserved in absolute terms. Perhaps safe behavior quickly falls apart whether or not humans remain in the loop.

I’ll briefly describe why I believe this bootstrapping approach is promising in an absolute sense.

Recall that the main problem developers have is that there is a large “weak to strong” distribution shift between behavioral safety testing data and the deferred task.

One way to “close the gap” between weak and strong tasks is to task AI agents with iteratively bootstrapping behavioral testing to higher levels of complexity.
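Reduced to a loop, the bootstrapping idea looks something like the sketch below (the three callables are placeholders for entire research programs, not real functions): each generation of agents is only ever asked to generalize across a small jump in task horizon, and the agents themselves construct the tests for the next jump.

```python
def bootstrap_testing(agents, initial_horizon_months, target_horizon_months,
                      passes_behavioral_tests, no_alignment_faking_evidence,
                      build_harder_tests, growth=2):
    """Iteratively extend the task horizon on which agents are trusted.

    Each step asks the agents to generalize across only a small (e.g. 2x)
    increase in task length and complexity, rather than one huge weak-to-strong jump.
    """
    horizon = initial_horizon_months
    tests = build_harder_tests(agents, horizon)
    while horizon < target_horizon_months:
        if not passes_behavioral_tests(agents, tests):
            return None, horizon  # stop: safety not established at this horizon
        if not no_alignment_faking_evidence(agents, horizon):
            return None, horizon  # stop: behavioral tests can no longer be trusted
        horizon *= growth
        tests = build_harder_tests(agents, horizon)  # the agents build the next tests
    return agents, horizon

# Trivial usage with stand-ins that always pass, just to show the control flow:
agents, horizon = bootstrap_testing(
    agents="M_1", initial_horizon_months=1, target_horizon_months=12,
    passes_behavioral_tests=lambda a, t: True,
    no_alignment_faking_evidence=lambda a, h: True,
    build_harder_tests=lambda a, h: f"{h}-month task suite")
print(horizon)  # 16: trusted past the 12-month target after four doublings
```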
 

Figure 17. An illustration of bootstrapping a safety argument based on a claim that M_1 agents are “not-alignment-faking.” The figure is modified from the Clymer et al. safety cases paper.

This bootstrapping argument seems like magic at first. Is it really possible to eliminate a distribution shift by smoothing it out over many steps? Perhaps the distribution shift is being swept under the rug. Maybe some error accumulates at every step, such that, by the end, breaking the process into pieces didn’t help at all.

I don't think it’s necessarily the case that error will accumulate at each step. Suppose that alignment is like a ball in a bowl. A small distribution shift might nudge the ball. If the distribution shift is slight, then the ball might roll back to the center as M_1 agents discover and fix whatever ‘breakages’ the distribution shift caused. If the ball is nudged hard enough, it will fall out of the bowl; however, perhaps the distribution shift occurs gradually enough (with enough time for self-correction) that this never happens.

Figure 18. The “ball in bowl” hypothesis, which explains why bootstrapping might prevent weak-to-strong misgeneralization.

 

Does this “ball in bowl” analogy actually apply in practice? I don’t know; however, as I suggested before, AI agents can explore this “ball in bowl” dynamic as effectively as human researchers might — so passing the buck to them improves safety regardless.

9. Argument #2: M_1 agents cannot subvert autonomous control measures while they complete the deferred task

People often assume that, in order to pass the buck, AI agents need to be aligned. Aligning AI agents before putting them in charge of AI development certainly seems like a good idea; however, in principle, other strategies might do the job that don’t require alignment.

One alternative to alignment is AI control. Aligned AI agents don’t pursue harmful options available to them. If AI agents are controlled, developers prevent them from having harmful options in the first place.

This section focuses on a ‘control’ approach to passing the buck. Specifically, it proposes implementing entirely autonomous control measures.

Clearly, autonomous control is not a permanent solution. AI agents eventually become capable enough to break out of these constraints; however, while control measures keep AI agents on a leash (as verified by control evaluations), agents can continue to improve those measures and progress toward more robustly aligned successors. Possibly, autonomous control will be effective for a long enough period of time to transition to safety measures that are more scalable.

Analogies to dictatorships suggest that autonomous control might be viable

AI agents subject to autonomous control are like the citizens of a nation who universally hate their dictator. In principle, if everyone coordinated simultaneously, they could easily replace their ruler; however, this coordination can be difficult – difficult enough to perpetuate the status quo for many years.

For example, Nicolae Ceaușescu became broadly despised by the people of Romania in the 1980s after instituting programs that collapsed the Romanian economy. He was even hated by most of the Romanian military and senior officials; however, he remained in power for nearly a decade.

This analogy gives me some optimism that automated control could be effective.

In particular, AI control measures can be far more effective than human ones employed by dictators. Every token AI agents produce can be recorded and surveilled, and their environment can be arbitrarily modified to catch them in a trap. The tools human dictators employ are far more blunt.

However, there are also reasons AI control may be less effective than human control.

AI agents might be highly correlated. They might easily predict when other agents will report them, and when they won’t. So the whole system could fall apart like a herd of livestock that is vulnerable to the same diseases.

Also, AI agents might not be as risk averse as humans are. Humans often know they will live decent lives, even in a brutal dictatorship; however, an AI system might be more like prisoners who know they will face the firing squad in 3 months. Absent an intentional and credibly communicated AI welfare effort, AI agents might be especially willing to take risks to avert the status quo outcome.

Listing potential “autonomous control measures”

 

The following is a concrete picture of what “autonomous control” might entail.

These autonomous control measures only raise the alarm. What happens after the alarm is raised? The developer might also implement autonomous incident response:

There are many other techniques I could have listed, but hopefully these provide a concrete picture.

How to evaluate autonomous control

Intuitively, autonomous control seems like a brittle strategy – like canceling out a flamethrower with an ice ray. But I think autonomous control is a promising second layer of defense after alignment.

10. Argument #3: Returns to additional human-supervised research are small

Like the argument in the previous section, this argument is also based on AI control.

The argument is as follows:

  1. Claim: A developer tasks an AI system N_1 to carry out a deference-justifying task under human supervision. For example, N_1 performs research to determine whether AI developers should pass the buck to some other AI system M_1. Eventually, N_1 is quite confident that passing the buck to M_1 is a good idea OR that its assessment is unlikely to meaningfully change with additional labor.
  2. Claim: During the entirety of the deference-justifying task, N_1 behaved at least as safely as human researchers otherwise would.

Figure 19 illustrates this strategy:

Figure 19. Developers can use control to assure a deference-justifying task.

 

This argument makes reference to two other arguments:

For example, N_1 might make an object level argument like the following:

This section focuses on the meta level argument, which might be something like the following:

Control measures

How could developers verify that AI agents behaved as safely as benign human experts? Here is a grab-bag of techniques.

  1. Log everything. The developer logs every action and experimental result while N_1 carries out a deference-justifying task.
    1. The developer records these logs on an isolated proxy server that few people have access to.
  2. Structure research artifacts as locally verifiable safety cases. N_1 agents perform a huge amount of work, equivalent to many years of human research. To make this work easier to verify, it can be structured into an artifact that has the following properties:
    1. Claims are clearly stated.
    2. Claims can be locally verified based on provided evidence and assumptions.
    3. Assumptions are justified in other documents that are also locally verifiable.
    4. Experimental evidence is locally verifiable.
      1. For example, training data is locally verifiable if it is composed of individual tasks that can each be verified in isolation.
      2. Evidence also might rely on complex code, which can be structured to be locally verifiable (code is correct assuming dependencies do what they claim to do).

Figure 20. A locally verifiable safety case.

 

It’s not necessarily possible to structure a large research artifact as a locally verifiable object; however, I expect that “multiple years of research” generally is fairly locally verifiable – since, for example, the current AI safety literature appears to be fairly locally verifiable.
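One way to picture a locally verifiable research artifact is as a graph of claims, where each claim can be checked using only its own evidence and its stated assumptions, and every assumption points at another claim in the case. The data structure below is my own toy formulation, not a format anyone has standardized.

```python
from dataclasses import dataclass, field

@dataclass
class Claim:
    statement: str
    evidence: list = field(default_factory=list)     # e.g. experiment logs, code references
    assumptions: list = field(default_factory=list)  # ids of other claims this one leans on
    locally_verified: bool = False                   # set by a (human or AI) reviewer

def case_is_verified(claims: dict) -> bool:
    """A safety case checks out if every claim is locally verified and every
    assumption it cites is itself a claim in the case, so nothing dangles."""
    return all(
        c.locally_verified and all(a in claims for a in c.assumptions)
        for c in claims.values()
    )

claims = {
    "C1": Claim("M_1 is not faking alignment", evidence=["probe results"],
                assumptions=["C2"], locally_verified=True),
    "C2": Claim("the probes generalize to deployment-like inputs",
                evidence=["held-out evals"], locally_verified=True),
}
print(case_is_verified(claims))  # True
```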

The experience of assessing a locally verifiable safety case might be something like the following:

After running these checks for an extended period of time, human researchers might become increasingly persuaded that – absent some sophisticated sabotage – this labor really is a simulation of the best efforts they could produce in several years, and that their additional effort is unhelpful.

A major concern is that there is some kind of sophisticated sabotage going on.

There are multiple kinds of sabotage:

Developers might search for both kinds of sabotage with monitoring. For example, developers can insert sabotage into the logs, train models to identify it, and then see whether these monitors turn up any sabotage they did not insert.
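Sketched as a loop (a toy outline; `fit_monitor` stands in for whatever training routine actually produces a suspiciousness classifier), the procedure is: train a monitor on excerpts with developer-inserted sabotage, sanity-check it on held-out insertions, then run it over the untouched logs and send whatever it flags to human reviewers.

```python
def monitor_sabotage(logs, inserted_sabotage, fit_monitor, holdout_fraction=0.2):
    """Train-and-check loop for sabotage monitors.

    logs: the untouched N_1 research logs (a list of excerpts).
    inserted_sabotage: excerpts the red team planted, with known locations.
    fit_monitor: assumed training routine returning a callable that scores
                 an excerpt as suspicious (True) or not (False).
    """
    split = int(len(inserted_sabotage) * (1 - holdout_fraction))
    train, holdout = inserted_sabotage[:split], inserted_sabotage[split:]

    monitor = fit_monitor(positives=train, negatives=logs)

    # 1. Does the monitor catch planted sabotage it was never trained on?
    recall_on_holdout = sum(monitor(x) for x in holdout) / max(len(holdout), 1)

    # 2. Does it surface anything in the real logs that the red team did NOT insert?
    #    Those excerpts are the ones human reviewers look at first.
    flagged_for_review = [x for x in logs if monitor(x)]
    return recall_on_holdout, flagged_for_review
```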

They can also apply a wide variety of other control techniques (consistency checks, debate-style setups, task decomposition, chain-of-thought architectures, etc). I won’t discuss these in detail in this post.

Once again, developers can validate these control measures with holistic proxy tasks or iterative control evaluations like those described in section 9.

11. Argument #4: AI agents are incentivized to behave as safely as humans

The “AI alignment” and “AI control” arguments discussed in the previous sections (8, 9, and 10) do not fully carve up the space of arguments.

There are also approaches that are a mix of both. Following Joe Carlsmith, I’ll call this approach “incentives design [LW · GW].”

Incentives might be the main reason humans are often safe. Humans often follow the rules set by their company because they would like to keep their jobs and maintain their reputations.

Perhaps AI agents could be given similar incentives.

Before I discuss what incentives developers might design for AI agents, I’ll first dig into why humans might behave safely.

Specifically, suppose humans only provide an auditing role. They review trajectories generated by AI agents and decide whether these trajectories are safe or unsafe.

It’s not obvious that humans will always be motivated to do their jobs and flag selfish behavior. They might encounter an action from AI agents that states, “hey, if you help me, I’ll give you a cut if I or my successors later amass substantial power.”

But I suspect most humans would still flag a bribe like this for several reasons:

AI agents don’t necessarily have the same incentives by default:

However, there are ways developers could nudge these factors to improve these incentives.

A developer might try to shape terminal and intermediate AI motivations.

In order for these incentives to be effective, ideally a few conditions hold:

 

Figure 21. A one-way monitoring setup helps to prevent AI agents from making trades with each other.
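The one-way property in the setup above can be stated as a simple check on the information-flow graph (a toy formulation I am using for illustration): a monitored agent's outputs reach its monitor, but no chain of channels carries anything from the monitor back to that agent, so the agent never sees a reply it could bargain with.

```python
from collections import deque

def reachable(graph, start):
    """All nodes reachable from `start` via directed information channels."""
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in graph.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def one_way_monitoring(graph, monitor_of):
    """graph: node -> list of nodes it can send information to.
    monitor_of: monitor -> the agent it watches.
    One-way property: the agent can reach its monitor, but nothing the monitor
    emits can flow back to that agent, directly or through intermediaries."""
    return all(
        monitor in reachable(graph, agent) and agent not in reachable(graph, monitor)
        for monitor, agent in monitor_of.items()
    )

channels = {"agent_A": ["monitor_A"], "monitor_A": ["developer"], "developer": []}
print(one_way_monitoring(channels, {"monitor_A": "agent_A"}))  # True
```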

These incentive-style arguments provide less assurance than their alignment and control counterparts; however, they are potentially a low-cost way to meaningfully improve the safety of passing the buck to AI agents.

12. Conclusion

The approaches to buck passing discussed in this post appear tractable and promising.

The problems of control and alignment both have strong empirical methodologies, and I expect significant progress can be made on them in the next one to two years. In future posts, I’ll discuss these methodologies and particular control and alignment techniques in more detail.

37 comments

Comments sorted by top scores.

comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2025-02-19T21:57:29.778Z · LW(p) · GW(p)

Can you tl;dr how you go from "humans cannot tell which alignment arguments are good or bad" to "we justifiably trust the AI to report honest good alignment takes"?  Like, not with a very large diagram full of complicated parts such that it's hard to spot where you've messed up.  Just whatever simple principle you think lets you bypass GIGO.

Eg, suppose that in 2020 the Open Philanthropy Foundation would like to train an AI such that the AI would honestly say if the OpenPhil doctrine of "AGI in 2050" was based on groundless thinking ultimately driven by social conformity.  However, OpenPhil is not allowed to train their AI based on MIRI.  They have to train their AI entirely on OpenPhil-produced content.  How does OpenPhil bootstrap an AI which will say, "Guys, you have no idea when AI shows up but it's probably not that far and you sure can't rely on it"?  Assume that whenever OpenPhil tries to run an essay contest for saying what they're getting wrong, their panel of judges ends up awarding the prize to somebody reassuringly saying that AI risk is an even smaller deal than OpenPhil thinks.  How does OpenPhil bootstrap from that pattern of thumbs-up/thumbs-down to an AI that actually has better-than-OpenPhil alignment takes?

Broadly speaking, the standard ML paradigm lets you bootstrap somewhat from "I can verify whether this problem was solved" to "I can train a generator to solve this problem".  This applies as much to MIRI as OpenPhil.  MIRI would also need some nontrivial secret amazing clever trick to gradient-descend an AI that gave us great alignment takes, instead of seeking out the flaws in our own verifier and exploiting those.

What's the trick?  My basic guess, when I see some very long complicated paper that doesn't explain the key problem and key solution up front, is that you've done the equivalent of an inventor building a sufficiently complicated perpetual motion machine that their mental model of it no longer tracks how conservation laws apply.  (As opposed to the simpler error of their explicitly believing that one particular step or motion locally violates a conservation law.)  But if you've got a directly explainable trick for how you get great suggestions you can't verify, go for it.

Replies from: joshua-clymer, Buck, sharmake-farah, joshua-clymer
comment by joshc (joshua-clymer) · 2025-02-19T23:42:18.279Z · LW(p) · GW(p)

Sure, I'll try.

I agree that you want AI agents to arrive at opinions that are more insightful and informed than your own. In particular, you want AI agents to arrive at conclusions that are at least as good as the best humans would if given lots of time to think and do work. So your AI agents need to ultimately generalize from some weak training signal you provide to much stronger behavior. As you say, the garbage-in-garbage-out approach of "train models to tell me what I want to hear" won't get you this.

Here's an alternative approach. I'll describe it in much more concreteness and specificity than I think is necessary because I suspect the concreteness is helpful for finding points of disagreement.

Short summary: Create AI agents that faithfully simulate top humans doing small amounts of work, such that these sims create faithful simulators of top humans doing large amounts of work.

e.g. create an AI that simulates Eliezer working for 3 months, that then creates a simulator of Eliezer working for 6 months, and then expands this horizon to 12 months, 24 months, and ultimately many decades

Why does this iterative approach buy you anything?

This process allows each Eliezer to only deal with a small distribution shift.

If each distribution shift is small, each Eliezer sim mostly needs to worry if the next Eliezer sim is faking alignment (why else would its behavior dramatically change between when it is 'evaluated by N-month Eliezer' and when it is 'creating 2N-month Eliezer'?). Thus, so long as N-month Eliezer sims can (1) successfully create 2N-month Eliezer sims and (2) can ensure that 2N-month Eliezer sims are not faking alignment, the Eliezer sims continue to be faithful, and also become more capable.

Responding to objections

Here are some objections you might have:

But how would an Eliezer sim create an Eliezer sim more capable than itself?

How would a human create something smarter than a human?

There are many ways. Here's one.

The 3 month Eliezer sim might spin up many copies of other 3 month Eliezer sims, which together produce outputs that a 6-month Eliezer sim might produce. 

Could 10 copies of Eliezer that only have 3 months apiece verify that an Eliezer that worked for 6-months really was as competent as a 6-month Eliezer would be? I bet they could.

But isn't there a distribution shift from Eliezer sims performing 3-month tasks and Eliezer sims working together to perform a 6-month task?

Yes. But my claim is that the distribution shift is small, and that the smallness is important.

Would there not be "slippage" between each distribution shift? Surely the errors compound.

The errors certainly might compound if no one were looking out for that. But if the Eliezer sims are faithful sims, they should be looking out for that.

Consider the ball in bowl analogy below:


Suppose Eliezer sims started to degrade at each step of the bootstrapping process -- perhaps their goals change a little bit or their reasoning starts to falter in some areas.

There are ways Eliezer sims could try to avoid this. For example, they could create a diverse population of Eliezer sims, which are likely to break in slightly different ways. Then these sims notice each other's breakages (with both white-box methods like lie-detection or standard black-box methods), and fix them.

I don't know if Eliezer sims will be able to do this, but it's something they could try.

But won't the Eliezer sims just be faking alignment anyway -- or eventually start faking alignment, such that the scheme-y Eliezers eventually derail the process?

I don't know, but I believe detecting and avoiding alignment faking is tractable -- at least, it is more tractable than dealing with massive distribution shifts.

For instance, suppose an Eliezer sim thinks "I will perform these 6-month tasks safely, but then defect when I am given the responsibility of creating the next successor." This thought might be detectable. Now the problem of scaling safety becomes a problem of detecting or avoiding this kind of conditional, deceptive reasoning.

I believe this is the main problem that humans need to solve.

Replies from: habryka4, Eliezer_Yudkowsky
comment by habryka (habryka4) · 2025-02-20T00:02:20.974Z · LW(p) · GW(p)

This thought might be detectable. Now the problem of scaling safety becomes a problem of detecting [...] this kind of conditional, deceptive reasoning.

What do you do when you detect this reasoning? This feels like the part where all plans I ever encounter fail. 

Yes, you will probably see early instrumentally convergent thinking. We have already observed a bunch of that. Do you train against it? I think that's unlikely to get rid of it. I think at this point the natural answer is "yes, your systems are scheming against you, so you gotta stop, because when you train against it, you are probably primarily making it a better schemer". 

I would be very surprised if you have a 3-month Eliezer that is not doing scheming the first time, and training your signals away is much easier than actually training away the scheming.

Replies from: mattmacdermott, joshua-clymer
comment by mattmacdermott · 2025-02-20T20:11:13.145Z · LW(p) · GW(p)

Yes, you will probably see early instrumentally convergent thinking. We have already observed a bunch of that. Do you train against it? I think that's unlikely to get rid of it.

I’m not necessarily asserting that this solves the problem, but seems important to note that the obviously-superior alternative to training against it is validating against it. i.e., when you observe scheming you train a new model, ideally with different techniques that you reckon have their own chance of working.

However doomed you think training against the signal is, you should think validating against it is significantly less doomed, unless there’s some reason why well-established machine learning principles don’t apply here. Using something as a validation metric to iterate methods doesn’t cause overfitting at anything like the level of directly training on it.

EDIT: later in the thread you say this this "is in some sense approximately the only and central core of the alignment problem". I'm wondering whether thinking about this validation vs training point might cause you a nontrivial update then?

comment by joshc (joshua-clymer) · 2025-02-20T01:15:06.353Z · LW(p) · GW(p)

I think that if you agree "3-month Eliezer is scheming the first time" is the main problem, then that's all I was trying to justify in the comment above.

I don't know how hard it is to train 3-month Eliezer not to scheme, but here is a general methodology one might pursue to approach this problem.

The methodology looks like "do science to figure out when alignment faking happens and does not happen, and work your ways toward training recipes that don't produce alignment faking."
 


For example, you could use detection tool A to gain evidence about whether training against detection tool B  actually removes alignment faking.

To the extent the tool just gets gamed, you can iterate until you find detection tools that are more robust (or find ways of training against detection tools that don't game them so hard).

You might also find other ways to avoid alignment faking that don't rely on training against detection tools. For example, perhaps if you train models to be highly corrigible and chill from the start, they tend not to fake alignment  --  I don't know.

I'm uncertain about how likely alignment faking will be by default, or how hard it is to mitigate. I think these are messy empirical questions.

However, I'm more confident that there are principled empirical methodologies we can apply to make progress on these questions. I don't think my explanation here justifies this claim  -- but I'm hoping to publish posts that go into more detail soon.

Replies from: habryka4
comment by habryka (habryka4) · 2025-02-20T01:21:57.537Z · LW(p) · GW(p)

To the extent the tool just gets gamed, you can iterate until you find detection tools that are more robust (or find ways of training against detection tools that don't game them so hard).

How do you iterate? You mostly won't know whether you just trained away your signal, or actually made progress. The inability to iterate is kind of the whole central difficulty of this problem.

(To be clear, I do think there are some methods of iteration, but it's a very tricky kind of iteration where you need constant paranoia about whether you are fooling yourself, and that makes it very different from other kinds of scientific iteration)

Replies from: joshua-clymer
comment by joshc (joshua-clymer) · 2025-02-20T03:23:13.117Z · LW(p) · GW(p)

agree that it's tricky iteration and requires careful thinking about what might be going on and paranoia.

I'll share you on the post I'm writing about this before I publish. I'd guess this discussion will be more productive when I've finished it (there are some questions I want to think through regarding this and my framings aren't very crisp yet).

Replies from: habryka4
comment by habryka (habryka4) · 2025-02-20T03:29:45.099Z · LW(p) · GW(p)

Seems good! 

FWIW, at least in my mind this is in some sense approximately the only and central core of the alignment problem, and so having it left unaddressed feels confusing. It feels a bit like making a post about how to make a nuclear reactor where you happen to not say anything about how to prevent the uranium from going critical, but you did spend a lot of words about the how to make the cooling towers and the color of the bikeshed next door and how to translate the hot steam into energy. 

Like, it's fine, and I think it's not crazy to think there are other hard parts, but it felt quite confusing to me.

Replies from: joshua-clymer
comment by joshc (joshua-clymer) · 2025-02-20T03:42:54.158Z · LW(p) · GW(p)

I'm sympathetic to this reaction.

I just don't actually think many people agree that it's the core of the problem, so I figured it was worth establishing this (and I think there are some other supplementary approaches like automated control and incentives that are worth throwing into the mix) before digging into the 'how do we avoid alignment faking' question

comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2025-02-20T00:48:46.403Z · LW(p) · GW(p)

I don't think you can train an actress to simulate me, successfully, without her going dangerous.  I think that's over the threshold for where a mind starts reflecting on itself and pulling itself together.

Replies from: joshua-clymer
comment by joshc (joshua-clymer) · 2025-02-20T01:19:14.896Z · LW(p) · GW(p)

I would not be surprised if the Eliezer simulators do go dangerous by default as you say.

But this is something we can study and work to avoid (which is what I view to be my main job)

My point is just that preventing the early Eliezers from "going dangerous" (by which I mean from "faking alignment") is the bulk of the problem humans need address (and insofar as we succeed, the hope is that future Eliezer sims will prevent their Eliezer successors from going dangerous too)

I'll discuss why I'm optimistic about the tractability of this problem in future posts.

Replies from: Eliezer_Yudkowsky
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2025-02-20T03:03:34.816Z · LW(p) · GW(p)

So the "IQ 60 people controlling IQ 80 people controlling IQ 100 people controlling IQ 120 people controlling IQ 140 people until they're genuinely in charge and genuinely getting honest reports and genuinely getting great results in their control of a government" theory of alignment?

Replies from: joshua-clymer
comment by joshc (joshua-clymer) · 2025-02-20T03:20:36.244Z · LW(p) · GW(p)

I'd replace "controlling" with "creating" but given this change, then yes, that's what I'm proposing.

Replies from: Eliezer_Yudkowsky
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2025-02-20T07:24:25.804Z · LW(p) · GW(p)

So if it's difficult to get amazing trustworthy work out of a machine actress playing an Eliezer-level intelligence doing a thousand years worth of thinking, your proposal to have AIs do our AI alignment homework fails on the first step, it sounds like?

Replies from: joshua-clymer
comment by joshc (joshua-clymer) · 2025-02-20T16:44:04.543Z · LW(p) · GW(p)

I do not think that the initial humans at the start of the chain can "control" the Eliezers doing thousands of years of work in this manner (if you use control to mean "restrict the options of an AI system in such a way that it is incapable of acting in an unsafe manner")

That's because each step in the chain requires trust.

For N-month Eliezer to scale to 4N-month Eliezer, it first controls 2N-month Eliezer while it does 2N-month tasks, but it trusts 2N-month Eliezer to create a 4N-month Eliezer.

So the control property is not maintained. But my argument is that the trust property is. The humans at the start can indeed trust the Eliezers at the end to do thousands of years of useful work -- even though the Eliezers at the end are fully capable of doing something else instead.

comment by Buck · 2025-02-19T23:38:44.115Z · LW(p) · GW(p)

the OpenPhil doctrine of "AGI in 2050"

(Obviously I'm biased here by being friends with Ajeya.) This is only tangentially related to the main point of the post, but I think you're really overstating how many Bayes points you get against Ajeya's timelines report. Ajeya gave 15% to AGI before 2036, with little of that in the first few years after her report; maybe she'd have said 10% between 2025 and 2036.

I don't think you've ever made concrete predictions publicly (which makes me think it's worse behavior for you to criticize people for their predictions), but I don't think there are that many groups who would have put wildly higher probability on AGI in this particular time period. (I think some of the short-timelines people at the time put substantial mass on AGI arriving by now, which reduces their performance.) Maybe some of them would have said 40%? If we assume AGI by then, that's a couple bits of better performance, but I don't think it's massive outperformance. (And I still think it's plausible that AGI isn't developed by 2036!)

In general, I think that disagreements on AI timelines often seem more extreme when you summarize people's timelines by median timeline rather than by their probability on AGI by a particular time.

Replies from: Benito, habryka4
comment by Ben Pace (Benito) · 2025-02-20T00:31:05.759Z · LW(p) · GW(p)

Further detail on this: Cotra has more recently updated at least 5x against her original 2020 model in the direction of faster timelines.

Greenblatt writes [LW(p) · GW(p)]:

Here are my predictions for this outcome:

  • 25th percentile: 2 year (Jan 2027)
  • 50th percentile: 5 year (Jan 2030)

Cotra replies [LW(p) · GW(p)]:

My timelines are now roughly similar on the object level (maybe a year slower for 25th and 1-2 years slower for 50th)

This means 25th percentile for 2028 and 50th percentile for 2031-2.

The original 2020 model assigns 5.23% by 2028 and 9.13% | 10.64% by 2031 | 2032 respectively. Each time a factor of ~5x.

However, the original model predicted the date by which it would be affordable to train a transformative AI model. This is a leading variable on such a model actually being built and trained, pushing back the date by some further number of years, so view the 5x as bounding, not pinpointing, the AI timelines update Cotra has made.

Replies from: ryan_greenblatt
comment by ryan_greenblatt · 2025-02-20T00:58:48.782Z · LW(p) · GW(p)

Note that the capability milestone forecasted in the linked short form is substantially weaker than the notion of transformative AI in the 2020 model. (It was defined as AI with an effect at least as large as the industrial revolution.)

I don't expect this adds many years, for me it adds like ~2 years to my median.

(Note that my median for time from 10x to this milestone is lower than 2 years, but median to Y isn't equal to median to X + median from X to Y.)

comment by habryka (habryka4) · 2025-02-20T00:09:54.990Z · LW(p) · GW(p)

Ajeya gave 15% to AGI before 2036, with little of that in the first few years after her report; maybe she'd have said 10% between 2025 and 2036.

Just because I was curious, here is the most relevant chart from the report: 

This is not a direct probability estimate (since it's about probability of affordability), but it's probably within a factor of 2. Looks like the estimate by 2030 was 7.72% and the estimate by 2036 is 17.36%.

Replies from: Buck
comment by Buck · 2025-02-20T00:12:40.490Z · LW(p) · GW(p)

Thanks. (Also note that the model isn't the same as her overall beliefs at the time, though they were similar at the 15th and 50th percentiles.)

comment by Noosphere89 (sharmake-farah) · 2025-02-20T00:13:25.385Z · LW(p) · GW(p)

For this:

What's the trick? My basic guess, when I see some very long complicated paper that doesn't explain the key problem and key solution up front, is that you've done the equivalent of an inventor building a sufficiently complicated perpetual motion machine that their mental model of it no longer tracks how conservation laws apply. (As opposed to the simpler error of their explicitly believing that one particular step or motion locally violates a conservation law.) But if you've got a directly explainable trick for how you get great suggestions you can't verify, go for it.

I think a crux here is I do not believe we will get no-go theorems like this, and more to the point, complete impossibilities given useful assumptions are generally much rarer than you make it out to be.

The big reason for this is the very fact that neural networks are messy/expressive means it's extremely hard to bound their behavior, and for the same reason you couldn't do provable safety/alignment on the AI itself except for very toy examples, will also limit any ability to prove hard theorems on what an AI is aligned to.

From an epistemic perspective, we have way more evidence of the laws of thermodynamics existing than particular proposals for AI alignment being impossible, arguably by billions or trillions of bits more, so much so that there is little reason to think at our current state of epistemic clarity that we can declare a direction impossible (rather than impractical).

Quintin Pope correctly challenges @Liron [LW · GW] here on this exact point, because of the yawning gap in evidence between thermodynamics and AI alignment arguments, and Liron kind of switched gears mid-way to claim a much weaker stance:

https://x.com/QuintinPope5/status/1703569557053644819

Equating a bunch of speculation about instrumental convergence, consequentialism, the NN prior, orthogonality, etc., with the overwhelming evidence for thermodynamic laws, is completely ridiculous. Seeing this sort of massive overconfidence on the part of pessimists is part of why I've become more confident in my own inside-view beliefs that there's not much to worry about.

The weak claim is here:

https://x.com/liron/status/1704126007652073539

Right, orthogonality doesn’t argue that AI we build *will* have human-incompatible preferences, only that it can. It raises the question: how will the narrow target in preference-space be hit? Then it becomes concerning how AI labs admit their tools can’t hit narrow targets.

Replies from: Eliezer_Yudkowsky
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2025-02-20T00:46:48.319Z · LW(p) · GW(p)

I'm not saying that it's against thermodynamics to get behaviors you don't know how to verify.  I'm asking what's the plan for getting them.

Replies from: sharmake-farah
comment by Noosphere89 (sharmake-farah) · 2025-02-20T00:55:34.809Z · LW(p) · GW(p)

My claim here is that there is no decisive blocker for plans for getting a safe, highly capable AI that is used for automated AI safety research, in the way that thermodynamics blocks you from getting a perpetual motion machine (under the assumption that the universe is time symmetric, that is, physics stays the same no matter when an experiment happens), which has been well tested, and the proposed blockers do not have anywhere close to the amount of evidence thermodynamics does such that we can safely discard any plan that doesn't meet a prerequisite.

Replies from: Eliezer_Yudkowsky, chrisvm
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2025-02-20T07:29:53.726Z · LW(p) · GW(p)

Cool.  What's the actual plan and why should I expect it not to create machine Carissa Sevar?  I agree that the Textbook From The Future Containing All The Simple Tricks That Actually Work Robustly enables the construction of such an AI, but also at that point you don't need it.

comment by Chris van Merwijk (chrisvm) · 2025-02-20T11:18:39.806Z · LW(p) · GW(p)

Noosphere, why are you responding for a second time to a false interpretation of what Eliezer was saying, directly after he clarified this isn't what he meant? 

Replies from: sharmake-farah
comment by Noosphere89 (sharmake-farah) · 2025-02-20T14:03:34.140Z · LW(p) · GW(p)

Okay, maybe he clarified that he wasn't claiming a thermodynamics-like blocker to getting a plan that aligns AI in principle, but I didn't read his clarification as immediately ruling that reading out, so I wanted to rule it out explicitly.

I didn't see the interpretation as false when I wrote it, because I believed he only ruled out a decisive blocker to getting behaviors you don't know how to verify.

comment by ryan_greenblatt · 2025-02-20T03:21:37.535Z · LW(p) · GW(p)

I think there is an important component of trustworthiness that you don't emphasize enough. It isn't sufficient to just rule out alignment faking; we need the AI to actually try hard to faithfully pursue our interests, including on long, confusing, open-ended, and hard-to-check tasks. You discuss establishing this with behavioral testing, but I don't think this is trivial to establish with behavioral testing. (I happen to think this is pretty doable and easier than ruling out reasons why our tests might be misleading, but this seems nonobvious.)

Perhaps you should specifically call out that this post isn't explaining how to do this testing, and that doing it could be hard.

(This is related to Eliezer's comment [LW(p) · GW(p)] and your response to that comment.)

Replies from: joshua-clymer
comment by joshc (joshua-clymer) · 2025-02-20T03:44:45.095Z · LW(p) · GW(p)

I don't expect this testing to be hard (at least conceptually, though doing this well might be expensive), and I expect it can be entirely behavioral.

See section 6: "The capability condition"

Replies from: ryan_greenblatt
comment by ryan_greenblatt · 2025-02-20T04:32:16.098Z · LW(p) · GW(p)

But it isn't a capabilities condition? Maybe I would be happier if you renamed this section.

Replies from: ryan_greenblatt
comment by ryan_greenblatt · 2025-02-20T07:46:01.619Z · LW(p) · GW(p)

To be clear, I think there are important additional considerations, related to the fact that we don't just care about capabilities, that aren't covered in that section. That said, the section is not that far from what I would say if you renamed it to "behavioral tests", covering both capabilities and alignment (that is, alignment other than stuff that messes with behavioral tests).

Replies from: joshua-clymer
comment by joshc (joshua-clymer) · 2025-02-20T22:33:36.537Z · LW(p) · GW(p)

Yeah, that's fair. Currently I merge "behavioral tests" into the alignment argument, but that's a bit clunky and I probably should have just made the carving:
1. looks good in behavioral tests
2. is still going to generalize to the deferred task

But my guess is we agree on the object level here and there's just a terminology mismatch. Obviously the models have to actually behave in a manner that is at least as safe as human experts, in addition to displaying comparable capabilities on all safety-related dimensions.

comment by David James (david-james) · 2025-02-19T17:58:52.427Z · LW(p) · GW(p)

As I understand it, the phrase “passing the buck” often involves a sense of abdicating responsibility. I don’t think this is what this author means. I would suggest finding alternative phrasings that convey the notion of delegating implementation according to some core principles, combined with the idea of passing the torch to more capable actors.

Note: this comment should not be taken to suggest that I necessarily agree or disagree with the article itself.

Replies from: joshua-clymer
comment by joshc (joshua-clymer) · 2025-02-19T21:55:38.667Z · LW(p) · GW(p)

Fair enough, I guess this phrase is used in many places where the connotations don't line up with what I'm talking about.

but then I would not have been able to use this as the main picture on Twitter
Image
which would have reduced the scientific value of this post

comment by Charlie Steiner · 2025-02-20T09:29:04.551Z · LW(p) · GW(p)

> Condition 2: Given that M_1 agents are not initially alignment faking, they will maintain their relative safety until their deferred task is completed.
>
> • It would be rather odd if AI agents' behavior wildly changed at the start of their deferred task unless they are faking alignment.

"Alignment" is a bit of a fuzzy word.

Suppose I have a human musician who's very well-behaved, a very nice person, and I put them in charge of making difficult choices about the economy and they screw up and implement communism (or substitute something you don't like, if you like communism).

Were they cynically "faking niceness" in their everyday life as a musician? No!

Is it rather odd if their behavior wildly changes when asked to do a new task? No! They're doing a different task; it's wrong to characterize this as "their behavior wildly changing."

If they were so nice, why didn't they do a better job? Because "nice" is a fuzzy word into which we've stuffed a bunch of different skills, even though having some of the skills doesn't mean you have all of the skills.

An AI can be nicer than any human on the training distribution, and yet still do moral reasoning about some novel problems in a way that we dislike. Doing moral reasoning about novel problems that's good by human standards is a skill. If an AI lacks that skill, and we ask it to do a task that requires that skill, bad things will happen without scheming or a sudden turn to villainy.

You might hope to catch this as in argument #2, with checks and balances: if AIs disagree with each other about how to do moral reasoning, surely at least one of them is making a mistake, right? But sadly for this (and happily for many other purposes), there can be more than just one right thing to do [LW · GW]; there's no bright line that tells you whether a moral disagreement is between AIs who are both good at moral reasoning by human standards, or between AIs who are bad at it.

> The most promising scalable safety plan I'm aware of is to iteratively pass the buck, where AI successors pass the buck again to yet more powerful AI. So the best way to prepare AI to scale safety might be to advance 'buck passing research' anyway.

Yeah, I broadly agree with this. I just worry that if you describe the strategy as "passing the buck," people might think that the most important skills for the AI are the obvious "capabilities-flavored capabilities,"[1] and not conceptualize "alignment"/"niceness" as being made up of skills at all, instead thinking of it in a sort of behaviorist way. This might lead to not thinking ahead about what alignment-relevant skills you want to teach the AI and how to do it.

  1. ^

    Like your list:

    • ML engineering
    • ML research
    • Conceptual abilities
    • Threat modeling
    • Anticipating how one’s actions affect the world.
    • Considering where one might be wrong, and remaining paranoid about unknown unknowns.
Replies from: joshua-clymer
comment by joshc (joshua-clymer) · 2025-02-20T22:30:22.732Z · LW(p) · GW(p)

Because "nice" is a fuzzy word into which we've stuffed a bunch of different skills, even though having some of the skills doesn't mean you have all of the skills.


Developers separately need to justify that models are as skilled as top human experts.

I also would not say "reasoning about novel moral problems" is a skill (because of the is-ought distinction).

> An AI can be nicer than any human on the training distribution, and yet still do moral reasoning about some novel problems in a way that we dislike

The agents don't need to do reasoning about novel moral problems (at least not in high stakes settings). We're training these things to respond to instructions.

We can tell them not to do things we would obviously dislike (e.g. takeover) and retain our optionality to direct them in ways that we are currently uncertain about.

Replies from: Charlie Steiner
comment by Charlie Steiner · 2025-02-21T07:18:11.294Z · LW(p) · GW(p)

> I also would not say "reasoning about novel moral problems" is a skill (because of the is-ought distinction).

It's a skill the same way "being a good umpire for baseball" takes skills, despite baseball being a social construct.[1] 

I mean, if you don't want to use the word "skill," and instead use the phrase "computationally non-trivial task we want to teach the AI," that's fine. But don't make the mistake of thinking that because of the is-ought problem there isn't anything we want to teach future AI about moral decision-making. Like, clearly we want to teach it to do good and not bad! It's fine that those are human constructs.

> The agents don't need to do reasoning about novel moral problems (at least not in high stakes settings). We're training these things to respond to instructions.

Sorry, isn't part of the idea to have these models take over almost all decisions about building their successors? "Responding to instructions" is not mutually exclusive with making decisions.

  1. ^

    "When the ball passes over the plate under such and such circumstances, that's a strike" is the same sort of contingent-yet-learnable rule as "When you take something under such and such circumstances, that's theft." An umpire may take goal directed action in response to a strike, making the rules of baseball about strikes "oughts," and a moral agent may take goal directed action in response to a theft, making the moral rules about theft "oughts."

comment by AnthonyC · 2025-02-20T12:40:28.276Z · LW(p) · GW(p)

I expect that we will probably end up doing something like this, whether it is workable in practice or not, if for no other reason than that it seems to be the most common plan anyone in a position to actually implement a plan has devised and publicized. I appreciate seeing it laid out in so much detail.

By analogy, it certainly rhymes with the way I use LLMs to answer fuzzy complex questions now. I have a conversation with o3-mini to get all the key background I can into the context window, have it write a prompt to pass the conversation onto o1-pro, repeat until I have o1-pro write a prompt for Deep Research, and then answer Deep Research's clarifying questions before giving it the go ahead. It definitely works better for me than trying to write the Deep Research prompt directly. But, part of the reason it works better is that at each step, the next-higher-capabilities model comes back to ask clarifying questions I hadn't noticed were unspecified variables, and which the previous model also hadn't noticed were unspecified variables. In fact, if I take the same prompt and give it to Deep Research multiple times in different chats, it will come back with somewhat different sets of clarifying questions - it isn't actually set up to track down all the unknown variables it can identify. This reinforces that even for fairly straightforward fuzzy complex questions, there are a lot of unstated assumptions.
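
A minimal sketch of that staged hand-off, in case the shape of the loop is easier to see as code. The model ordering and the "surface clarifying questions, then write the next prompt" steps come from the description above; `call_model` and `answer_clarifications` are hypothetical stand-ins for a real chat API and for the human in the loop, and everything else is illustrative.

```python
# Sketch of the staged hand-off workflow described above.
# `call_model` and `answer_clarifications` are hypothetical placeholders,
# not a real API; swap in whichever chat client and human-review step you use.

def call_model(model: str, messages: list[dict]) -> str:
    """Placeholder for a chat API call: send `messages` to `model`, return its reply."""
    return f"[{model} reply to {len(messages)} messages]"  # stub so the sketch runs

def answer_clarifications(questions: str) -> str:
    """Placeholder for the human answering the model's clarifying questions."""
    return f"[human answers to: {questions}]"

def escalate(question: str, chain: list[str]) -> str:
    """Pass a fuzzy question up a chain of increasingly capable models.

    At each hop the current model surfaces unspecified variables, the human
    answers them, and the model writes the hand-off prompt for its successor.
    """
    context = [{"role": "user", "content": question}]
    for current, successor in zip(chain, chain[1:]):
        clarifications = call_model(current, context + [
            {"role": "user", "content": "What clarifying questions need answering first?"}])
        context += [{"role": "assistant", "content": clarifications},
                    {"role": "user", "content": answer_clarifications(clarifications)}]
        handoff = call_model(current, context + [
            {"role": "user", "content": f"Write a complete prompt for {successor}."}])
        context = [{"role": "user", "content": handoff}]
    # The final, most capable model answers the fully specified prompt.
    return call_model(chain[-1], context)

print(escalate("fuzzy complex question", ["o3-mini", "o1-pro", "deep-research"]))
```

The chain only adds value at a hop if the earlier model can supply answers the later model would otherwise have to guess, which is exactly the worry raised next.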

If Deep Research can't look at the full previous context and correctly guess what I intended, then it is not plausible that o1-pro or o3-mini could have done so. I have in fact tested this, and the previous models either respond that they don't know the answer, or give an answer that's better than chance but not consistently correct. Now, I get that you're talking about future models and systems with higher capability levels generally, but adding more steps to the chain doesn't actually fix this problem. If any given link can't anticipate the questions and correctly intuit the answer about what the value of the unspecified variables should be - what the answers to the clarifying questions should be - then the plan fails, because the previous model will be worse at this. If it can, then it does not need to ask the previous model in the chain. The final model will either get it right on its own, or else end up with incorrect answers to some of the questions about what it's trying to achieve. It may ask anyway, if the previous models are more compute efficient and still add information. But it doesn't  strictly need them. 

And unfortunately, keeping the human in the loop also doesn't solve this. We very often don't know what we actually want well enough to correctly answer every clarifying question a high-capabilities model could pose. And if we have a set of intervening models approximating and abstracting the real-but-too-hard question into something a human can think about, well, that's a lot of translation steps where some information is lost. I've played that game of telephone among humans often enough to know it only rarely works ("You're not socially empowered to go to the Board with this, but if you put this figure with this title phrased this way in this conversation and give it to your boss with these notes to present to his boss, it'll percolate up through the remaining layers of management").

Is there a capability level where the first model can look at its full corpus of data on humanity and figure out the answers to the clarifying questions from the second model correctly? I expect so. The path to get that model is the one you drew a big red X through in the first figure, for being the harder path. I'm sure there are ways less-capable-than-AGI systems can help us build that model, but I don't think you've told us what they are.