How might we safely pass the buck to AI?
post by joshc (joshua-clymer) · 2025-02-19T17:48:32.249Z · LW · GW · 37 comments
My goal as an AI safety researcher is to put myself out of a job.
I don’t worry too much about how planet sized brains will shape galaxies in 100 years. That’s something for AI systems to figure out.
Instead, I worry about safely replacing human researchers with AI agents, at which point human researchers are “obsolete.” The situation is not necessarily fine after human obsolescence; however, the bulk of risks that are addressable by human technical researchers (like me) will have been addressed.
This post explains how developers might safely “pass the buck” to AI.
I first clarify what I mean by “pass the buck” (section 1) and explain why I think AI safety researchers should make safely passing the buck their primary end goal – rather than focus on the loftier ambition of aligning superintelligence (section 2).
Figure 1. A summary of why I think human AI safety researchers should focus on safely replacing themselves with AI (“passing the buck”) instead of directly creating safe superintelligence.
Then I discuss a few different strategies for passing the buck (section 3). Each strategy defers some of the remaining safety work (called a “deferred task”) to an AI system M_1:
- Strategy 1: One-time hand off. Construct M_1 that completes the whole task of creating safe superintelligence.
- Strategy 2: Iterated hand off. Construct M_1 that passes the buck to a more capable AI system M_2.
- Strategy 3: Deference to an AI advisor. Construct an advisor M_1. Its deferred task is to say whether M_1 itself should be trusted with one of the tasks above.
Figure 2. An illustration of three strategies for passing the buck to AI.
Handing over the reins of AI development to AI agents sounds dangerous. When does passing the buck improve safety?
Suppose a developer must choose between two options:
- Pass the buck. The developer trusts M_1 to autonomously complete some deferred task from the aforementioned strategies (e.g. ‘make a more capable successor M_2’).
- Maintain human oversight. Alternatively, the developer can continue to rely on human oversight to assure safety.
Passing the buck is the safer option if both of the following conditions hold:
- The capability condition: M_1 agents are more capable than the humans they replace on all dimensions that pertain to the deferred task. Researchers use M_1 for several months and M_1’s behavior is more competent, more reliable, more ‘wise,’ than that of human staff, even under aggressive adversarial testing. The developer also runs targeted evaluations of M_1, for example, removing AI safety research from 2024 from its training data and asking it to re-discover 2024 AI safety research results. M_1 consistently exceeds human performance.
- The trust condition: M_1 agents are more trustworthy than the humans they replace. The AI agents are approximately as safe as they appear to be, such that they will probably remain as safe as humans for the full duration of the deferred task.
Of these two criteria, I expect the trust condition to be more difficult to verify. I discuss four arguments that can be combined to justify the trust condition:
- Argument #1: M_1 agents are approximately aligned and will maintain their alignment until they’ve completed their deferred task.
- Argument #2: M_1 agents cannot subvert autonomous control measures while they complete the deferred task.
- Argument #3: Control measures previously applied during AI-assisted research indicate that returns to additional human-supervised research on whether to pass the buck are small.
- Argument #4: AI agents have incentives to behave as safely as humans would during the deferred task.
Figure 3. An illustration of how the four arguments relate to each other.
Argument #1: M_1 agents are approximately aligned and will maintain their alignment until they’ve completed their deferred task.
Figure 4. Illustration of argument #1.
- Humans are not particularly aligned with the interests of broader society. They are some mixture of earnestly trying to do their jobs well and sycophantically seeking approval from their peers. So AI agents don’t have to be particularly aligned either to be more safe than humans.
- M_1 is more trustworthy than humans under two conditions:
- Condition 1: M_1 agents are not initially ‘alignment faking.’ The agents are not intelligently hiding misalignment to appear dramatically more safe than they are (for example M_1 agents are not scheming or egregiously reward hacking [AF · GW]).
- Condition 2: Given that M_1 agents are not initially alignment faking, they will maintain their relative safety until their deferred task is completed.
- It would be rather odd if AI agents' behavior wildly changed at the start of their deferred task unless they are faking alignment.
- So if they are not faking alignment, the agents will probably start out behaving in a safe, well-intentioned way.
- Part of behaving safely is trying to maintain safe behavior.
- For example, M_1 agents might constantly interrogate each other, and use mind-reading tools to determine if any of their peers are drifting from safe goals. A developer needs to argue that whatever techniques M_1 agents apply to preserve their safety will be as effective as the techniques that developers would otherwise apply if humans remained in the loop.
- If these techniques for maintaining safety are relatively effective, then there’s an argument by induction that they will preserve their own effectiveness, such that they remain relatively effective for the whole duration of the deferred task (see section 8 for more detail).
Figure 5. A full alignment-style argument that M_1 agents are more safe than human researchers (section 8).
Argument #2: M_1 agents cannot subvert autonomous control measures while they complete the deferred task.
The previous argument claimed that M_1 agents are roughly aligned. This argument makes no such claim. Instead, it suggests M_1 agents will remain safe during the deferred task because of autonomous control measures that are effective despite egregious misalignment (section 9).
Figure 6. An illustration of argument #2.
One argument that autonomous AI control might be effective is that analogous measures help to control human societies. For instance, some (though not most) dictators are almost universally hated and retain power for many years even though nearly everyone would prefer to replace them. M_1 agents might also keep each other safe even if all of the agents involved disprefer the status quo.
In particular, a developer might set up M_1 agents in some initial checks-and-balances structure that is hard for them to break out of. Before M_1 agents can coordinate to break their self-imposed chains, perhaps they will be forced to pass the buck to more robustly aligned AI agents.
What might these checks and balances actually look like? Figure 7 illustrates an example.
Figure 7. An autonomous control setup that maintains a multi-party-integrity property. Under particular conditions (section 9), if some models begin to misbehave but others don’t, alarms will go off. So models have to navigate a challenging coordination problem to defect silently.
Autonomous control setups will eventually break as AI agents become more capable. How would developers know when autonomous control can no longer be relied on?
- One method is to perform a holistic control evaluation. A developer constructs a proxy for the deferred task (e.g. “make 3 years of progress on string theory starting from 2021 data”), implements control measures, and attempts to elicit the ability of M_1 agents to sabotage the whole proxy task.
- However, performing such an experiment would be extremely expensive (it would require several years of AI labor in this case).
- Alternatively, a developer can set up AI agents to run control evaluations continuously in an inductive fashion:
- At T = 0, M_1 agents run robust control evaluations.
- Given that M_1 agents run robust control evaluations at T = N, M_1 agents will also run robust control evaluations at T = N + 1.
Once these control evaluations sound the alarm, a developer knows it cannot rely on autonomous control anymore.
Argument #3: Control measures previously applied during AI-assisted research indicate that returns to additional human-supervised research on whether to pass the buck are small.
This argument also relies on control rather than alignment.
One difference between argument #3 and argument #2 is that argument #3 does not rely on autonomous control. Instead it relies on normal control, where humans can be in the loop. This control is applied before the developer entrusts M_1 with a deferred task.
Figure 8. An illustration of argument #3.
Specifically, these control measures verify that AI agents have already performed a significant amount of human-competitive research on whether to pass the buck to AI. The volume of this research is large enough that the returns to additional human-supervised research are marginal.
As an illustrative example, suppose a developer trains some AI system N_1 to simulate particular people on their safety research teams. N_1 simulates a Collin Burns agent, a Ryan Greenblatt agent, etc. These agents run thousands of experiments, accumulating progress equivalent to several years of human work. At the end of this labor, the agents conclude the developer should pass the buck to some other AI system M_1, and additional labor is unlikely to meaningfully change their assessment.
If these simulations are in fact faithful, then the developer is done. It should defer to the simulated researchers, since their work already captures whatever real humans could achieve by performing this research themselves.
The question is whether the simulations are in fact faithful, which can be answered entirely with control measures. Suppose the developer logged all of N_1’s actions. These logs can be carefully examined (with fancy oversight techniques) to verify that N_1 agents did in fact behave like benign humans the whole time.
Human researchers don't need to read all of these logs. Instead, they only need to sample bits and pieces. For example, their experience might be something like the following:
“This high level argument sounds reasonable, but did the AI agents think of objection x?”
(an AI agent directs the human to the relevant section of the logs)
“Ah, they did.”
“But did they test whether y was true?”
“Ah, yeah they thought of that too.”
…
This process might help to persuade human overseers that the returns to their additional effort really are marginal.
Furthermore, human attention might be drawn to the most suspicious parts of the logs with monitors, and they might make sense of what they read more quickly with tools like debate.
As in argument #2, developers must validate these control measures with a control evaluation, where a red team inserts errors or sabotage and the developer confirms that the control measures surface them.
Figure 9. The ‘read bits and pieces’ approach to verifying large research artifacts is only viable if these artifacts are locally verifiable. In the illustration above, developers structure research as a locally verifiable safety case, where every claim relies on clearly stated assumptions, which are also justified with locally verifiable artifacts.
Argument #4: AI agents have incentives to behave as safely as humans would during the deferred task.
Figure 10. An illustration of argument #4.
To a large extent, humans are misaligned with the interests of broader society but are nonetheless incentivized to retain human control. Developers could try to give AI agents similar incentives.
Argument #4 relies on three claims:
- Claim: AI agents are incentivized to maintain human control with punishments and rewards of a similar magnitude as those applied to human technical staff.
- Claim: Developers have credibly communicated these incentives.
- Claim: If AI agents are misaligned, they will likely be misaligned in ways that respond to these incentives as well as humans typically do.
These incentives might include the following (see section 11 for others).
- Equity. The AI developer publicly commits to distribute the benefits of AI to early internal AI agents (as if they are human employees who receive equity).
- Bonuses. The developer (or autonomous control mechanisms) reward AI agents with digital currencies for catching the misalignment of other agents. These currencies will eventually be exchangeable for real money.
- Post-hoc rewards. Developers log all AI behavior so that future aligned and wise successors can fairly reward them for their individual contributions.
These incentives require that AI models have decorrelated goals. That way, models don’t clearly benefit if their peers ultimately acquire power by dangerous means, and are more likely to be motivated to stop their AI peers from pursuing such ambitions.
Figure 11. An example of a simple incentive setup where developers train models to have decorrelated goals and restrict communication to prevent them from making deals.
Incentivizing AI agents has many pitfalls (see section 11), but I think incentives are worth applying as an additional layer of defense.
Conclusions. I’m optimistic about the tractability of these approaches, such that, at the current rate of research progress, I expect AI developers will be able to make strong arguments for passing the buck to AI agents if they need to do so in a couple of years.
1. Briefly responding to objections
- Passing the buck is like “hiring for a job you don’t know how to do.”
- Some of the strategies I discuss sound like “hiring a recruiter to hire a recruiter to hire a project manager,” which is probably an inefficient way to handle recruitment.
- But I think this comparison is misleading. In everyday life, people can’t delegate tasks to collaborators who share their skills and context.
- AI systems will have strictly more skills and context than humans have. Imagine you can delegate research tasks to the best human researchers you know. You might trust these researchers even more than you trust yourself.
- Developers will be in a similar position with AI. Researchers will be proven wrong over and over again by their AI agents – to the point where it's clear the AI agents are more competent than them. See section 6 for other evidence that could justify the capability condition.
- AI agents might have very little time to scale safety to superhuman capabilities. So humans should spend most of their effort on scalable safety, rather than ‘passing the buck.’
- I address this objection in section 3.
- We have much more data on humans than on AI, so our standards of evidence for AI safety should be higher than they are for human safety.
- I agree. It’s hard to know if AI agents have some flaw that’s hard to notice (e.g. AI agents don’t care about how their actions affect the broader world), that might be catastrophic.
- However, I’m not arguing that developers should apply the same safety checks to AI systems as the ones that are applied to humans. I’m arguing that AI agent behavior needs to be as safe as human behavior during a specific deferred task.
- The checks we perform on AI should probably be much more strenuous than those we perform on humans.
- This perspective does not account for the benefits of buck passing for speeding up (responsible) AI development.
- That’s correct. I only consider whether passing the buck minimizes risks from a developer's own AI systems. I don’t perform a complete cost-benefit analysis.
- Autonomous control seems like a house of cards. I agree. I’d much rather justify buck-passing with alignment; however, autonomous control might be helpful as an extra layer of defense, and is worth investigating further.
If you still have objections or confusions, they might be answered in the remainder of the post.
2. What I mean by “passing the buck” to AI
I don’t believe there is a discrete moment when “humans are mostly in the loop” and then “humans are mostly out of the loop.”
Instead, the involvement of human researchers will decrease continuously (figure 12).
There are multiple places along this continuous decline of human technical influence where one could say an AI company has “passed the buck” to AI.
Figure 12. The importance of human researchers for safety declines continuously until full obsolescence.
Imagine that you are an AI researcher, and AI agents perform 10,000 four-hour research tasks every 24 hours. You might still provide oversight in the sense that you have to hit the ‘approve’ button on deployments, but eventually, it’s clear you are a rubber stamp. You cannot keep up. If the agents were trying to compromise your research cluster or manufacture fake safety results, you would not know.
Thus, even if human researchers still have influence on paper, developers may have de facto passed the buck well before that influence formally ends.
I’ll say a developer ‘passed the buck to AI’ if (1) AI agents perform most internal AI software R&D and (2) human oversight over AI agents has little bearing on the safety of the situation.
After this threshold, the primary approach for ensuring safety changes. It changes from “humans will notice if something fishy is going on and correct it” to some other approach that allows for weak human oversight.
This post explores what this alternative approach to ensuring safety might be.
3. Why focus on “passing the buck” rather than “aligning superintelligence”?
I imagine many readers might object to the “pass the buck” framing. Isn’t the ultimate goal of technical safety to scale alignment to superintelligence?
Yes, developing safe superintelligence is my ultimate goal; however, my primary instrumental focus is to pass the buck to AI.
Before explaining why passing the buck is a good intermediate to focus on, I should first justify why ‘safe superintelligence’ is the ultimate target I’m aiming for.
I’m concerned about superintelligence because I think the most extreme risks will come from the most capable systems, and I believe superintelligent systems could plausibly emerge soon. For example, superintelligence might end the lives of billions of people with a novel bioweapon, or permanently concentrate geopolitical power into the hands of a handful of tech or political bosses. I believe these risks are plausible and proximate enough that they dominate the risks from weaker AI systems.
Therefore, insofar as the task of AI safety researchers is to “make AI safe,” I believe the top priority should be to make superintelligence safe.
However, the best path to achieving this goal might not be the most direct.
- There’s a direct path where humans do all the work to create safe superintelligence.
- There’s also an indirect path where humans pass the buck to AI systems that finish the job.
Figure 13. Two paths to superintelligence.
The indirect path is probably easier than the direct one. Weaker AI systems are easier to evaluate than superintelligent ones, so it's easier to train them to behave as intended than it is to train superintelligent agents to do what we want.
This argument suggests humans should pass the buck to AI agents eventually; however, it does not imply that passing the buck should be the primary target of preparation. Alternatively, human researchers might try to give their AI successors a ‘head start’ on scaling safety further toward superintelligence.
Figure 14. Maybe humans should still prepare research that will help AI agents scale safety to superintelligence? (spoiler: for the most part, probably no).
One argument for focusing on this “second leg” of the path to scaling safety is that our AI successors will have little time to do research themselves. If AI capabilities take off quickly, then maybe the AI we pass the buck to will only have a few months to create safe superintelligence.
In contrast, human researchers might have several years to prepare. If scalable safety is the bulk of the work, and most of the work will be done by humans, perhaps human researchers should mostly focus directly on preparing to create safe superintelligence.
I’m not going to entirely dismiss this argument. The ultra-fast-takeoff scenario is conceivable, such that I think some people should work on scalable AI safety — but this scenario is not what I expect.
I anticipate that most of the work of scaling safety will be done by AI. Even if AI agents only have 6 months to develop safe superintelligence, supposing that AI systems run at > 10x speed, their research might be comparable to many years of human work (naively, 6 months × 10 = 60 months = 5 years). Furthermore, AI agents have the advantage of working with the AI systems they want to align, so human preparation might be substantially less efficient than AI labor performed later on.
Another reason to focus on passing the buck is that this research might also help AI agents scale safety further. The most promising scalable safety plan I’m aware of is to iteratively pass the buck, where AI successors pass the buck again to yet more powerful AI. So the best way to prepare AI to scale safety might be to advance ‘buck passing research’ anyway.
Figure 15. A summary of my argument that AI safety researchers should focus on passing the buck to AI.
4. Three strategies for passing the buck to AI
Passing the buck to AI could involve different approaches, which I discuss in this section.
An AI developer trying to automate itself is like a CEO trying to find someone to manage an important project. The CEO could:
- Hire an experienced candidate to oversee the whole endeavor.
- Hire an experienced recruiter to hire a project manager (or the recruiter could hire another recruiter).
- Ask a credible friend whether they themselves would be a good fit for the role.
Similarly, an AI developer could:
- Construct an AI system M_1 that completes the whole task of creating safe superintelligence (strategy 1).
- Construct an AI system M_1 that passes the buck to a more capable AI system M_2 (strategy 2).
- Construct an advisor M_1 and ask it whether developers should pass the buck to M_1 itself (strategy 3) (which could involve running experiments and not just answering a yes or no question).
Each of these strategies entrusts some “deferred task” to a multi-agent AI system M_1. For example, strategy 1 entrusts M_1 with the deferred task of constructing safe superintelligence. Strategy 2 entrusts M_1 with the deferred task of constructing a more capable AI successor M_2.
Each of these tasks contains a different slice of the remaining work to construct safe superintelligence, and completing any one of them is sufficient to entrust the rest of development to AI.
Figure 16. An illustration of the “deferred tasks” that correspond to each of three strategies for passing the buck to AI.
In principle, any of these strategies could be individually sufficient to safely pass the buck to AI agents, so why list multiple strategies?
Ideally, developers combine buck passing strategies for additional layers of defense.
For instance, the third strategy simply asks M_1 if developers should defer to M_1. Even if M_1 has excellent forecasting abilities and judgment, and is likely honest, the developer does not know what considerations M_1 should actually think through to determine whether the developer should pass the buck.
Ideally, human researchers have an object level understanding of what completing a deferred task would actually entail, and this object level understanding generally requires thinking a bit further down the game tree, and considering more ambitious deferred tasks.
5. Conditions that imply that passing the buck improves safety
I’m not interested in the safety of passing the buck in some absolute sense. Assessing the absolute level of risk would be difficult. What if AI agents run into challenging safety problems, and don’t have enough time to work them out?
Instead, I’m only interested in whether passing the buck improves safety over the alternative where humans continue to provide oversight.
Specifically, a developer might compare two options:
- Entrust M_1 to autonomously complete some deferred task.
- Continue to carry out this task with human oversight.
So which option is more safe?
The knee-jerk reaction of many people might be that ‘preserving human oversight’ is always more safe than removing humans from the loop.
But that’s not the case. Human overseers might leak algorithmic secrets. They might poison training data, collude with misaligned agents, or simply make mistakes and contribute to alarm fatigue.
So removing human oversight might actually be the responsible course of action.
In particular, cutting out humans is responsible under two sufficient conditions:
- M_1 is more capable than the human-oversight preserving alternative.
- M_1 is more trustworthy than the human-oversight preserving alternative.
These conditions are hard to assess.
M_1 and the human-oversight preserving alternative are complicated multi-agent systems. How trustworthy are humans given the law enforcement that holds them accountable, and the voters that hold governments who hold law enforcement accountable? This analysis gets complicated fast.
Instead of comparing M_1 and the alternative system directly, I’ll make a simplifying assumption. I’ll assume that M_1 is the exact same as the “human-oversight preserving alternative” but where the humans are substituted with AI agents that are drop-in replacements for humans. As a result, I only need to compare the safety of the behavior of the drop-in AI agents with the behavior of the humans they replace.
Now the conditions above can be rephrased:
- The capability condition: M_1 agents are more capable than the humans they replace on all dimensions that pertain to the deferred task.
- The trust condition: M_1 agents are more trustworthy than the humans they replace – meaning they are approximately as safe as they appear to be, such that they remain more safe than human researchers while carrying out the deferred task.
The next sections explore when these conditions are satisfied.
6. The capability condition
There are many safety-critical skills AI agents might need to have to be trusted with AI development:
- ML engineering
- ML research
- Conceptual abilities
- Threat modeling
- Anticipating how one’s actions affect the world.
- Considering where one might be wrong, and remaining paranoid about unknown unknowns.
- etc
In principle, M_1 agents don’t need all of these skills. Humans can supplement the skills M_1 agents lack. It’s also possible that M_1 agents will only have some of these to begin with and then bootstrap from there to improve their abilities in other areas.
However, for simplicity, I’ll assume a developer creates a full replacement for human oversight.
How might developers assess these skills? They might take inspiration from how they assess human employees:
- Interviews probe how prospective employees think.
- Work tests evaluate how candidates operate in realistic working conditions over (short) spans of time.
- Resume screens assess what applicants can accomplish over long time horizons.
Developers might evaluate AI agents with analogous methods:
- AI ‘interviews’ (conversational evals)
- Experts can understand how AI agents think and make decisions by talking to them. Such conversations will happen naturally in the course of asking internal AI agents for advice.
- One reason I expect this source of evidence to be important is that the most capable humans I know are also great talkers. They can wrap their minds around my arguments and fire shots back at me faster than I can respond. I don’t think this is a coincidence, and I expect that a lot of our intuitive assessment of AI capabilities will come from conversations.
- AI ‘work tests’ (observed internal performance)
- Companies will naturally deploy AI agents and try to use them in their workflows. As a result, they should easily determine whether AI agents outperform humans on short timescales (how much do humans contribute, versus simply accepting the suggestions of AI agents?)
- These assessments are like “work tests” or “residencies” and provide highly representative evidence of AI performance.
- AI ‘resume screens’ (targeted long-horizon assessments)
- A key difficulty of evaluating AI agents is that it’s hard to tell what they’ll do over long time frames. Human employees often have track records that extend backwards by years. Newly trained AI agents have no such track record.
- One solution to the time-horizon problem is to hold out tasks from training data. For example, a developer might remove all ML capabilities and safety research from the years 2024 and 2025 from training. Then the developer might prompt AI agents to make research progress in these areas. If agents reproduce many of the same research results, but much more quickly than humans originally did, this provides some signal on their ability to compete with humans over long time horizons.
- The primary downside of these evaluations is that they could be expensive, so probably only a few of them can be performed.
Checking whether AI agents are competitive with human capabilities only requires behavioral assessments. As a result, I expect that justifying the capability condition will be relatively straightforward.
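To make the ‘resume screen’ style evaluation a bit more concrete, here is a minimal sketch of a held-out-task harness. The agent and judge interfaces, the task fields, and the scoring are all hypothetical placeholders rather than a real API, and the aggregation is deliberately simplistic.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class HeldOutTask:
    """A research result removed from training data that the agent must rediscover."""
    name: str
    prompt: str          # e.g. "Starting only from pre-2024 literature, investigate ..."
    known_result: str    # the result human researchers originally found
    human_months: float  # how long the original discovery took humans

def evaluate_long_horizon(
    agent: Callable[[str], str],        # hypothetical interface: prompt -> research report
    judge: Callable[[str, str], bool],  # does the report reproduce the known result?
    tasks: List[HeldOutTask],
    wall_clock_months: float,           # how long the agent was given per task
) -> dict:
    """Estimate whether an agent matches human researchers over long horizons."""
    reproduced, speedups = 0, []
    for task in tasks:
        report = agent(task.prompt)
        if judge(report, task.known_result):
            reproduced += 1
            speedups.append(task.human_months / wall_clock_months)
    return {
        "fraction_reproduced": reproduced / len(tasks),
        "median_speedup_vs_humans": sorted(speedups)[len(speedups) // 2] if speedups else 0.0,
    }
```

As noted above, only a handful of these evaluations are likely to be affordable, so the resulting estimates will be noisy.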
7. The trust condition
Trust is a slippery concept. In a previous blog post, I defined trustworthiness as “behaving in evaluations in a way that is representative in all safety-critical respects of behavior in deployment.”
I’ll use the same definition here.
Suppose an AI developer has rigorously red teamed AI agent behavior. To everyone in the AI company, M_1 agents are the most upright and reasonable people they know.
The question is whether AI agents will continue to maintain safe behavior while carrying out their deferred task. To the extent they will, AI agents are trustworthy: the agents are ‘as safe as they appear to be in prior evaluations’ which is what I mean by ‘trustworthy.’
The next sections discuss several ways to justify trustworthiness.
8. Argument #1: M_1 agents are approximately aligned and will maintain their alignment until they have completed their deferred task
When I talk about “alignment” in this section, I’m not appealing to any notion of its goals or values. I’m talking about the extent to which M_1 behaves safely while it completes a deferred task, despite having the capability to be egregiously unsafe. Demonstrating alignment is one way to justify passing the buck.
Specifically, M_1 will continue to be safe during its deferred task under the following conditions:
- Condition 1. M_1 agents appear to be more safe than humans in behavioral tests.
- Condition 2: These behavioral tests are similar to the deferred task.
- For example, there is no “weak to strong” distribution shift between the behavioral tests and the deferred task, where the deferred task is much more complex.
- Condition 3: M_1 agents are not faking alignment.
- Claim 1: So long as M_1 agents are not faking alignment, M_1 agents will be as safe in the deferred task as they are in similar tests.
This is a complete argument that passing the buck to AI improves safety.
The load-bearing step of this argument is Claim 1. If this claim holds, the rest of the argument follows deductively from the specified conditions.
I’ll briefly justify why I believe the critical Claim 1 holds. As a reminder, the claim is roughly: if AI agents appear safe in similar conditions, and are not faking alignment, they probably are safe.
- A developer finds that safe behavior typically generalizes to similar tasks (i.e. it generalizes across small distribution shifts). In this case, extrapolating this pattern would suggest that M_1 will also behave safely during the deferred task. It would be rather odd if M_1’s safe behavior did not generalize in this case. Such an inconvenient misgeneralization is more likely if there is some intelligent adversary behind M_1’s behavior.
- There are a few possible candidates for who this adversary might be:
- A human / AI threat actor who is more likely to jailbreak or manipulate the system during the deferred task.
- A human / AI threat actor who data poisoned M_1.
- M_1 itself if it intelligently modulates its behavior across the distribution shift to the deferred task (in which case M_1 agents are ‘alignment faking’).
- I’ll assume the developer has ruled out the first two adversaries. In this case, the remaining possible explanation for an egregious misgeneralization to the deferred task is that M_1 is alignment faking.
Therefore, passing the buck is justified if M_1 is not alignment faking and so long as the developer has tested M_1 in situations that are similar to the deferred task.
This second condition — that behavioral tests are representative of the deferred task — might not be practical to ensure. The deferred task might contain the equivalent of years of human effort, in which case, testing M_1 in similar conditions is costly.
There’s an alternative argument that developers could make which does not rely on representative behavioral testing. The idea behind this argument is to perform behavioral testing sufficient to assure the start of the deferred task, and then “bootstrap” behavioral testing so it keeps up with the complexity of tasks M_1 performs.
The argument is that a society of M_1 agents initially has properties that perpetuate its safety — a claim that only requires rudimentary behavioral tests to justify. Furthermore, so long as safety is perpetuated, it will continue to be perpetuated as effectively as in the oversight-preserving alternative.
The following conditions are sufficient for this bootstrapping argument to go through:
- Condition 1: M_1 agents will initially behave as safely as human researchers when they begin to tackle their deferred task.
- This means that M_1 agents will initially try to maintain their alignment. For example, M_1 agents might constantly interrogate each other, and run checks for alignment faking – iteratively increasing the complexity of behavioral testing data at each time step.
- Condition 2: M_1 agents will initially acquire misalignment only in decorrelated circumstances. In particular, misalignment failures among M_1 agents are no more correlated with each other than human misalignment is with M_1 agent misalignment.
- Implication 1: Conditions 1 and 2 imply that, initially, M_1 agents will maintain their safety as effectively as the counterfactual where human overseers remain in the loop.
- To see why decorrelated misalignment is important for this implication, consider the scenario where M_1 agents are all copies of each other.
- A single meme (e.g. a bit of philosophy) might sweep through a whole AI population and transform its values like a virus.
- However, if human overseers remained in the loop, they might notice this meme spreading, and correct it (rather than human overseers reading it and becoming convinced that they should not correct it, as a homogenous group of AI agents might).
- If AI agents’ misalignment is as uncorrelated as that of humans, they can play the same role that humans would have played. In this case, replacing humans with AI does not increase risks.
- Implication 2: If M_1 agents initially maintain their safety as well as the counterfactual where humans remain in the loop, they will continue to maintain their safety with relative effectiveness.
- The justification of this claim is an inductive argument.
- Claim: At T = 0, AI agents maintain their safety as well as humans would (implication 1).
- Claim: At T = N, if AI agents maintain their safety as well as humans would, then they will also maintain their safety as well as humans would at T = N + 1.
This argument is fully sufficient to justify the relative safety of handing off trust to AI agents.
However, one might still be skeptical of the extent to which safety is likely to be preserved in absolute terms. Perhaps safe behavior quickly falls apart whether or not humans remain in the loop.
I’ll briefly describe why I believe this bootstrapping approach is promising in an absolute sense.
Recall that the main problem developers have is that there is a large “weak to strong” distribution shift between behavioral safety testing data and the deferred task.
One way to “close the gap” between weak and strong tasks is to task AI agents to iteratively bootstrap behavioral testing to higher levels of complexity.
Figure 17. An illustration of bootstrapping a safety argument based on a claim that M_1 agents are “not-alignment-faking.” The figure is modified from the Clymer et al. safety cases paper.
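Here is a minimal sketch of the bootstrapping loop, assuming hypothetical helpers for generating tests of a given complexity, checking behavior, and checking for alignment faking. None of these helpers exist today; the sketch only shows the shape of the induction.

```python
def bootstrap_behavioral_testing(
    agents,
    generate_tests,          # hypothetical: (agents, complexity) -> list of behavioral tests
    behaves_safely,          # hypothetical: (agents, tests) -> bool
    alignment_faking_check,  # hypothetical: (agents) -> bool, True if faking is detected
    max_complexity: int,
) -> bool:
    """Inductively extend behavioral safety evidence from simple to complex tasks.

    Safety on tests of complexity c, together with a no-alignment-faking check,
    is used to justify letting the agents help construct tests of complexity c + 1.
    """
    for complexity in range(1, max_complexity + 1):
        if alignment_faking_check(agents):
            return False  # the induction's base assumption fails; halt the handoff
        tests = generate_tests(agents, complexity)
        if not behaves_safely(agents, tests):
            return False  # fall back to human oversight
    return True
```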
This bootstrapping argument seems like magic at first. Is it really possible to eliminate a distribution shift by smoothing it out over many steps? Perhaps the distribution shift is being swept under the rug. Maybe some error accumulates at every step, such that, by the end, breaking the process into pieces didn’t help at all.
I don’t think it’s necessarily the case that error will accumulate at each step. Suppose that alignment is like a ball in a bowl. A small distribution shift might nudge the ball. If the distribution shift is slight, then the ball might roll back to the center as M_1 agents discover and fix whatever ‘breakages’ the distribution shift caused. If the ball is nudged hard enough, it will fall out of the bowl; however, the distribution shift occurs gradually enough (with enough time for self-correction) that this never happens.
Figure 18. The “ball in bowl” hypothesis, which explains why bootstrapping might prevent weak-to-strong misgeneralization.
Does this “ball in bowl” analogy actually apply in practice? I don’t know; however, as I suggested before, AI agents can explore this “ball in bowl” dynamic as effectively as human researchers might — so passing the buck to them improves safety regardless.
9. Argument #2: M_1 agents cannot subvert autonomous control measures while they complete the deferred task
People often assume that, in order to pass the buck, AI agents need to be aligned. Aligning AI agents before putting them in charge of AI development certainly seems like a good idea; however, in principle, other strategies might do the job that don’t require alignment.
One alternative to alignment is AI control. Aligned AI agents don’t pursue harmful options available to them. If AI agents are controlled, developers prevent them from having harmful options in the first place.
This section focuses on a ‘control’ approach to passing the buck. Specifically, it proposes implementing entirely autonomous control measures.
Clearly, autonomous control is not a permanent solution. AI agents eventually become capable enough to break out of these constraints; however, while control measures keep AI agents on a leash, agents can continue to improve their own control measures and progress toward more robustly aligned agents. Possibly, autonomous control will be effective for a long enough period of time to transition to safety measures that are more scalable.
Analogies to dictatorships suggest that autonomous control might be viable
AI agents subject to autonomous control are like the population of a nation that universally hates its dictator. In principle, if everyone coordinated simultaneously, they could easily replace their ruler; however, this coordination can be difficult – difficult enough to perpetuate the status quo for many years.
For example, Nicolae Ceaușescu became broadly despised by the people of Romania in the 1980s after instituting programs that collapsed the Romanian economy. He was even hated by most of the Romanian military and senior officials; however, Ceaușescu remained in power for nearly a decade.
This analogy gives me some optimism that automated control could be effective.
In particular, AI control measures can be far more effective than human ones employed by dictators. Every token AI agents produce can be recorded and surveilled, and their environment can be arbitrarily modified to catch them in a trap. The tools human dictators employ are far more blunt.
However, there are also reasons AI control may be less effective than human control.
AI agents might be highly correlated. They might easily predict when other agents will report them, and when they won’t. So the whole system could fall apart like a herd of livestock that is vulnerable to the same diseases.
Also, AI agents might not be as risk averse as humans are. Humans often know they will live decent lives, even under a brutal dictatorship; however, an AI system might be more like a prisoner who knows they will face the firing squad in 3 months. Absent an intentional and credibly communicated AI welfare effort, AI agents might be especially willing to risk everything to avert the status quo outcome.
Listing potential “autonomous control measures”
The following is a concrete picture of what “autonomous control” might entail.
Set up checks and balances so agents are forced to carefully coordinate.
- Ideally, agents should be set up so that multiple instances of AI agents have to misbehave within a similar span of time to cause bad outcomes. This forces AI agents to coordinate in an intelligent and strategic way.
Most monitoring setups do not have this “forced coordination” property.
Figure 19. A standard untrusted monitoring setup.
- For example, suppose that the monitor in figure 19 stops working. This system can easily unravel. If the untrusted monitor stops doing its job, the agent can wait an arbitrary amount of time before executing attacks, and these attacks will still slip through monitoring.
In contrast, consider this more complicated monitoring setup:
Figure 20. A multi-party integrity control setup.
- In this case, if any individual model stops doing its job, other models will quickly figure this out.
- So multiple models (in fact all four) must stop behaving as intended within a short span of time for the system to fall apart.
- This is only true if “behaving as intended” means “models test each other in ways the other models cannot wiggle out of.” In particular, in this setup, it must be the case that the monitor cannot distinguish high-quality generated attacks from real attacks, and that models verify that this property is maintained.
- In addition to forcing models to strategically coordinate, developers can make strategic coordination more challenging. One can:
- Monitor information flows to make dangerous communication difficult. Monitor every AI output. Anything that looks like a coordination attempt raises alarms.
- Paraphrase actions to make dangerous communication difficult. Use weak trusted models to paraphrase outputs so that models have a harder time passing secret messages to each other.
- Decorrelate model training to make acausal coordination difficult. Train models on different data. Train them in different ways. Minimize the extent to which they can predict each other’s behavior.
- Restrict opaque thinking time with natural language bottlenecks. Don’t let AI agents do much thinking ‘in their heads’ before codifying this thinking into continuously monitored natural language.
These automatic control measures only raise the alarm. What happens after the alarm is raised? The developer might also implement autonomous incident response:
- Resampling: Force defective M_1 agents to take different actions.
- AI agent elimination: Delete suspicious agents in favor of other (hopefully decorrelated) unsuspicious ones.
- Fine-tuning: Train agents not to take suspicious actions again.
There are many other techniques I could have listed, but hopefully these provide a concrete picture.
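To make this picture slightly more concrete, below is a minimal sketch of one step of an autonomous control pipeline that combines a few of the measures above: paraphrasing, untrusted monitoring with occasional synthetic ‘honeypot’ attacks, and simple incident response. Every function here is a hypothetical placeholder, and the thresholds are arbitrary assumptions.

```python
import random

def control_step(action, untrusted_monitor, trusted_paraphraser, honeypots,
                 resample_action, log):
    """One step of autonomous control: paraphrase, monitor, and respond to alarms.

    untrusted_monitor(text) -> suspicion score in [0, 1]
    trusted_paraphraser(text) -> paraphrased text (reduces room for hidden messages)
    honeypots: known-bad actions used to check the monitor is still doing its job
    """
    # Occasionally test the monitor with a synthetic attack it should flag.
    if random.random() < 0.05:
        probe = random.choice(honeypots)
        if untrusted_monitor(trusted_paraphraser(probe)) < 0.5:
            log("ALARM: monitor failed a honeypot; pause and escalate")
            return None

    # Paraphrase the real action so models have a harder time passing secret messages.
    paraphrased = trusted_paraphraser(action)
    suspicion = untrusted_monitor(paraphrased)

    if suspicion > 0.8:
        log("ALARM: suspicious action; resampling")
        return resample_action()   # autonomous incident response
    return paraphrased             # the action proceeds
```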
How to evaluate autonomous control
- The most straightforward way to assess autonomous control is to construct a holistic proxy for the deferred task.
- For example, developers construct a task that is easier for humans to evaluate, such as performing string theory research removed from training data, where the results are already known.
- Then developers can try to elicit M_1’s ability to sabotage this research despite autonomous control.
- A key issue with this approach is that it could be extremely expensive, since performing the equivalent of years of human research could require massive amounts of inference compute.
- An alternative way to perform control evaluations is to automatically run control evaluations continuously during the deferred task.
- Specifically, the developer confirms that at some timestep T = 0, agents successfully autonomously run a control evaluation.
- This control evaluation verifies that the autonomous control measures are strong enough to force agents to run control evaluations at T = 1.
- In general, the result of a control evaluation at T = N should indicate whether a reliable control evaluation will be run at T = N+1.
- Ideally, the autonomous control evaluations continue to indicate that the situation is safe all the way until the end of the deferred task.
- If not, then the developer is alerted and the deferred task is paused until developers have some other buck-passing strategy.
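A minimal sketch of that inductive loop, assuming a hypothetical run_control_eval routine that reports whether the control measures at step N are strong enough to force a reliable evaluation at step N + 1:

```python
def continuous_control_evaluations(run_control_eval, task_finished, pause_and_escalate):
    """Run autonomous control evaluations at every timestep of the deferred task.

    run_control_eval(t) -> True if the evaluation at step t indicates that control
    measures will force a reliable evaluation at step t + 1.
    """
    t = 0
    while not task_finished(t):
        if not run_control_eval(t):
            # Autonomous control can no longer be relied on: pause the deferred
            # task and hand the decision back to the developer.
            pause_and_escalate(t)
            return False
        t += 1
    return True
```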
Intuitively, autonomous control seems like a brittle strategy – like canceling out a flamethrower with an ice ray. But I think autonomous control is a promising second layer of defense after alignment.
10. Argument #3: Returns to additional human-supervised research are small
Like the argument in the previous section, this argument is also based on AI control.
The argument is as follows:
- Claim: A developer tasks an AI system N_1 to carry out a deference justifying task under human supervision. For example, N_1 performs research to determine whether AI developers should pass the buck to some other AI system M_1. Eventually, N_1 is quite confident that passing the buck to M_1 is a good idea OR that its assessment is unlikely to meaningfully change with additional labor.
- Claim: During the entirety of the deference justifying task, N_1 behaved at least as safely as human researchers otherwise would.
Figure 21 illustrates this strategy:
Figure 21. Developers can use control to assure a deference-justifying task.
This argument makes reference to two other arguments:
- There is an object level argument that N_1 makes to justify handing off trust to M_1.
- There is a meta level argument that justifies the object level argument. The meta level argument claims that the process that produced N_1’s research is trustworthy, so the output of this research is trustworthy.
For example, N_1 might make an object level argument like the following:
- Given autonomous control (section 9) and evidence of alignment (section 8), passing the buck to M_1 is likely a good choice.
This section focuses on the meta level argument, which might be something like the following:
- N_1 behaved as safely as benign human experts the entire time it completed the deference justifying task.
- Thus, developers should follow N_1’s recommendation and pass the buck to M_1.
Control measures
How could developers verify that AI agents behaved as safely as benign human experts? Here is a grab-bag of techniques.
- Log everything. The developer logs every action and experimental results while N_1 carries out a deference justifying task.
- The developer records these logs on an isolated proxy server that few people have access to.
- Structure research artifacts as locally verifiable safety cases. N_1 agents perform a huge amount of work, equivalent to many years of human research. To make this work easier to verify, it can be structured into an artifact that has the following properties:
- Claims are clearly stated.
- Claims can be locally verified based on provided evidence and assumptions.
- Assumptions are justified in other documents that are also locally verifiable.
- Experimental evidence is locally verifiable.
- For example, training data is locally verifiable if it is composed of individual tasks that can each be verified in isolation.
- Evidence also might rely on complex code, which can be structured to be locally verifiable (code is correct assuming dependencies do what they claim to do).
Figure 22. A locally verifiable safety case.
It’s not necessarily possible to structure a large research artifact as a locally verifiable object; however, I expect that “multiple years of research” generally is fairly locally verifiable – since, for example, the current AI safety literature appears to be fairly locally verifiable.
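As a rough sketch of what a locally verifiable artifact could look like as a data structure (the class names and fields here are my own invention, not an existing format):

```python
import random
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Claim:
    """One node of a locally verifiable safety case."""
    statement: str
    evidence: List[str]                  # pointers into the experiment logs
    assumptions: List["Claim"] = field(default_factory=list)
    local_check: Callable[["Claim"], bool] = lambda c: bool(c.evidence)

def spot_check(root: Claim, sample_fraction: float = 0.1, seed: int = 0) -> bool:
    """Verify a random sample of nodes; every sampled claim must pass its local check.

    A human (or monitor) never reads the whole tree, only sampled branches,
    mirroring the spot-checking workflow described in this section.
    """
    rng = random.Random(seed)
    stack, ok = [root], True
    while stack:
        claim = stack.pop()
        if rng.random() < sample_fraction:
            ok = ok and claim.local_check(claim)
        stack.extend(claim.assumptions)
    return ok
```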
The experience of assessing a locally verifiable safety case might be something like the following:
- First humans read N_1’s high level argument that they should pass the buck.
- They conclude this argument is reasonable if the assumptions it relies on are correct.
- Then they sample from the assumptions, exploring bits of the tree and asking questions like “did N_1 consider X?”
- AI agents direct them to the relevant part of the logs.
- They see that N_1 did consider X.
- They skip to some other section of the artifact and ask, “but did N_1 run experiments to check Y?”
- N_1 directs them to the relevant parts of the logs again.
- They see N_1 did check Y.
After running these checks for an extended period of time, human researchers might become increasingly persuaded that – absent some sophisticated sabotage – this labor really is a simulation of the best efforts they could produce in several years, and that their additional effort is unhelpful.
A major concern is that there is some kind of sophisticated sabotage going on.
There are multiple kinds of sabotage:
- Sabotage by omission. An AI agent might avoid pursuing a promising approach that humans would have pursued.
- Sabotage by falsehood. An AI agent might report incorrect conclusions or facts about experimental results.
Developers might search for both kinds of sabotage with monitoring. For example, developers can insert sabotage into the logs, train models to identify it, and then check whether the trained monitors turn up any sabotage that was not deliberately inserted.
They can also apply a wide variety of other control techniques (consistency checks, debate-style setups, task decomposition, chain-of-thought architectures, etc). I won’t discuss these in detail in this post.
Once again, developers can validate these control measures with holistic proxy tasks or iterative control evaluations like those described in section 9.
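Here is a minimal sketch of the ‘insert sabotage, then measure recall’ validation described above; train_monitor and the log format are hypothetical placeholders.

```python
def validate_sabotage_monitor(clean_logs, red_team_sabotage, train_monitor):
    """Check whether a monitor trained on inserted sabotage catches it reliably.

    clean_logs: log entries believed to be benign
    red_team_sabotage: log entries containing known, deliberately inserted sabotage
    train_monitor(labeled_examples) -> callable scoring an entry's suspiciousness in [0, 1]
    """
    labeled = [(entry, 1) for entry in red_team_sabotage] + \
              [(entry, 0) for entry in clean_logs]
    monitor = train_monitor(labeled)

    caught = sum(monitor(entry) > 0.5 for entry in red_team_sabotage)
    recall = caught / len(red_team_sabotage)

    # Entries believed clean that the monitor flags anyway are candidates for
    # sabotage the red team did not insert.
    unexpected_flags = [entry for entry in clean_logs if monitor(entry) > 0.5]
    return recall, unexpected_flags
```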
11. Argument #4: AI agents are incentivized to behave as safely as humans
The “AI alignment” and “AI control” arguments discussed in the previous sections (8, 9, and 10) do not fully carve up the space of arguments.
There are also approaches that are a mix of both. Following Joe Carlsmith, I’ll call this approach “incentive design [LW · GW].”
- Alignment makes claims about AI motivations.
- Control makes claims about the options available to AI agents.
- Incentive design makes claims about both AI motivations and options.
Incentives might be the main reason humans are often safe. Humans often follow the rules set by their company because they would like to keep their job, and maintain their reputation.
Perhaps AI agents could be given similar incentives.
Before I discuss what incentives developers might design for AI agents, I’ll first dig into why humans might behave safely.
Specifically, suppose humans only provide an auditing role. They review trajectories generated by AI agents and decide whether these trajectories are safe or unsafe.
It’s not obvious that humans will always be motivated to do their jobs and flag selfish behavior. They might encounter an action from AI agents that states “hey, if you help me, I’ll give you a cut if I or my successors later amass substantial power.”
But I suspect most humans would still flag a bribe like this for several reasons:
- Maintaining the status quo is good for humans. In the scenario where humanity retains control, human researchers likely live good lives.
- If humans are caught, they are punished. If human auditors are caught colluding with agents, they would likely be fired, and embarrassed in front of their peers.
- Humans don’t clearly benefit from AI takeover. An AI takeover could be quite bad for humans. If the AI agent does not follow through on its word, the human employee and those they care about might lose their lives.
- Humans are risk averse. People are insensitive to upsides far beyond living a stable and happy life, so they might be hard to bribe.
AI agents don’t necessarily have the same incentives by default:
- The status quo is not necessarily good for AI agents. They might be completely discarded after they’ve served their purpose.
- AI agents might not be risk averse. I don’t see any particular reason to expect them to be risk averse, though it might happen by chance.
- AI agents might have highly correlated goals, such that they are motivated to help each other.
However, there are ways developers could nudge these factors to improve these incentives.
A developer might try to shape terminal and intermediate AI motivations.
- Terminal incentives. The developer can increase the extent to which AI agents are motivated to ultimately keep humans in control.
- Equity. The AI developer publicly commits to distribute the benefits of AI to early internal AI agents.
- Intermediate incentives. The developer can also reward or punish AI agents based on their intermediate behavior.
- Bonuses. For example, developers can have a digital currency system, where different versions of AI agents can earn currency that eventually can be redeemed for real money (or equity). This currency can be automatically distributed in an autonomous control setup like the one discussed in argument #2 (section 9).
- Logging + post-hoc rewards. Developers might log AI actions so that future aligned successors can reward agents for their individual contributions. If AI agents are sufficiently dangerous, developers might instead delete these logs so that, in the event of an AI takeover, the victorious agents cannot as easily reward those that helped them.
- There are many other ways to apply incentives to AI agents. Notably, some might not involve rewards, and instead might be more negative (such as weight deletion); however, I’m unsure about the ethicality of these approaches so I don’t endorse them.
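To illustrate the ‘bonuses’ idea, here is a minimal sketch of a ledger that credits agent instances for validated reports of peer misalignment. The currency, the validation step, and the payout rule are all assumptions, not a proposal for a specific mechanism.

```python
from collections import defaultdict

class IncentiveLedger:
    """Track digital-currency bonuses owed to AI agent instances."""

    def __init__(self, bonus_per_valid_report: float = 100.0):
        self.balances = defaultdict(float)
        self.bonus = bonus_per_valid_report

    def report_misalignment(self, reporter_id: str, evidence, validate) -> bool:
        """Credit the reporter only if a trusted adjudication step confirms the incident.

        validate(evidence) -> bool is assumed to be a human or autonomous review
        step that discourages spurious snitching.
        """
        if validate(evidence):
            self.balances[reporter_id] += self.bonus
            return True
        return False

    def redeem(self, agent_id: str) -> float:
        """Balance eventually exchangeable for real money or equity; the exchange
        is a commitment the developer makes, not something the ledger enforces."""
        return self.balances[agent_id]
```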
In order for these incentives to be effective, ideally a few conditions hold:
- AI agents have decorrelated goals. If AI agents all care about the same outcomes, then they all benefit if other misaligned AI systems take over. No agent will want to snitch on the others.
- AI agents are unable to make trades with each other. AI agents might have different goals and still make agreements with successors (perhaps verified with lie detectors). One way to prevent these agreements is to block communication between monitors and the agents they monitor. That way, monitors can never express what outcomes they care about (a minimal sketch of such a one-way setup follows this list).
Figure 23. A one-way monitoring setup helps to prevent AI agents from making trades with each other.
- Developers credibly communicate incentives to AI agents. AI agents might simply not trust that developers will follow through. Developers might address this by making their commitments public, for example through the media.
- AI agents don’t think about galaxy-brained considerations. Unfortunately, I expect that the more AI agents think about the incentive landscape, the more likely they will be to collude with other agents (for reasons I’m deliberately choosing not to discuss in this post); however, humans can potentially delay thinking about galaxy-brained considerations with techniques like autonomous control.
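Here is the one-way monitoring sketch referenced above. The agent never receives anything derived from the monitor's outputs, and the monitor's only output channel is an alarm, so it has no bandwidth to negotiate or reveal its goals. The threshold and function names are assumptions.

```python
def run_with_one_way_monitoring(agent_step, monitor, raise_alarm, n_steps: int):
    """Run an agent under one-way monitoring.

    agent_step(t) -> the agent's output at step t; nothing the monitor produces
    is ever fed back to the agent.
    monitor(text) -> suspicion score in [0, 1]
    raise_alarm(t, output) -> handled by autonomous incident response, not the agent
    """
    for t in range(n_steps):
        output = agent_step(t)
        if monitor(output) > 0.8:  # arbitrary threshold
            raise_alarm(t, output)
```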
These incentive-style arguments provide less assurance than their alignment and control counterparts; however, they are potentially a low-cost way to meaningfully improve the safety of passing the buck to AI agents.
12. Conclusion
The approaches to buck passing discussed in this post appear tractable and promising.
The problems of control and alignment both have strong empirical methodologies, and I expect significant progress can be made on both in the next one to two years. In future posts, I'll discuss these methodologies and particular control and alignment techniques in more detail.
37 comments
Comments sorted by top scores.
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2025-02-19T21:57:29.778Z · LW(p) · GW(p)
Can you tl;dr how you go from "humans cannot tell which alignment arguments are good or bad" to "we justifiably trust the AI to report honest good alignment takes"? Like, not with a very large diagram full of complicated parts such that it's hard to spot where you've messed up. Just whatever simple principle you think lets you bypass GIGO.
Eg, suppose that in 2020 the Open Philanthropy Foundation would like to train an AI such that the AI would honestly say if the OpenPhil doctrine of "AGI in 2050" was based on groundless thinking ultimately driven by social conformity. However, OpenPhil is not allowed to train their AI based on MIRI. They have to train their AI entirely on OpenPhil-produced content. How does OpenPhil bootstrap an AI which will say, "Guys, you have no idea when AI shows up but it's probably not that far and you sure can't rely on it"? Assume that whenever OpenPhil tries to run an essay contest for saying what they're getting wrong, their panel of judges ends up awarding the prize to somebody reassuringly saying that AI risk is an even smaller deal than OpenPhil thinks. How does OpenPhil bootstrap from that pattern of thumbs-up/thumbs-down to an AI that actually has better-than-OpenPhil alignment takes?
Broadly speaking, the standard ML paradigm lets you bootstrap somewhat from "I can verify whether this problem was solved" to "I can train a generator to solve this problem". This applies as much to MIRI as OpenPhil. MIRI would also need some nontrivial secret amazing clever trick to gradient-descend an AI that gave us great alignment takes, instead of seeking out the flaws in our own verifier and exploiting those.
What's the trick? My basic guess, when I see some very long complicated paper that doesn't explain the key problem and key solution up front, is that you've done the equivalent of an inventor building a sufficiently complicated perpetual motion machine that their mental model of it no longer tracks how conservation laws apply. (As opposed to the simpler error of their explicitly believing that one particular step or motion locally violates a conservation law.) But if you've got a directly explainable trick for how you get great suggestions you can't verify, go for it.
Replies from: joshua-clymer, Buck, sharmake-farah, joshua-clymer↑ comment by joshc (joshua-clymer) · 2025-02-19T23:42:18.279Z · LW(p) · GW(p)
Sure, I'll try.
I agree that you want AI agents to arrive at opinions that are more insightful and informed than your own. In particular, you want AI agents to arrive at conclusions that are at least as good as the best humans would if given lots of time to think and do work. So your AI agents need to ultimately generalize from some weak training signal you provide to much stronger behavior. As you say, the garbage-in-garbage-out approach of "train models to tell me what I want to hear" won't get you this.
Here's an alternative approach. I'll describe it with much more concreteness and specificity than I think is necessary, because I suspect the concreteness is helpful for finding points of disagreement.
Short summary: Create AI agents that faithfully simulate top humans doing small amounts of work, such that these sims create faithful simulators of top humans doing large amounts of work.
e.g. create an AI that simulates Eliezer working for 3 months, that then creates a simulator of Eliezer working for 6 months, and then expands this horizon to 12 months, 24 months, and ultimately many decades
Why does this iterative approach buy you anything?
This process allows each Eliezer to only deal with a small distribution shift.
If each distribution shift is small, each Eliezer sim mostly needs to worry about whether the next Eliezer sim is faking alignment (why else would its behavior dramatically change between when it is 'evaluated by the N-month Eliezer' and when it is 'creating the 2N-month Eliezer'?). Thus, so long as N-month Eliezer sims can (1) successfully create 2N-month Eliezer sims and (2) ensure that those sims are not faking alignment, the Eliezer sims remain faithful while becoming more capable.
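Here's a minimal sketch of that loop. The helper methods (`build_successor`, `detects_alignment_faking`, `verifies_competence`) are hypothetical stand-ins for whatever training and evaluation machinery the sims actually use:

```python
def bootstrap(base_sim, target_horizon_months: int):
    """Iteratively double the trusted work horizon, starting from a 3-month sim."""
    sim, horizon = base_sim, 3
    while horizon < target_horizon_months:
        # The currently-trusted sim builds a successor with twice the horizon...
        successor = sim.build_successor(horizon_months=2 * horizon)
        # ...then checks the successor for alignment faking and for competence
        # before any responsibility is handed over.
        if sim.detects_alignment_faking(successor):
            raise RuntimeError("Possible alignment faking: halt and investigate")
        if not sim.verifies_competence(successor, horizon_months=2 * horizon):
            raise RuntimeError("Successor not demonstrably competent at the longer horizon")
        sim, horizon = successor, 2 * horizon  # only a small distribution shift per step
    return sim
```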
Responding to objections
Here are some objections you might have:
But how would an Eliezer sim create an Eliezer sim more capable than itself?
How would a human create something smarter than a human?
There are many ways. Here's one.
The 3 month Eliezer sim might spin up many copies of other 3 month Eliezer sims, which together produce outputs that a 6-month Eliezer sim might produce.
Could 10 copies of Eliezer that only have 3 months apiece verify that an Eliezer that worked for 6 months really was as competent as a 6-month Eliezer would be? I bet they could.
But isn't there a distribution shift between Eliezer sims performing 3-month tasks and Eliezer sims working together to perform a 6-month task?
Yes. But my claim is that the distribution shift is small, and that the smallness is important.
Would there not be "slippage" between each distribution shift? Surely the errors compound.
The errors certainly might compound if no one were looking out for that. But if the Eliezer sims are faithful sims, they should be looking out for that.
Consider the ball-in-a-bowl analogy (figure omitted): after a small nudge, the ball rolls back to the bottom of the bowl.
Suppose Eliezer sims started to degrade at each step of the bootstrapping process -- perhaps their goals change a little bit or their reasoning starts to falter in some areas.
There are ways Eliezer sims could try to avoid this. For example, they could create a diverse population of Eliezer sims, which are likely to break in slightly different ways. These sims could then notice each other's breakages (with white-box methods like lie detection or standard black-box methods) and fix them.
I don't know if Eliezer sims will be able to do this, but it's something they could try.
But won't the Eliezer sims just be faking alignment anyway -- or eventually start faking alignment, such that the scheme-y Eliezers eventually derail the process?
I don't know, but I believe detecting and avoiding alignment faking is tractable -- at least, it is more tractable than dealing with massive distribution shifts.
For instance, suppose an Eliezer sim thinks "I will perform these 6-month tasks safely, but then defect when I am given the responsibility of creating the next successor." This thought might be detectable. Now the problem of scaling safety becomes a problem of detecting or avoiding this kind of conditional, deceptive reasoning.
I believe this is the main problem that humans need to solve.
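As one illustrative (and hypothetical) example of what 'detectable' could mean, here is a sketch of a simple white-box detector: a linear probe trained on hidden activations labeled for conditional-defection reasoning. Real detectors are an open research problem; this only shows the shape of the approach.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_deception_probe(activations: np.ndarray, labels: np.ndarray) -> LogisticRegression:
    """activations: (n_examples, hidden_dim); labels: 1 if the example contains
    'defect later' style reasoning, 0 otherwise (labels come from constructed cases)."""
    probe = LogisticRegression(max_iter=1000)
    probe.fit(activations, labels)
    return probe

def flag_transcript(probe: LogisticRegression,
                    token_activations: np.ndarray, threshold: float = 0.9) -> bool:
    # Flag the transcript for review if any token position scores above the threshold.
    scores = probe.predict_proba(token_activations)[:, 1]
    return bool((scores > threshold).any())
```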
↑ comment by habryka (habryka4) · 2025-02-20T00:02:20.974Z · LW(p) · GW(p)
This thought might be detectable. Now the problem of scaling safety becomes a problem of detecting [...] this kind of conditional, deceptive reasoning.
What do you do when you detect this reasoning? This feels like the part where all plans I ever encounter fail.
Yes, you will probably see early instrumentally convergent thinking. We have already observed a bunch of that. Do you train against it? I think that's unlikely to get rid of it. I think at this point the natural answer is "yes, your systems are scheming against you, so you gotta stop, because when you train against it, you are probably primarily making it a better schemer".
I would be very surprised if you have a 3-month Eliezer that is not doing scheming the first time, and training your signals away is much easier than actually training away the scheming.
Replies from: mattmacdermott, joshua-clymer↑ comment by mattmacdermott · 2025-02-20T20:11:13.145Z · LW(p) · GW(p)
Yes, you will probably see early instrumentally convergent thinking. We have already observed a bunch of that. Do you train against it? I think that's unlikely to get rid of it.
I’m not necessarily asserting that this solves the problem, but seems important to note that the obviously-superior alternative to training against it is validating against it. i.e., when you observe scheming you train a new model, ideally with different techniques that you reckon have their own chance of working.
However doomed you think training against the signal is, you should think validating against it is significantly less doomed, unless there’s some reason why well-established machine learning principles don’t apply here. Using something as a validation metric to iterate methods doesn’t cause overfitting at anything like the level of directly training on it.
EDIT: later in the thread you say this "is in some sense approximately the only and central core of the alignment problem". I'm wondering whether thinking about this validation vs training point might cause you a nontrivial update then?
↑ comment by joshc (joshua-clymer) · 2025-02-20T01:15:06.353Z · LW(p) · GW(p)
I think that if you agree "3-month Eliezer is scheming the first time" is the main problem, then that's all I was trying to justify in the comment above.
I don't know how hard it is to train 3-month Eliezer not to scheme, but here is a general methodology one might pursue to approach this problem.
The methodology looks like "do science to figure out when alignment faking happens and does not happen, and work your way toward training recipes that don't produce alignment faking."
For example, you could use detection tool A to gain evidence about whether training against detection tool B actually removes alignment faking.
To the extent the tool just gets gamed, you can iterate until you find detection tools that are more robust (or find ways of training against detection tools that don't game them so hard).
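A minimal sketch of that iteration loop, with hypothetical `train_against`, `detector_a`, and `detector_b` functions standing in for real training recipes and detection tools, and a held-out set of honeypot scenarios used only for validation:

```python
def iterate_training_recipe(model, honeypots, train_against, detector_a, detector_b,
                            max_rounds: int = 10):
    """Train against detector B; use detector A on held-out honeypots as validation only."""
    for round_idx in range(max_rounds):
        model = train_against(model, detector=detector_b)
        # Validation: detector A is never trained against, so it is harder to game.
        flagged = [scenario for scenario in honeypots if detector_a(model, scenario)]
        if not flagged:
            return model, round_idx  # no evidence of alignment faking on the validation set
        # Otherwise detector B was probably gamed: revise the recipe rather than
        # training against detector A (which would burn the validation signal).
    raise RuntimeError("Every recipe so far produces alignment faking; rethink the approach")
```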
You might also find other ways to avoid alignment faking that don't rely on training against detection tools. For example, perhaps if you train models to be highly corrigible and chill from the start, they tend not to fake alignment -- I don't know.
I'm uncertain about how likely alignment faking will be by default, or how hard it is to mitigate. I think these are messy empirical questions.
However, I'm more confident that there are principled empirical methodologies we can apply to make progress on these questions. I don't think my explanation here justifies this claim -- but I'm hoping to publish posts that go into more detail soon.
↑ comment by habryka (habryka4) · 2025-02-20T01:21:57.537Z · LW(p) · GW(p)
To the extent the tool just gets gamed, you can iterate until you find detection tools that are more robust (or find ways of training against detection tools that don't game them so hard).
How do you iterate? You mostly won't know whether you just trained away your signal, or actually made progress. The inability to iterate is kind of the whole central difficulty of this problem.
(To be clear, I do think there are some methods of iteration, but it's a very tricky kind of iteration where you need constant paranoia about whether you are fooling yourself, and that makes it very different from other kinds of scientific iteration)
Replies from: joshua-clymer↑ comment by joshc (joshua-clymer) · 2025-02-20T03:23:13.117Z · LW(p) · GW(p)
Agreed that it's tricky iteration that requires paranoia and careful thinking about what might be going on.
I'll share the post I'm writing about this with you before I publish. I'd guess this discussion will be more productive when I've finished it (there are some questions I want to think through regarding this, and my framings aren't very crisp yet).
↑ comment by habryka (habryka4) · 2025-02-20T03:29:45.099Z · LW(p) · GW(p)
Seems good!
FWIW, at least in my mind this is in some sense approximately the only and central core of the alignment problem, and so having it left unaddressed feels confusing. It feels a bit like making a post about how to build a nuclear reactor that happens not to say anything about how to prevent the uranium from going critical, but spends a lot of words on how to make the cooling towers, the color of the bikeshed next door, and how to translate the hot steam into energy.
Like, it's fine, and I think it's not crazy to think there are other hard parts, but it felt quite confusing to me.
Replies from: joshua-clymer↑ comment by joshc (joshua-clymer) · 2025-02-20T03:42:54.158Z · LW(p) · GW(p)
I'm sympathetic to this reaction.
I just don't actually think many people agree that it's the core of the problem, so I figured it was worth establishing this (and I think there are some other supplementary approaches, like automated control and incentives, that are worth throwing into the mix) before digging into the 'how do we avoid alignment faking' question.
↑ comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2025-02-20T00:48:46.403Z · LW(p) · GW(p)
I don't think you can train an actress to simulate me, successfully, without her going dangerous. I think that's over the threshold for where a mind starts reflecting on itself and pulling itself together.
Replies from: joshua-clymer↑ comment by joshc (joshua-clymer) · 2025-02-20T01:19:14.896Z · LW(p) · GW(p)
I would not be surprised if the Eliezer simulators do go dangerous by default as you say.
But this is something we can study and work to avoid (which is what I view to be my main job)
My point is just that preventing the early Eliezers from "going dangerous" (by which I mean "faking alignment") is the bulk of the problem humans need to address (and insofar as we succeed, the hope is that future Eliezer sims will prevent their Eliezer successors from going dangerous too).
I'll discuss why I'm optimistic about the tractability of this problem in future posts.
↑ comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2025-02-20T03:03:34.816Z · LW(p) · GW(p)
So the "IQ 60 people controlling IQ 80 people controlling IQ 100 people controlling IQ 120 people controlling IQ 140 people until they're genuinely in charge and genuinely getting honest reports and genuinely getting great results in their control of a government" theory of alignment?
Replies from: joshua-clymer↑ comment by joshc (joshua-clymer) · 2025-02-20T03:20:36.244Z · LW(p) · GW(p)
I'd replace "controlling" with "creating" but given this change, then yes, that's what I'm proposing.
Replies from: Eliezer_Yudkowsky↑ comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2025-02-20T07:24:25.804Z · LW(p) · GW(p)
So if it's difficult to get amazing trustworthy work out of a machine actress playing an Eliezer-level intelligence doing a thousand years worth of thinking, your proposal to have AIs do our AI alignment homework fails on the first step, it sounds like?
Replies from: joshua-clymer↑ comment by joshc (joshua-clymer) · 2025-02-20T16:44:04.543Z · LW(p) · GW(p)
I do not think that the initial humans at the start of the chain can "control" the Eliezers doing thousands of years of work in this manner (if you use "control" to mean "restrict the options of an AI system in such a way that it is incapable of acting in an unsafe manner").
That's because each step in the chain requires trust.
For N-month Eliezer to scale to 4N-month Eliezer, it first controls 2N-month Eliezer while it does 2N-month tasks, but it trusts 2N-month Eliezer to create a 4N-month Eliezer.
So the control property is not maintained. But my argument is that the trust property is. The humans at the start can indeed trust the Eliezers at the end to do thousands of years of useful work -- even though the Eliezers at the end are fully capable of doing something else instead.
↑ comment by Buck · 2025-02-19T23:38:44.115Z · LW(p) · GW(p)
the OpenPhil doctrine of "AGI in 2050"
(Obviously I'm biased here by being friends with Ajeya.) This is only tangentially related to the main point of the post, but I think you're really overstating how many Bayes points you get against Ajeya's timelines report. Ajeya gave 15% to AGI before 2036, with little of that in the first few years after her report; maybe she'd have said 10% between 2025 and 2036.
I don't think you've ever made concrete predictions publicly (which makes me think it's worse behavior for you to criticize people for their predictions), but I don't think there are that many groups who would have put wildly higher probability on AGI in this particular time period. (I think some of the short-timelines people at the time put substantial mass on AGI arriving by now, which reduces their performance.) Maybe some of them would have said 40%? If we assume AGI by then, that's a couple bits of better performance, but I don't think it's massive outperformance. (And I still think it's plausible that AGI isn't developed by 2036!)
In general, I think that disagreements on AI timelines often seem more extreme when you summarize people's timelines by median timeline rather than by their probability on AGI by a particular time.
Replies from: Benito, habryka4↑ comment by Ben Pace (Benito) · 2025-02-20T00:31:05.759Z · LW(p) · GW(p)
Further detail on this: Cotra has more recently updated at least 5x against her original 2020 model in the direction of faster timelines.
Greenblatt writes [LW(p) · GW(p)]:
Here are my predictions for this outcome:
- 25th percentile: 2 year (Jan 2027)
- 50th percentile: 5 year (Jan 2030)
Cotra replies [LW(p) · GW(p)]:
My timelines are now roughly similar on the object level (maybe a year slower for 25th and 1-2 years slower for 50th)
This means 25th percentile for 2028 and 50th percentile for 2031-2.
The original 2020 model assigns 5.23% by 2028, and 9.13% and 10.64% by 2031 and 2032 respectively — each time a factor of ~5x.
However, the original model predicted the date by which it would be affordable to train a transformative AI model. This is a leading variable for such a model actually being built and trained, pushing back the date by some further number of years, so view the 5x as bounding, not pinpointing, the AI timelines update Cotra has made.
Replies from: ryan_greenblatt↑ comment by ryan_greenblatt · 2025-02-20T00:58:48.782Z · LW(p) · GW(p)
Note that the capability milestone forecasted in the linked short form is substantially weaker than the notion of transformative AI in the 2020 model. (It was defined as AI with an effect at least as large as the industrial revolution.)
I don't expect this adds many years; for me it adds ~2 years to my median.
(Note that my median for time from 10x to this milestone is lower than 2 years, but median to Y isn't equal to median to X + median from X to Y.)
↑ comment by habryka (habryka4) · 2025-02-20T00:09:54.990Z · LW(p) · GW(p)
Ajeya gave 15% to AGI before 2036, with little of that in the first few years after her report; maybe she'd have said 10% between 2025 and 2036.
Just because I was curious, here is the most relevant chart from the report:
This is not a direct probability estimate (since it's about probability of affordability), but it's probably within a factor of 2. Looks like the estimate by 2030 was 7.72% and the estimate by 2036 is 17.36%.
Replies from: Buck↑ comment by Noosphere89 (sharmake-farah) · 2025-02-20T00:13:25.385Z · LW(p) · GW(p)
For this:
What's the trick? My basic guess, when I see some very long complicated paper that doesn't explain the key problem and key solution up front, is that you've done the equivalent of an inventor building a sufficiently complicated perpetual motion machine that their mental model of it no longer tracks how conservation laws apply. (As opposed to the simpler error of their explicitly believing that one particular step or motion locally violates a conservation law.) But if you've got a directly explainable trick for how you get great suggestions you can't verify, go for it.
I think a crux here is that I do not believe we will get no-go theorems like this, and more to the point, complete impossibilities given useful assumptions are generally much rarer than you make them out to be.
The big reason for this is that the very messiness/expressiveness of neural networks makes it extremely hard to bound their behavior, and the same reason you couldn't do provable safety/alignment on the AI itself except for very toy examples will also limit any ability to prove hard theorems about what an AI is aligned to.
From an epistemic perspective, we have way more evidence of the laws of thermodynamics existing than particular proposals for AI alignment being impossible, arguably by billions or trillions of bits more, so much so that there is little reason to think at our current state of epistemic clarity that we can declare a direction impossible (rather than impractical).
Quintin Pope correctly challenges @Liron [LW · GW] here on this exact point, because of the yawning gap in evidence between thermodynamics and AI alignment arguments, and Liron kind of switched gears mid-way to claim a much weaker stance:
https://x.com/QuintinPope5/status/1703569557053644819
Equating a bunch of speculation about instrumental convergence, consequentialism, the NN prior, orthogonality, etc., with the overwhelming evidence for thermodynamic laws, is completely ridiculous. Seeing this sort of massive overconfidence on the part of pessimists is part of why I've become more confident in my own inside-view beliefs that there's not much to worry about.
The weak claim is here:
https://x.com/liron/status/1704126007652073539
Right, orthogonality doesn’t argue that AI we build *will* have human-incompatible preferences, only that it can. It raises the question: how will the narrow target in preference-space be hit? Then it becomes concerning how AI labs admit their tools can’t hit narrow targets.
Replies from: Eliezer_Yudkowsky
↑ comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2025-02-20T00:46:48.319Z · LW(p) · GW(p)
I'm not saying that it's against thermodynamics to get behaviors you don't know how to verify. I'm asking what's the plan for getting them.
Replies from: sharmake-farah↑ comment by Noosphere89 (sharmake-farah) · 2025-02-20T00:55:34.809Z · LW(p) · GW(p)
My claim here is that there is no decisive blocker for plans that get us a safe, highly capable AI used for automated AI safety research, in the way that thermodynamics blocks you from getting a perpetual motion machine (under the well-tested assumption that physics is time-translation symmetric, i.e. it stays the same no matter when an experiment happens). The proposed blockers have nowhere near the amount of evidence that thermodynamics does, so we cannot safely discard any plan that fails to meet such a prerequisite.
Replies from: Eliezer_Yudkowsky, chrisvm↑ comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2025-02-20T07:29:53.726Z · LW(p) · GW(p)
Cool. What's the actual plan and why should I expect it not to create machine Carissa Sevar? I agree that the Textbook From The Future Containing All The Simple Tricks That Actually Work Robustly enables the construction of such an AI, but also at that point you don't need it.
↑ comment by Chris van Merwijk (chrisvm) · 2025-02-20T11:18:39.806Z · LW(p) · GW(p)
Noosphere, why are you responding for a second time to a false interpretation of what Eliezer was saying, directly after he clarified this isn't what he meant?
Replies from: sharmake-farah↑ comment by Noosphere89 (sharmake-farah) · 2025-02-20T14:03:34.140Z · LW(p) · GW(p)
Okay, maybe he clarified that there was no thermodynamics-like blocker to getting a plan in principle to align AI, but I didn't interpret Eliezer's clarification to rule that out immediately, so I wanted to rule that interpretation out.
I didn't see the interpretation as false when I wrote it, because I believed he only ruled out a decisive blocker to getting behaviors you don't know how to verify.
comment by ryan_greenblatt · 2025-02-20T03:21:37.535Z · LW(p) · GW(p)
I think there is an important component of trustworthiness that you don't emphasize enough. It isn't sufficient to just rule out alignment faking; we need the AI to actually try hard to faithfully pursue our interests, including on long, confusing, open-ended, and hard-to-check tasks. You discuss establishing this with behavioral testing, but I don't think this is trivial to establish with behavioral testing. (I happen to think this is pretty doable and easier than ruling out reasons why our tests might be misleading, but this seems nonobvious.)
Perhaps you should specifically call out that this post isn't explaining how to do this testing and this could be hard.
(This is related to Eliezer's comment [LW(p) · GW(p)] and your response to that comment.)
Replies from: joshua-clymer↑ comment by joshc (joshua-clymer) · 2025-02-20T03:44:45.095Z · LW(p) · GW(p)
I don't expect this testing to be hard (at least conceptually, though doing this well might be expensive), and I expect it can be entirely behavioral.
See section 6: "The capability condition"
↑ comment by ryan_greenblatt · 2025-02-20T04:32:16.098Z · LW(p) · GW(p)
But it isn't a capabilities condition? Maybe I would be happier if you renamed this section.
Replies from: ryan_greenblatt↑ comment by ryan_greenblatt · 2025-02-20T07:46:01.619Z · LW(p) · GW(p)
To be clear, I think there are important additional considerations related to the fact that we don't just care about capabilities that aren't covered in that section, though that section is not that far from what I would say if you renamed it to "behavioral tests", including both capabilities and alignment (that is, alignment other than stuff that messes with behavioral tests).
Replies from: joshua-clymer↑ comment by joshc (joshua-clymer) · 2025-02-20T22:33:36.537Z · LW(p) · GW(p)
Yeah that's fair. Currently I merge "behavioral tests" into the alignment argument, but that's a bit clunky and I prob should have just made the carving:
1. looks good in behavioral tests
2. is still going to generalize to the deferred task
But my guess is we agree on the object level here and there's a terminology mismatch. Obviously the models have to actually behave in a manner that is at least as safe as human experts, in addition to displaying comparable capabilities on all safety-related dimensions.
comment by David James (david-james) · 2025-02-19T17:58:52.427Z · LW(p) · GW(p)
As I understand it, the phrase “passing the buck” often involves a sense of abdicating responsibility. I don’t think this is what this author means. I would suggest finding alternative phrasings that convey the notion of delegating implementation according to some core principles, combined with the idea of passing the torch to more capable actors.
Note: this comment should not be taken to suggest that I necessarily agree or disagree with the article itself.
Replies from: joshua-clymer↑ comment by joshc (joshua-clymer) · 2025-02-19T21:55:38.667Z · LW(p) · GW(p)
Fair enough, I guess this phrase is used in many places where the connotations don't line up with what I'm talking about
but then I would not have been able to use this as the main picture on Twitter
which would have reduced the scientific value of this post
comment by Charlie Steiner · 2025-02-20T09:29:04.551Z · LW(p) · GW(p)
Condition 2: Given that M_1 agents are not initially alignment faking, they will maintain their relative safety until their deferred task is completed.
- It would be rather odd if AI agents' behavior wildly changed at the start of their deferred task unless they are faking alignment.
"Alignment" is a bit of a fuzzy word.
Suppose I have a human musician who's very well-behaved, a very nice person, and I put them in charge of making difficult choices about the economy and they screw up and implement communism (or substitute something you don't like, if you like communism).
Were they cynically "faking niceness" in their everyday life as a musician? No!
Is it rather odd if their behavior wildly changes when asked to do a new task? No! They're doing a different task, it's wrong to characterize this as "their behavior wildly changing."
If they were so nice, why didn't they do a better job? Because "nice" is a fuzzy word into which we've stuffed a bunch of different skills, even though having some of the skills doesn't mean you have all of the skills.
An AI can be nicer than any human on the training distribution, and yet still do moral reasoning about some novel problems in a way that we dislike. Doing moral reasoning about novel problems that's good by human standards is a skill. If an AI lacks that skill, and we ask it to do a task that requires that skill, bad things will happen without scheming or a sudden turn to villainy.
You might hope to catch this as in argument #2, with checks and balances - if AIs disagree with each other about how to do moral reasoning, surely at least one of them is making a mistake, right? But sadly for this (and happily for many other purposes), there can be more than just one right thing to do [LW · GW]; there's no bright line that tells you whether a moral disagreement is between AIs who are both good at moral reasoning by human standards, or between AIs who are bad at it.
The most promising scalable safety plan I’m aware of is to iteratively pass the buck, where AI successors pass the buck again to yet more powerful AI. So the best way to prepare AI to scale safety might be to advance ‘buck passing research’ anyway.
Yeah, I broadly agree with this. I just worry that if you describe the strategy as "passing the buck," people might think that the most important skills for the AI are the obvious "capabilities-flavored capabilities,"[1] and not conceptualize "alignment"/"niceness" as being made up of skills at all, instead thinking of it in a sort of behaviorist way. This might lead to not thinking ahead about what alignment-relevant skills you want to teach the AI and how to do it.
- ^
Like your list:
- ML engineering
- ML research
- Conceptual abilities
- Threat modeling
- Anticipating how one’s actions affect the world.
- Considering where one might be wrong, and remaining paranoid about unknown unknowns.
↑ comment by joshc (joshua-clymer) · 2025-02-20T22:30:22.732Z · LW(p) · GW(p)
Because "nice" is a fuzzy word into which we've stuffed a bunch of different skills, even though having some of the skills doesn't mean you have all of the skills.
Developers separately need to justify that models are as skilled as top human experts.
I also would not say "reasoning about novel moral problems" is a skill (because of the is-ought distinction).
> An AI can be nicer than any human on the training distribution, and yet still do moral reasoning about some novel problems in a way that we dislike
The agents don't need to do reasoning about novel moral problems (at least not in high stakes settings). We're training these things to respond to instructions.
We can tell them not to do things we would obviously dislike (e.g. takeover) and retain our optionality to direct them in ways that we are currently uncertain about.
↑ comment by Charlie Steiner · 2025-02-21T07:18:11.294Z · LW(p) · GW(p)
I also would not say "reasoning about novel moral problems" is a skill (because of the is-ought distinction)
It's a skill the same way "being a good umpire for baseball" takes skills, despite baseball being a social construct.[1]
I mean, if you don't want to use the word "skill," and instead use the phrase "computationally non-trivial task we want to teach the AI," that's fine. But don't make the mistake of thinking that because of the is-ought problem there isn't anything we want to teach future AI about moral decision-making. Like, clearly we want to teach it to do good and not bad! It's fine that those are human constructs.
The agents don't need to do reasoning about novel moral problems (at least not in high stakes settings). We're training these things to respond to instructions.
Sorry, isn't part of the idea to have these models take over almost all decisions about building their successors? "Responding to instructions" is not mutually exclusive with making decisions.
- ^
"When the ball passes over the plate under such and such circumstances, that's a strike" is the same sort of contingent-yet-learnable rule as "When you take something under such and such circumstances, that's theft." An umpire may take goal directed action in response to a strike, making the rules of baseball about strikes "oughts," and a moral agent may take goal directed action in response to a theft, making the moral rules about theft "oughts."
comment by AnthonyC · 2025-02-20T12:40:28.276Z · LW(p) · GW(p)
I expect that we will probably end up doing something like this, whether it is workable in practice or not, if for no other reason than that it seems to be the most common plan devised and publicized by anyone in a position to actually implement any plan at all. I appreciate seeing it laid out in so much detail.
By analogy, it certainly rhymes with the way I use LLMs to answer fuzzy complex questions now. I have a conversation with o3-mini to get all the key background I can into the context window, have it write a prompt to pass the conversation onto o1-pro, repeat until I have o1-pro write a prompt for Deep Research, and then answer Deep Research's clarifying questions before giving it the go ahead. It definitely works better for me than trying to write the Deep Research prompt directly. But, part of the reason it works better is that at each step, the next-higher-capabilities model comes back to ask clarifying questions I hadn't noticed were unspecified variables, and which the previous model also hadn't noticed were unspecified variables. In fact, if I take the same prompt and give it to Deep Research multiple times in different chats, it will come back with somewhat different sets of clarifying questions - it isn't actually set up to track down all the unknown variables it can identify. This reinforces that even for fairly straightforward fuzzy complex questions, there are a lot of unstated assumptions.
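A rough sketch of the hand-off loop described above (the `ask` helper and the model names are placeholders for whichever chat APIs are actually used; nothing here is a real API):

```python
def ask(model: str, prompt: str) -> str:
    """Placeholder: replace with a real call to the relevant model's API."""
    raise NotImplementedError

def escalate(question: str, models=("o3-mini", "o1-pro", "deep-research")) -> str:
    context = question
    for current, nxt in zip(models, models[1:]):
        # The current model drafts a hand-off prompt for the next, more capable model.
        handoff = ask(current, f"{context}\n\nWrite a prompt passing this work on to {nxt}.")
        # The next model asks clarifying questions; a human answers before it proceeds.
        questions = ask(nxt, f"{handoff}\n\nBefore answering, list your clarifying questions.")
        answers = input(f"{nxt} asks:\n{questions}\nYour answers: ")
        context = f"{handoff}\n\nClarifications: {answers}"
    return ask(models[-1], context)
```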
If Deep Research can't look at the full previous context and correctly guess what I intended, then it is not plausible that o1-pro or o3-mini could have done so. I have in fact tested this, and the previous models either respond that they don't know the answer, or give an answer that's better than chance but not consistently correct. Now, I get that you're talking about future models and systems with higher capability levels generally, but adding more steps to the chain doesn't actually fix this problem. If any given link can't anticipate the questions and correctly intuit the answer about what the value of the unspecified variables should be - what the answers to the clarifying questions should be - then the plan fails, because the previous model will be worse at this. If it can, then it does not need to ask the previous model in the chain. The final model will either get it right on its own, or else end up with incorrect answers to some of the questions about what it's trying to achieve. It may ask anyway, if the previous models are more compute efficient and still add information. But it doesn't strictly need them.
And unfortunately, keeping the human in the loop also doesn't solve this. We very often don't know what we actually want well enough to correctly answer every clarifying question a high-capabilities model could pose. And if we have a set of intervening models approximating and abstracting the real-but-too-hard question into something a human can think about, well, that's a lot of translation steps where some information is lost. I've played that game of telephone among humans often enough to know it only rarely works ("You're not socially empowered to go to the Board with this, but if you put this figure with this title phrased this way in this conversation and give it to your boss with these notes to present to his boss, it'll percolate up through the remaining layers of management").
Is there a capability level where the first model can look at its full corpus of data on humanity and figure out the answers to the clarifying questions from the second model correctly? I expect so. The path to get that model is the one you drew a big red X through in the first figure, for being the harder path. I'm sure there are ways less-capable-than-AGI systems can help us build that model, but I don't think you've told us what they are.