Why do misalignment risks increase as AIs get more capable?

post by ryan_greenblatt · 2025-04-11T03:06:50.928Z · LW · GW · 6 comments

It's generally agreed that as AIs get more capable, risks from misalignment increase. But there are a few different mechanisms by which more capable models are riskier, and distinguishing between those mechanisms is important when estimating the misalignment risk posed at a particular level of capabilities or by a particular model.

There are broadly 3 reasons why misalignment risks increase with capabilities:

As there are multiple reasons why risks increase with capabilities, I expect that risk will increase rapidly around the point when AIs are capable enough to obsolete large fractions of human cognitive labor. Depending on the rate of capabilities progress, this could mean that risk increases rapidly over time.

My current sense is that we'll probably see capabilities (and other AI properties) which make natural egregious misalignment plausible (>5% likely) substantially before we see a level of capabilities and affordances such that the overall risk (in the absence of focused countermeasures) is that high (>3% as bad as AI takeover). However, this depends on properties of the models, such as whether they have neuralese recurrence. In worlds where we retain basically legible chain-of-thought, I think this prediction is less likely to be true: egregious misalignment might only become plausible once capabilities and affordances are already very high. Worlds where we retain legible chain-of-thought generally have substantially lower risk, so I focus less on these worlds when thinking about mitigations.

To make a concrete guess, I think AIs which 3x AI R&D (as defined here [LW · GW]), and which have capabilities, properties, and affordances similar to my modal guess (also discussed here [LW · GW]), pose risks roughly as bad as a 0.25% chance of AI takeover if basically unmitigated. (As in, you'd be willing to trade off these risks for a 0.25% chance of AI takeover.) By “basically unmitigated”, I mean that the AI company doesn't make a specific effort to handle non-obvious egregious misalignment risks for these AIs, while still doing things which are useful for other reasons (but we pretend this doesn't update us about whether this company will address misalignment for later AIs). For AIs which 10x AI R&D, I think the corresponding basically unmitigated risk is closer to 5%. For AIs which are just capable enough to fully automate AI R&D (such that humans are no longer adding substantial value), I'd guess 25%. These numbers are of course very uncertain and quite sensitive to the exact properties of these systems (which we'll be able to directly inspect at the time).
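
To give a sense of the shape these guesses imply, here is a minimal Python sketch that tabulates the three point estimates above and interpolates between them on a log-log scale. The interpolation (and treating "fully automates AI R&D" as roughly a 30x speedup) is purely illustrative; I'm not committing to any particular curve between these milestones.

```python
import math

# Point estimates from above, as (AI R&D speedup multiplier, unmitigated risk),
# where "risk" means the equivalent chance of AI takeover you'd trade these risks for.
# The speedup value for "fully automates AI R&D" is set to 30x purely for illustration.
RISK_GUESSES = [
    (3.0, 0.0025),   # AIs which 3x AI R&D
    (10.0, 0.05),    # AIs which 10x AI R&D
    (30.0, 0.25),    # AIs which fully automate AI R&D (the 30x figure is an assumption)
]

def unmitigated_risk(speedup: float) -> float:
    """Interpolate between the point estimates on a log-log scale (an assumption,
    used only to show how steeply risk rises between the stated milestones)."""
    if speedup <= RISK_GUESSES[0][0]:
        return RISK_GUESSES[0][1]
    if speedup >= RISK_GUESSES[-1][0]:
        return RISK_GUESSES[-1][1]
    points = [(math.log(s), math.log(r)) for s, r in RISK_GUESSES]
    x = math.log(speedup)
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            t = (x - x0) / (x1 - x0)
            return math.exp(y0 + t * (y1 - y0))
    raise AssertionError("unreachable")

for s in (3, 5, 10, 20, 30):
    print(f"{s:>2}x AI R&D speedup -> ~{unmitigated_risk(s):.1%} unmitigated risk")
```

The only point of this sketch is that, on these guesses, risk grows much faster than linearly in the capability level.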

This has some implications for "trustedness evaluations": capability evaluations which assess whether AIs are plausibly egregiously misaligned (based on, e.g., their level of opaque reasoning ability and other model properties [LW · GW]). Trustedness evaluations are useful both for (1) determining when you should start worrying about misalignment and (2) determining which (weaker) models you can treat as trusted in the context of control [AF · GW]. Because I expect that misalignment will be plausible (and thus trustedness evals will trigger) before the bulk of the risk is present, usage (1) now seems less important to me.[1] Before the risks become huge, I expect that AIs will be so clearly powerful and useful (e.g., near the point of automating research engineers at AI companies) that arguing for misalignment risks will be substantially easier. So, I think work focused on demonstrating that misalignment is plausible (such that costly control measures get triggered) should focus on the point when capabilities are quite high and AIs are already used heavily, at least within the AI company.
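
As a rough sketch of what usage (2) could look like operationally, here is a minimal Python example of gating control decisions on a trustedness evaluation. The eval signals, thresholds, and names are hypothetical placeholders rather than any established evaluation suite.

```python
from dataclasses import dataclass

@dataclass
class TrustednessEvalResult:
    """Hypothetical summary of a trustedness evaluation for one model."""
    opaque_reasoning_score: float       # ability to reason without legible chain-of-thought (0-1)
    situational_awareness_score: float  # knowledge of its own training/deployment situation (0-1)

# Illustrative thresholds; real thresholds would need to be calibrated and argued for.
OPAQUE_REASONING_THRESHOLD = 0.3
SITUATIONAL_AWARENESS_THRESHOLD = 0.5

def is_trusted(result: TrustednessEvalResult) -> bool:
    """Treat a model as trusted (usage 2) only if it is too weak on the relevant
    properties for egregious misalignment to be plausible."""
    return (
        result.opaque_reasoning_score < OPAQUE_REASONING_THRESHOLD
        and result.situational_awareness_score < SITUATIONAL_AWARENESS_THRESHOLD
    )

weak_monitor = TrustednessEvalResult(opaque_reasoning_score=0.1, situational_awareness_score=0.2)
frontier_model = TrustednessEvalResult(opaque_reasoning_score=0.6, situational_awareness_score=0.7)

print(is_trusted(weak_monitor))    # True  -> usable as a trusted monitor in a control setup
print(is_trusted(frontier_model))  # False -> must be handled as potentially misaligned
```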


  1. Using trustedness evals for (1) also maybe requires these evaluations to produce evidence that is legible enough to convince decision makers; this poses additional difficulties. ↩︎

6 comments


comment by ChristianKl · 2025-04-11T15:19:00.386Z · LW(p) · GW(p)

What's the difference between being given affordances and getting power? If you are given more affordances, you have more power. However, power seeking is about doing things that increase the affordances you have.

Replies from: ryan_greenblatt
comment by ryan_greenblatt · 2025-04-11T16:16:20.804Z · LW(p) · GW(p)

The difference I was intending is:

  • The AI is intentionally given affordances by humans.
  • The AI gains power in a way which isn't desired by its creators/builders (likely subversively).
Replies from: ChristianKl
comment by ChristianKl · 2025-04-11T16:45:41.239Z · LW(p) · GW(p)

Plenty of things I desire happen without me intending for them to happen.

In general, the reference class is misalignment of agents, and AIs aren't the only agents. We can look at how these terms work in a corporation.

There are certain powers that the CEO of a big corporation intentionally delegates to a mid-level manager. But I think there's also plenty that a CEO appreciates the mid-level manager doing without the CEO explicitly tasking them to do it. The CEO likely appreciates it if the mid-level manager autonomously takes charge of solving problems without bothering the CEO about them.

On the other hand, there are also cases where the mid-level manager plays company politics and engineers a situation in which the CEO ends up giving the mid-level manager certain powers that the CEO doesn't desire to give them. The CEO feels forced to do so because of how the company politics play out, yet the CEO does intentionally hand those powers to the mid-level manager.

Replies from: ryan_greenblatt
comment by ryan_greenblatt · 2025-04-11T17:28:36.589Z · LW(p) · GW(p)

Sure, there might be a spectrum (though I do think some cases are quite clear cut), but I still think the distinction is useful.

Replies from: ChristianKl
comment by ChristianKl · 2025-04-12T09:54:01.830Z · LW(p) · GW(p)

I don't think it's a spectrum; a spectrum is something one-dimensional. The problem with your distinction is that someone might think they are safe from problems arising from power seeking (in a broader definition) if they prevent the AI from doing things that they don't desire.

There are probably three variables that matter:

  1. How much agency the human has in the interaction.
  2. How much agency the AI agent has in the interaction.
  3. Whether the AI cooperates or defects in the game-theoretic sense.

If you have a low-agency CEO and a high-agency, very smart middle manager that always cooperates, that middle manager can still acquire more and more power over the organization.
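
A toy sketch in Python of how these three variables could interact (the update rule is made up purely for illustration):

```python
def power_after(steps: int, human_agency: float, ai_agency: float, cooperates: bool,
                initial_power: float = 0.1) -> float:
    """Toy model (invented for illustration): each step, the agent's share of
    organizational power drifts up by the gap between its agency and the human's,
    and defection adds an extra grab on top."""
    power = initial_power
    for _ in range(steps):
        drift = max(0.0, ai_agency - human_agency) * 0.05
        grab = 0.0 if cooperates else 0.05
        power = min(1.0, power + drift + grab)
    return power

# A low-agency CEO with a high-agency middle manager that always cooperates:
print(power_after(steps=20, human_agency=0.2, ai_agency=0.9, cooperates=True))   # ~0.8
# The same manager defecting accumulates power even faster:
print(power_after(steps=20, human_agency=0.2, ai_agency=0.9, cooperates=False))  # 1.0
```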

comment by Charlie Steiner · 2025-04-11T04:20:15.905Z · LW(p) · GW(p)

I'm big on point #2 feeding into point #1.

"Alignment," used in a way where current AI is aligned - a sort of "it does basically what we want, within its capabilities, with some occasional mistakes that don't cause much harm" sort of alignment - is simply easier at lower capabilities, where humans can do a relatively good job of overseeing the AI, not just in deployment but also during training. Systematic flaws in human oversight during training leads (under current paradigms) to misaligned AI.