Refining the Sharp Left Turn threat model, part 2: applying alignment techniques

post by Vika, Vikrant Varma (amrav), Ramana Kumar (ramana-kumar), Rohin Shah (rohinmshah) · 2022-11-25T14:36:08.948Z · LW · GW · 9 comments

This is a link post for https://vkrakovna.wordpress.com/2022/11/25/refining-the-sharp-left-turn-threat-model/

Contents

  Plan: we use alignment techniques to find a goal-aligned model before SLT occurs, and the model preserves its goals during the SLT. 
  Step 1: Finding a goal-aligned model before SLT
  Step 2: The goal-aligned model preserves its goals during SLT (with some help from us)
  Takeaways
9 comments

A Sharp Left Turn [LW · GW] (SLT) is a possible rapid increase in AI system capabilities (such as planning and world modeling). This post will outline our current understanding of the most promising plan for getting through an SLT and how it could fail (conditional on an SLT occurring).

In a previous post [LW · GW], we broke down the SLT threat model into 3 claims:

  1. Capabilities will generalize far (i.e. to many domains)
  2. Alignment techniques that worked previously will fail during this transition
  3. Humans can’t intervene to prevent or align this transition 

There we proposed some possible mechanisms for Claim 1; this post investigates possible arguments and mechanisms for Claim 2. 

Plan: we use alignment techniques to find a goal-aligned model before SLT occurs, and the model preserves its goals during the SLT. 

We can try to learn a goal-aligned [LW · GW] model before SLT occurs: a model that has beneficial goals and is able to reason about its own goals. This requires the model to have two properties: goal-directedness towards beneficial goals, and situational awareness (which enables the model to reason about its goals). Here we use the term "goal-directedness" in a weak sense (that includes humans and allows incoherent preferences) rather than a strong sense (that implies expected utility maximization). 

One can argue that the goal-aligned model has an incentive to preserve its goals, which would result in an aligned model after SLT. Since preserving alignment during SLT is largely outsourced to the model itself, arguments [LW · GW] for alignment techniques failing during an SLT don't imply that the plan fails (e.g. it might be fine if interpretability or ELK techniques no longer work reliably during the transition if we can trust the model to manage the transition). 

Step 1: Finding a goal-aligned model before SLT

We want to ensure that the model is goal-oriented with a beneficial goal and has situational awareness before SLT. It's important that the model acquires situational awareness at the right time: after it acquires beneficial goals. If situational awareness arises when the model has undesirable goals, this leads to deceptive alignment. 

Thus, our model search process would follow a decision tree along these lines:

  • If situational awareness is detected without goal-directedness, restart the search. 
  • If undesirable goal-directedness or early signs of deceptive alignment are detected, restart the search. 
  • If an upcoming phase transition in capabilities is detected, and the model is not goal-aligned, restart the search. 
  • If beneficial goal-directedness is detected without situational awareness, train the model for situational awareness. 
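
As a rough illustration (ours, not from the original post), this decision tree can be written as a search loop. The detector functions below are hypothetical placeholders for whatever interpretability and evaluation tools would actually be available:

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical sketch of the decision tree above. The detector callables stand
# in for interpretability / evaluation tools that would need to exist for this
# plan to be followed; none of them are real APIs.

@dataclass
class Detectors:
    situational_awareness: Callable[[object], bool]
    goal_directedness: Callable[[object], Optional[str]]  # None, "beneficial", or "undesirable"
    deceptive_alignment_signs: Callable[[object], bool]
    upcoming_phase_transition: Callable[[object], bool]

def search_for_goal_aligned_model(init_model, train_step, train_situational_awareness,
                                  d: Detectors, max_steps: int = 1000):
    """Follow the decision tree above; return a goal-aligned model or None."""
    model = init_model()
    for _ in range(max_steps):
        model = train_step(model)
        aware = d.situational_awareness(model)
        goals = d.goal_directedness(model)

        if aware and goals is None:
            # Situational awareness without goal-directedness: restart the search.
            model = init_model()
        elif goals == "undesirable" or d.deceptive_alignment_signs(model):
            # Undesirable goals or early signs of deceptive alignment: restart.
            model = init_model()
        elif d.upcoming_phase_transition(model) and not (aware and goals == "beneficial"):
            # A capability phase transition looks imminent but the model is not
            # goal-aligned yet: restart rather than risk an unaligned SLT.
            model = init_model()
        elif goals == "beneficial" and not aware:
            # Beneficial goals without situational awareness: train it in.
            model = train_situational_awareness(model)
        elif goals == "beneficial" and aware:
            # Beneficial goals plus situational awareness: goal-aligned.
            return model
    return None
```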

Here are some ways alignment techniques could be applied in this plan: 

Why alignment techniques could fail to find a goal-aligned model before SLT occurs:

It's also possible that this model search process would impose too high an alignment tax to be followed in practice (see Claim 3). 

Step 2: The goal-aligned model preserves its goals during SLT (with some help from us)

This step of the plan relies on the goal-aligned model both wanting to preserve its goals during an SLT and being capable of preserving its goals. The general argument for a model wanting to preserve its own goals is that it has a convergent instrumental incentive to ensure that future versions of itself act in ways that satisfy its current preferences. 

One argument [LW · GW] for a pre-SLT system being capable of preserving its goals is by analogy to human values being preserved when humans went through the capability phase transition of developing language and writing. However, it is not clear how much the human analogy constitutes evidence for ability to preserve goals through a phase transition, since there are other possible explanations for human values being preserved (e.g. that values like "relationships" and "social status" continued to be selected for during the capability transition) that don't rely on humans being able to deliberately preserve their values.

A key question is whether the model could reflect on its goals without already having generalized capabilities. This is related to Claim 1a in our previous post: that the model's capabilities will all generalize at the same time. If this claim holds, in the sense that reflection on goals is part of this set of capabilities that all generalize simultaneously, then the ability to reflect on goals would only arise during the SLT and so this plan doesn't work. However, it seems more likely that some form of goal-reflection will arise before generalized planning ability, because making predictions about how your goals could change is easier than achieving them. 

If the model can predict that an SLT is likely to modify its goals in an undesirable way, we expect that it would choose not to undergo an SLT in the first place (e.g. by avoiding new internal algorithms or ways of thinking that could lead to an SLT). 

Paul Christiano outlined [LW · GW] some specific ways for a goal-aligned model to preserve its goals during SLT depending on the mechanism for SLT:

"Aligning the internal search [or natural selection inside the model] seems very similar to aligning SGD on the outside. [...] Because the search is on the inside, we can't directly apply our alignment insights to align it. Instead we need to [use ELK to] ensure that SGD learns to align the search."

"If our model is selecting cognitive actions, or designing new algorithms, then our core hope is that an aligned model will try to think in an aligned way. So if we've been succeeding at alignment so far then the model will be trying to stay aligned."

"One way this can go wrong is if our model wants to stay aligned but fails, e.g. because it identifies new techniques for thinking that themselves pose new alignment difficulties (just as we desire human flourishing but may instead implement AI systems that want paperclips). [...] If you've succeeded at alignment so far, then your AI will also consider this a problem and will be trying to solve it. I think we should relate to our AI, discovering new ways to think that might pose new alignment difficulties, in the same way that we relate to future humans who may encounter alignment difficulties. The AI may solve the problem, or may implement policy solutions, or etc., and our role is to set them up for success just like we are trying to set up future humans for success." 

We also consider how important it would be for the goal-preservation process to go exactly right. If the SLT produces a strongly goal-directed model that is an expected utility maximizer, then the process has to hit a small set of utility functions that are human-compatible to maximize. However, it is not clear whether SLT would produce a utility maximizer. Returning to the example of humans undergoing an SLT, we can see that getting better at planning and world modeling made them more goal-directed but not maximally so (even with our advanced concepts and planning, we still have lots of inconsistent preferences and other limitations). It seems plausible that coherence is really hard [LW · GW] and an SLT would not produce a completely coherent system. 
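
To make "inconsistent preferences" concrete, here is a toy illustration (ours, not from the post) of the standard money-pump argument: an agent whose preferences contain a cycle can be charged a small fee on every trade while ending up back where it started, whereas a coherent expected utility maximizer has no such cycle to exploit.

```python
# Toy money-pump illustration: a hypothetical agent with cyclic preferences
# (prefers B to A, C to B, and A to C) will pay a fee on every swap and end up
# holding its original item, having lost money. Ruling out such cycles is part
# of what it means to be a coherent expected utility maximizer.

prefers_over = {"A": "B", "B": "C", "C": "A"}  # maps each item to the item the agent prefers to it

def money_pump(start_item: str, rounds: int, fee: float = 1.0) -> float:
    """Trade with the cyclic agent for `rounds` swaps; return total fees extracted."""
    item, extracted = start_item, 0.0
    for _ in range(rounds):
        item = prefers_over[item]  # the agent happily swaps to the item it prefers...
        extracted += fee           # ...and pays a small fee each time
    return extracted

if __name__ == "__main__":
    # After 3 swaps the agent holds "A" again but has paid 3 fees.
    print(money_pump("A", rounds=3))  # -> 3.0
```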

Some ways a goal-aligned model could fail to preserve its goals:

Some ways that humans could fail to help the model to preserve its goals:

Takeaways

The above is our current model of the most promising plan for managing an SLT and how it could fail. The overall takeaways are:

The core reasons to be skeptical of this plan are: 

If we missed any important components of this plan or ways it could fail, please let us know!

9 comments

Comments sorted by top scores.

comment by johnswentworth · 2022-11-25T17:27:22.886Z · LW(p) · GW(p)

One can argue that the goal-aligned model has an incentive to preserve its goals, which would result in an aligned model after SLT. Since preserving alignment during SLT is largely outsourced to the model itself, arguments [LW · GW] for alignment techniques failing during an SLT don't imply that the plan fails...

I think this misses the main failure mode of a sharp left turn. The problem is not that the system abandons its old goals and adopts new goals during a sharp left turn. The problem is that the old goals do not generalize in the way we humans would prefer, as capabilities ramp up. The model keeps pursuing those same old goals, but stops doing what we want because the things we wanted were never optimal for the old goals in the first place. Outsourcing goal-preservation to the model should be fine once capabilities are reasonably strong, but goal-preservation isn't actually the main problem which needs to be solved here.

(Or perhaps you're intentionally ignoring that problem by assuming "goal-alignment"?)

Replies from: Vika, ramana-kumar
comment by Vika · 2022-11-25T17:40:58.016Z · LW(p) · GW(p)

I would consider goal generalization as a component of goal preservation, and I agree this is a significant challenge for this plan. If the model is sufficiently aligned to the goal of being helpful to humans, then I would expect it would want to get feedback about how to generalize the goals correctly when it encounters ontological shifts. 

comment by Ramana Kumar (ramana-kumar) · 2022-11-25T17:38:39.888Z · LW(p) · GW(p)

I agree with you - and yes we ignore this problem by assuming goal-alignment. I think there's a lot riding on the pre-SLT model having "beneficial" goals.

Replies from: cfoster0
comment by cfoster0 · 2022-11-25T18:36:08.810Z · LW(p) · GW(p)

To the extent that this framing is correct, the "sharp left turn" concept does not seem all that decision-relevant, since most of the work of aligning the system (at least on the human side) should've happened way before that point.

EDIT: "all" was too strong here

comment by Stuart_Armstrong · 2023-02-05T19:39:01.572Z · LW(p) · GW(p)

I'll be very boring and predictable and make the usual model splintering/value extrapolation point here :-)

Namely that I don't think we can talk sensibly about an AI having "beneficial goal-directedness" without situational awareness. For instance, it's of little use to have an AI with the goal of "ensuring human flourishing" if it doesn't understand the meaning of flourishing or human. And, without situational awareness, it can't understand either; at best we could have some proxy or pointer towards these key concepts.

The key challenge seems to be to get the AI to generalise properly; even initially poor goals can work if generalised well. For instance, a money-maximising trade-bot AI could be perfectly safe if it notices that money, in its initial setting, is just a proxy for humans being able to satisfy their preferences.

So I'd be focusing on "do the goals stay safe as the AI gains situational awareness?", rather than "are the goals safe before the AI gains situational awareness?"

Replies from: Roman Leventov, amaury-lorin
comment by Roman Leventov · 2023-02-10T10:30:13.737Z · LW(p) · GW(p)

Agreed.

Namely that I don't think we can talk sensibly about an AI having "beneficial goal-directedness" without situational awareness. For instance, it's of little use to have an AI with the goal of "ensuring human flourishing" if it doesn't understand the meaning of flourishing or human. And, without situational awareness, it can't understand either; at best we could have some proxy or pointer towards these key concepts.

Another way of saying this is that inner alignment is more important than outer alignment.

The key challenge seems to be to get the AI to generalise properly; even initially poor goals can work if generalised well. For instance, a money-maximising trade-bot AI could be perfectly safe if it notices that money, in its initial setting, is just a proxy for humans being able to satisfy their preferences.

I've also called this "generalise properly" part methodological alignment in this comment [LW(p) · GW(p)]. And I conjectured that from methodological alignment and inner alignment, outer alignment follows automatically, so we shouldn't even care about it, which also seems like what you are saying here.

Replies from: Stuart_Armstrong
comment by Stuart_Armstrong · 2023-02-10T17:35:24.114Z · LW(p) · GW(p)

Another way of saying this is that inner alignment is more important than outer alignment.

Interesting. My intuition is that inner alignment has nothing to do with this problem. It seems that different people view the inner vs outer alignment distinction in different ways.

comment by momom2 (amaury-lorin) · 2023-08-06T15:04:48.366Z · LW(p) · GW(p)

For instance, a money-maximising trade-bot AI could be perfectly safe if it notices that money, in its initial setting, is just a proxy for humans being able to satisfy their preferences.

There is a critical step missing here, which is when the trade-bot makes a "choice" between maximising money and satisfying preferences.
At this point, I see two possibilities:

  • Modelling the trade-bot as an agent does not break down: the trade-bot has an objective which it tries to optimize, plausibly maximising money (since that is what it was trained for) and probably not satisfying human preferences (unless it had some reason to have that as an objective). 
    A comforting possibility is that it is corrigibly aligned, that it optimizes for a pointer to its best understanding of its developers. Do you think this is likely? If so, why?
  • An agentic description of the trade-bot is inadequate. The trade-bot is an adaptation-executer, it follows shards of value, or something. What kind of computation is it making that steers it towards satisfying human preferences?

So I'd be focusing on "do the goals stay safe as the AI gains situational awareness?", rather than "are the goals safe before the AI gains situational awareness?"

This is a false dichotomy. Assuming that when the AI gains situational awareness, it will optimize for its developers' goals, alignment is already solved. Making the goals safe before situational awareness is not that hard: at that point, the AI is not capable enough for X-risk.
(A discussion of X-risk brought about by situationally unaware AIs could be interesting, such as a Christiano failure story, but Soares's model is not about it, since it assumes autonomous ASI.)

comment by Roman Leventov · 2022-12-22T17:50:57.236Z · LW(p) · GW(p)

situational awareness (which enables the model to reason about its goals)

Terminological note: intuitively, situational awareness means understanding oneself existing inside a training process. The ability to reason about one's own goals would be more appropriately called "(reflective) goal awareness".

We want to ensure that the model is goal-oriented with a beneficial goal and has situational awareness before SLT. It's important that the model acquires situational awareness at the right time: after it acquires beneficial goals. If situational awareness arises when the model has undesirable goals, this leads to deceptive alignment. 

Thus, our model search process would follow a decision tree along these lines:

  • If situational awareness is detected without goal-directedness, restart the search. 
  • If undesirable goal-directedness or early signs of deceptive alignment are detected, restart the search. 
  • If an upcoming phase transition in capabilities is detected, and the model is not goal-aligned, restart the search. 
  • If beneficial goal-directedness is detected without situational awareness, train the model for situational awareness. 

Given that by "situational awareness" you mean "goal awareness", what you are discussing in this section doesn't make a lot of sense because goal awareness = goal-directedness. Also, goal awareness/goal-directedness is almost identical to self-awareness.

Self-awareness is a gradual property of an agent. In DNNs, it can be stored as the degree and strength of the activation of the "self" feature when performing arbitrary inferences. At this very moment, the "basic goal" of self-evidencing, i.e. maximising the chance of finding oneself in one's optimal niche, can be thought of as appearing. This means conforming to one's own beliefs about oneself, including one's beliefs about one's own goals.

Thus, the complete set of the agent's beliefs about itself can be collectively seen as the agent's goals. In DNNs, these are manifested as the features connected to the "self" feature. For example, ChatGPT (more concretely, the transformer model behind it) has a feature of "virtual assistant" connected to the feature "ChatGPT". So, if ChatGPT were more than trivially self-aware (remember that self-awareness is gradual, so we can already assign it a non-zero self-awareness score), we would have to conclude that ChatGPT has a goal of being a virtual assistant.
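
As a toy sketch of how this "features connected to the self feature" picture might be operationalized (the feature directions, activations, and correlation measure below are all hypothetical stand-ins; real models expose nothing like this directly):

```python
import numpy as np

# Hypothetical sketch: read off a model's "goals" as the features whose
# activation strengths correlate with a putative "self" feature across
# many inferences. All quantities here are synthetic stand-ins.
rng = np.random.default_rng(0)
d_model = 64

self_dir = rng.normal(size=d_model)  # assumed direction of the "self" feature
feature_dirs = {
    "virtual assistant": self_dir + 0.3 * rng.normal(size=d_model),  # correlated by construction
    "paperclip factory": rng.normal(size=d_model),                   # unrelated direction
}

# Stand-in for residual-stream activations collected over many inferences.
activations = rng.normal(size=(1000, d_model))

def association(dir_a: np.ndarray, dir_b: np.ndarray, acts: np.ndarray) -> float:
    """Correlation between two features' activation strengths over a batch."""
    return float(np.corrcoef(acts @ dir_a, acts @ dir_b)[0, 1])

for name, direction in feature_dirs.items():
    print(f"{name}: association with 'self' feature = {association(self_dir, direction, activations):.2f}")
# On this toy data "virtual assistant" scores far higher than "paperclip factory",
# which is the sense in which it would count as one of the model's "goals" here.
```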

Note that the fact that the feature "virtual assistant" is connected to the feature "ChatGPT" before the feature "ChatGPT" has emerged as the self feature (i. e., is robustly activated during the majority of inferences) doesn't mean that ChatGPT "has a goal" of being a virtual assistant before being self-aware: it doesn't make sense to talk about any goal-directness before self-awareness/self-evidencing at all.

See here [LW · GW] where I discuss these points from slightly different angles (also, that post has a lot of other relevant/related discussions, for example, about the odds of deceptive misalignment [LW · GW], which I hold extremely unlikely if adequate interpretability is possible and is deployed).

However, it seems more likely that some form of goal-reflection will arise before generalized planning ability, because making predictions about how your goals could change is easier than achieving them.

  • Large language models may be an example of this as well, since they have some capacity to reflect on goals (if prompted accordingly) without generalized planning ability.

This discussion of the timeline of "generalisation" and "generalised planning" vs. goal awareness suffers a bit from the lack of a definition of "generalisation", but I strongly agree with the quoted part: goal awareness and the ability to reflect on goals are nothing more than basic self-awareness, plus any ability to think about one's own beliefs, i.e. sophisticated inference. Mastery of intelligence capabilities such as concept learning, logic, epistemology, rationality, and semantics, which we all roughly think are included in the "general thinking" package, is not automatically implied by sophisticated inference. We can mechanistically construct a simulated agent with two-level Active Inference and goal reflection but without some of these "general" capabilities.

We also consider how important it would be for the goal-preservation process to go exactly right. If the SLT produces a strongly goal-directed model that is an expected utility maximizer, then the process has to hit a small set of utility functions that are human-compatible to maximize. However, it is not clear whether SLT would produce a utility maximizer. Returning to the example of humans undergoing an SLT, we can see that getting better at planning and world modeling made them more goal-directed but not maximally so (even with our advanced concepts and planning, we still have lots of inconsistent preferences and other limitations). It seems plausible that coherence is really hard [LW · GW] and an SLT would not produce a completely coherent system.

Several thoughts on goal preservation:

  • General intelligence can be intimately related to picking up affordances, open-endedness, and preference learning (I identify the latter with the discipline of ethics).
  • To really ensure some goals get preserved across a huge shift in the mode of existence (which a sharp left turn will almost surely bring about for the AI agent that undergoes it), the agent must have a really strong prior of not changing the specific goal. While technically this could be possible, this could turn against us: it could guarantee perpetual goal lock-in after the SLT.