Refining the Sharp Left Turn threat model, part 1: claims and mechanisms

post by Vika, Vikrant Varma (amrav), Ramana Kumar (ramana-kumar), Mary Phuong (mary-phuong) · 2022-08-12T15:17:38.304Z · LW · GW · 4 comments

This is a link post for https://vkrakovna.wordpress.com/2022/11/25/refining-the-sharp-left-turn-threat-model/

Contents

  What are the main claims of the “sharp left turn” threat model?
    Claim 1. Capabilities will generalize far (i.e., to many domains)
    Claim 1a [Optional]: Capabilities (in different "domains") will all generalize at the same time
    Claim 1b [Optional]: Capabilities will generalize far in a discrete phase transition (rather than continuously) 
    Claim 2. Alignment techniques that worked previously will fail during this transition
    Claim 3: Humans can’t intervene to prevent or align this transition 
  Arguments for the claims in this threat model
  Mechanisms for capabilities generalizing far (Claim 1)
  Mechanisms for a rapid phase transition (Claim 1b)

This is our current distillation of the sharp left turn threat model and an attempt to make it more concrete. We will discuss our understanding of the claims made in this threat model, and propose some mechanisms for how a sharp left turn could happen. This is a work in progress, and we welcome feedback and corrections.

What are the main claims of the “sharp left turn” threat model?

Claim 1. Capabilities will generalize far (i.e., to many domains)

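There is an AI system that (1) performs well and (2) generalizes, i.e., performs well in new domains, which were not optimized for during training, with no domain-specific tuning.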

Generalization is a key component of this threat model because we're not going to directly train an AI system for the task of disempowering humanity, so for the system to be good at this task, the capabilities it develops during training need to be more broadly applicable. 

Some optional sub-claims can be made that increase the risk level of the threat model:

Claim 1a [Optional]: Capabilities (in different "domains") will all generalize at the same time

Claim 1b [Optional]: Capabilities will generalize far in a discrete phase transition (rather than continuously) 

Claim 2. Alignment techniques that worked previously will fail during this transition

Claim 3: Humans can’t intervene to prevent or align this transition 

Arguments for the claims in this threat model

Mechanisms for capabilities generalizing far (Claim 1)

Capabilities will generalize far if learning / reasoning / cognitive work is done outside of the optimization process, similarly to how human cultural evolution happens outside genetic evolution and proceeds faster. Here are some mechanisms for cognitive work getting done outside the optimization process:

Mechanisms for a rapid phase transition (Claim 1b)

A rapid phase transition happens if there is a capability overhang: the AI system is improving at various skills continuously, but its improvement in many domains is bottlenecked on one specific skill, and at some point it receives some input that makes its existing capabilities much more effective. Here are some ways this can happen: 

We will discuss mechanisms for Claim 2 in a future post.

4 comments


comment by Vika · 2023-12-20T19:52:17.210Z

I still endorse the breakdown of "sharp left turn" claims in this post. Writing this helped me understand the threat model better (or at all) and make it a bit more concrete.

This post could be improved by explicitly relating the claims to the "consensus" threat model summarized in Clarifying AI X-risk. Overall, SLT seems like a special case of that threat model, which makes a subset of the SLT claims:

  • Claim 1 (capabilities generalize far) and Claim 3 (humans fail to intervene), but not Claims 1a/b (simultaneous / discontinuous generalization) or Claim 2 (alignment techniques stop working). 
  • It probably relies on some weaker version of Claim 2 (alignment techniques failing to apply to more powerful systems in some way). This seems necessary for deceptive alignment to arise, e.g. if our interpretability techniques fail to detect deceptive reasoning. However, I expect that most ways this could happen would not be due to the alignment techniques being fundamentally inadequate for the capability transition to more powerful systems (the strong version of Claim 2 used in SLT).

comment by joshc (joshua-clymer) · 2022-08-14T18:57:37.651Z

Claim 1: there is an AI system that (1) performs well ... (2) generalizes far outside of its training distribution.

Don't humans provide an existence proof of this? The point about there being a 'core' of general intelligence seems unnecessary.

Replies from: ramana-kumar, weverka

comment by Ramana Kumar (ramana-kumar) · 2022-08-17T14:31:26.123Z

I agree that humans satisfying the conditions of claim 1 is an argument in favour of it being possible to build machines that do the same. A couple of points:

  • I think the threat model would posit the core of general intelligence as the reason both why humans can do these things and why the first AGI we build might also do these things.
  • Claim 1 should perhaps be clearer that it's not just saying such an AI design is possible, but that it's likely to be found and built.

comment by weverka · 2022-11-25T16:18:11.545Z

No, humans do not satisfy the assumptions adopted here, unless you make them more specific.

The definition of Generalize is given above as: "Generalizes, i.e., performs well in new domains, which were not optimized for during training, with no domain-specific tuning".

Whether you think humans do this depends on what you take for "new domains" and "perform well".  

Humans taught to crawl on hardwood floors can crawl on carpeted floors. Humans taught to hunt fly larvae will need further training to hunt big game.