Agency overhang as a proxy for Sharp left turn

post by Eris (anton-zheltoukhov), Iuliia Levin (iuliia-levin) · 2024-11-07T12:14:24.333Z · LW · GW · 0 comments

Contents

  Summary
  The non-summary
    Theory of change
    Project plan
    Backup plan
  Output
    Minimum Viable Product
  Risks and downsides
  Acknowledgements
  Team
    Research Lead. Anton Zheltoukhov
    Roles and skill requirements
      Prompt engineer
      Interpretability engineer
      Conceptual researcher
    Team size
  Reference set
    SLT
    Evals
    Elicitation methods
    Overhang

I've been accepted as a mentor for the next AI Safety Camp. You can apply to work with me and the team. 

The deadline for applicants is November 17.

The program will run from January 11 to April 27. 

Summary

Core underlying hypothesis: we believe there is a significant agency overhang [LW · GW] in modern LLMs, meaning a model's performance could increase substantially with the introduction of more powerful elicitation/scaffolding methods, without any improvement to the model itself, because prompting and scaffolding techniques are still in their early days. For model evaluations this means that current evaluations systematically undershoot the real level of capabilities and, by extension, the level of risk involved.

We see several important research questions that have to be answered:

  1. Is the core assumption even true? We want to show that one can elicit peak performance using narrow, highly specialised prompts and scaffolds and locally beat general state-of-the-art performance.
  2. How should the overhang be factored into the overall model evaluation procedure?
  3. Is it possible to estimate the real size of the overhang (e.g. by developing an evaluation technique that measures the gap between current SOTA performance and the theoretically possible peak performance)?
  4. How big an increase have existing scaffolding techniques already introduced?

We will decide which exact paths to pursue later. We expect to remain flexible and to shift between paths when presented with new evidence.

The non-summary

This line of research was born as an attempt to tackle the problem known as the Sharp Left Turn.

The Sharp Left Turn (SLT) is a highly probable situation [LW · GW] in which the growth of a model's generalisation ability outpaces the growth of alignment measures, rendering those measures ineffectual, which in turn may lead to catastrophic consequences.

Assuming we continue with the transformer + scalable oversight + RLHF paradigm, you can picture the SLT as follows. Think of a model as having a state graph. Via fine-tuning we prune out the paths leading towards dangerous states. Generalisation can then be viewed as an increase in the number of paths between any two nodes. In this sense the SLT can be viewed as an inability to identify and prune new dangerous paths at the same rate at which they are introduced.
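To make the graph picture concrete, here is a toy sketch of the dynamic (our own illustration; the node names and numbers are made up and carry no claim about real training dynamics):

```python
# Toy illustration of the state-graph analogy: fine-tuning prunes known paths
# to dangerous states, while generalisation keeps adding new, unanticipated
# paths. Node names are hypothetical.

def count_paths(graph, start, goal, visited=frozenset()):
    """Count simple paths from start to goal in a directed graph (dict of sets)."""
    if start == goal:
        return 1
    return sum(
        count_paths(graph, nxt, goal, visited | {start})
        for nxt in graph.get(start, set()) - visited
    )

# Initial state graph: one known path from the start state to a dangerous state.
graph = {
    "start": {"helpful", "deceptive_plan"},
    "helpful": {"task_done"},
    "deceptive_plan": {"dangerous"},
}
print(count_paths(graph, "start", "dangerous"))  # 1 known dangerous path

# "Fine-tuning": prune the dangerous path we know about.
graph["start"].discard("deceptive_plan")
print(count_paths(graph, "start", "dangerous"))  # 0 -- looks aligned

# "Generalisation": new capabilities add edges, including ones nobody pruned.
graph["helpful"].add("tool_use")
graph["tool_use"] = {"task_done", "dangerous"}
print(count_paths(graph, "start", "dangerous"))  # 1 again -- a new, unpruned path
```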

This is the connection between the SLT and scaffolding overhang: whichever SLT scenario is most probable, it will play out in the territory between states that are addressable (and thus prunable) and the unaddressed peak states.

Many influential safety researchers consider the SLT one of the hard bits of alignment, a critical problem that has to be resolved in order to give us a chance at a successful ASI.

There are many ways an SLT could occur; here we are trying to address only one possible route. Victoria Krakovna and researchers from MIRI produced a great analysis of the threat model. Excerpted from Refining the Sharp Left Turn threat model, part 1: claims and mechanisms [LW · GW]:

Mechanisms for a rapid phase transition

A rapid phase transition happens if there is a capability overhang: the AI system is improving at various skills continuously, but its improvement in many domains is bottlenecked on one specific skill, and at some point it receives some input that makes its existing capabilities much more effective. Here are some ways this can happen:

Analogy to few-shot prompting: the capabilities are already present in the trained artefact. Any alignment technique that goes through gradient updates becomes irrelevant. Putting the artefact into the “right” situation (e.g., giving it a few-shot prompt) reveals its capabilities relevant to this situation. Mechanism: the relevant knowledge and capabilities are installed by some generic pre training optimisation process.        
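To make the few-shot analogy in the excerpt concrete, here is a minimal illustration of our own (the toy prompts below are ours, not taken from the cited post):

```python
# The same (hypothetical) model, with the exact same weights and no gradient
# updates, often performs much better when the prompt puts it "into the right
# situation". Any gap between these two prompts is pure elicitation.

zero_shot = "Convert to past tense: 'she sings'"

few_shot = """Convert to past tense.
'he runs' -> 'he ran'
'they eat' -> 'they ate'
'she sings' ->"""

# Conceptually: model(zero_shot) vs model(few_shot) differ only in the input,
# so any accuracy difference comes from prompting, not from training.
```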

We have preliminarily considered many angles from which to approach the problem. Focusing on eliciting peak capabilities and then analysing the resulting leap in capabilities seems like the best approach.

Note: the perspective on the SLT given above is not the one used by MIRI. This is intentional. To the best of my knowledge, their model of the SLT focuses on the shape of the capabilities landscape and on the claim that the properties leading to the highest performance are the same ones that lead to treacherous actions. We think this is not a useful operationalisation of the dynamic in the current situation. Instead we aim to (eventually) build a mechanistic model rooted in the current ML paradigm and later build a conceptual bridge between the two.

Theory of change

Successfully demonstrating that there is a significant margin to be gained using only existing methods could change the perspective of the governance sector, namely by dispelling the somewhat pristine picture painted by current evaluation measures.

Project plan

  1. We will start with a literature review of the latest elicitation methods.
    1. We will investigate the three current leading hypotheses about which types of methods lead to peak capabilities:
      1. Domain-specific prompts [ref]
      2. Better meta-thinking strategies based on the notion of model organisms and how to integrate them efficiently [ref]
      3. Prompt generators
  2. The second stage is dedicated to experiments and to building a base of precedents.
  3. The next stage is about trying to identify generalisable clusters of precedents, ranked by the increase in performance compared to default elicitation methods. The goal here is to build a model of the error margin of SOTA evaluation methods.
  4. (Longshot) We will try to build the shape of a peak-elicitation pipeline to estimate the theoretical limit of current capabilities. Our current best bet is a chain (council) of LLMs specialised in prompting; a sketch follows this list.
  5. Optional track: building a map of which alignment agendas contribute to preventing the SLT.
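As a rough illustration of step 4, below is a hedged sketch of what a council-of-prompting-LLMs pipeline might look like. The `ask_llm` stub, the prompter roles, and the selection logic are all hypothetical placeholders, not a committed design:

```python
# Hypothetical sketch: several specialised "prompter" models each propose a
# prompt for the task, a judge picks the most promising one, and the task is
# then run with the selected prompt.

def ask_llm(system: str, user: str) -> str:
    """Placeholder: send one (system, user) exchange to an LLM and return its reply."""
    raise NotImplementedError("Wire this up to an actual model API.")

PROMPTER_ROLES = [
    "You write domain-specific prompts with precise terminology.",
    "You write prompts that force explicit step-by-step reasoning.",
    "You write prompts that decompose the task into checkable subtasks.",
]

def council_elicit(task: str) -> str:
    # Each specialised prompter proposes a candidate prompt for the task.
    candidates = [
        ask_llm(role, f"Write the single best prompt for this task:\n{task}")
        for role in PROMPTER_ROLES
    ]
    # A judge model picks the candidate most likely to elicit peak performance.
    listing = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(candidates))
    choice = ask_llm(
        "You select the prompt most likely to elicit peak performance. Reply with the index only.",
        f"Task: {task}\n\nCandidates:\n{listing}",
    )
    best = candidates[int(choice.strip())]
    # Finally, run the task with the selected specialised prompt.
    return ask_llm(best, task)
```

The only point of the sketch is that prompt specialisation itself can be delegated to models; the actual selection and aggregation scheme is an open question for the project.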

Backup plan

We have compiled a fairly flexible list of possible approaches to the problem and expect to shift between them when necessary.

Output

The desired shape of the result is a private report, shareable only with trusted researchers and labs.

Minimum Viable Product

The goal is to make a serious attempt at beating SOTA capability results using narrower, highly specialised prompts/scaffolds.

A significant number of such successes would effectively mean that some (many?) of the current evaluations systematically underestimate the real capabilities of LLMs.
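A minimal sketch of how this measurement could be set up, assuming a benchmark of question/answer items and hypothetical `run_model` and `is_correct` stubs (none of these names are fixed yet):

```python
# Run the same benchmark items once with a generic prompt and once with a
# narrow, specialised prompt, and report the gap. The benchmark format and
# the stubs are illustrative assumptions, not a finalised harness.

from typing import Callable

def elicitation_gap(
    items: list[dict],                       # each item: {"question": ..., "answer": ...}
    run_model: Callable[[str, str], str],    # (system_prompt, question) -> model answer
    is_correct: Callable[[str, str], bool],  # (model answer, reference answer) -> bool
    generic_prompt: str,
    specialised_prompt: str,
) -> float:
    """Return accuracy(specialised) - accuracy(generic) on the same items."""
    def accuracy(system_prompt: str) -> float:
        hits = sum(
            is_correct(run_model(system_prompt, item["question"]), item["answer"])
            for item in items
        )
        return hits / len(items)
    return accuracy(specialised_prompt) - accuracy(generic_prompt)
```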

Risks and downsides

Developing new prompting methods may itself contribute to progress in AI capabilities.

Acknowledgements

This research proposal has been developed in close collaboration with Iuliia Levin, whose contribution has been invaluable. We will continue working on this project together. From the perspective of AISC she is an external member of the team.

Team

Research Lead. Anton Zheltoukhov

~9 years of LW exposure =)

Finished AISC (Positive Attractors team led by Robert Kralisch [LW · GW]).

Finished ARENA.

On and off working on a personal conceptual blue-sky-like agenda called Narrative Theory. It has been partially published on LW [LW · GW].

Have 6 years in tech as a dev/QA under my belt.

Time commitment: 15-20 hours per week

Roles and skill requirements

Prompt engineer

The main goal for this role is to explore various prompting techniques, develop new ones, and analyse the observations.

Coding experience is a must. Formal ML experience would be great, but its absence is not a deal breaker.

Candidates need a good understanding of how transformers work and familiarity with prompting techniques (e.g. CoT, ...).

Interpretability engineer

The main goal for this role is the same as for the Prompt engineer, but the focus is on "invasive" elicitation methods (e.g. activation steering [? · GW], ...).

On top of the requirements for the Prompt engineer role, mech interp experience is also required.

Conceptual researcher

The main goal for this role differs from the former ones: it is to try to deconfuse the SLT and develop a mechanistic model of it.

Requirements: strong conceptual thinking and general research skills (preferably in ML), a strong security mindset, and familiarity with the threat model landscape.

Team size

2-4 Prompt engineers

1-3 Interpretability engineers

1-2 Conceptual researchers

Reference set

SLT

Evals

Elicitation methods

Overhang
