Activation Engineering Theories of Impact

post by kubanetics (jakub-nowak) · 2024-07-18T16:44:33.656Z · LW · GW · 1 comment

Contents

  ToIs
    Low-tax value alignment of LLMs
    Insight into model representations
    Defense against malicious inputs at inference time
    Razor-sharp control 
    Good enough control for AARs
1 comment

Below I summarize other people's thoughts on the Theory of Impact (ToI) for activation engineering. I base this mostly on the "discussion" sections of the relevant papers and on the answers under @Chris_Leong [LW · GW]'s post What's the theory of impact for activation vectors? [LW · GW]

Alex Turner's posts on controlling a maze-solving policy network [LW · GW], and a paper on steering GPT-2-XL by adding an activation vector [LW · GW], introduced activation engineering, a set of "techniques which steer models by modifying their activations". 

As a complement to prompt engineering and finetuning, activation engineering is a low-overhead way to steer models at runtime.
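To make the mechanism concrete, here is a minimal sketch of ActAdd-style steering on GPT-2 via a PyTorch forward pre-hook. The layer index, coefficient, and contrast pair are illustrative choices rather than the paper's exact settings.

```python
# Minimal sketch of activation addition (ActAdd-style) on GPT-2.
# Layer index, coefficient, and contrast pair are illustrative assumptions.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

LAYER = 6    # hypothetical injection layer
COEFF = 4.0  # hypothetical steering strength

def get_resid(prompt: str) -> torch.Tensor:
    """Residual-stream activations entering block LAYER for a prompt."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    captured = {}
    def hook(module, inputs):
        captured["resid"] = inputs[0].detach()
    handle = model.transformer.h[LAYER].register_forward_pre_hook(hook)
    with torch.no_grad():
        model(ids)
    handle.remove()
    return captured["resid"]

# Steering vector from a contrast pair (last-token activations).
pos, neg = get_resid(" Love"), get_resid(" Hate")
steering = pos[:, -1, :] - neg[:, -1, :]

def steer_hook(module, inputs):
    # Shift the residual stream entering block LAYER; weights are untouched.
    return (inputs[0] + COEFF * steering,) + inputs[1:]

handle = model.transformer.h[LAYER].register_forward_pre_hook(steer_hook)
out = model.generate(
    tokenizer("I think dogs are", return_tensors="pt").input_ids,
    max_new_tokens=20,
    pad_token_id=tokenizer.eos_token_id,
)
handle.remove()
print(tokenizer.decode(out[0]))
```

The key point is that nothing about the weights changes; only the activations of a single forward pass are shifted.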

Over a year later, there is now a Slack server to coordinate research projects and propose new ideas or open problems in this area. This is, to my knowledge, the closest we have to a practical implementation of retargeting the search [AF · GW].

ToIs

Low-tax value alignment of LLMs

The original "activation addition" paper claims that

activation engineering can flexibly retarget LLM behavior without damaging general performance. We speculate that this involves changing the model’s currently-active (mixture of) goals. Suitably developed, the activation engineering approach could enable safety progress while incurring a very low ‘alignment tax’ 

Alex Turner claims (as far as I understand) that steering vectors can significantly enhance a model's performance on key aspects of safety: improving truthfulness, reducing the tendency to hallucinate or generate false information, minimizing sycophancy or overly agreeable responses, discouraging power-seeking behavior, and mitigating myopic tendencies.

Activation vectors don't have to be used in isolation; they can be combined with existing techniques like prompting and fine-tuning, leading to even greater improvements. This means that activation vectors may be a valuable addition to the growing arsenal of tools available to AI researchers and developers striving to create more aligned and beneficial AI systems.
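One common recipe for building such behavior-targeted steering vectors (in the spirit of contrastive-activation-addition-style work) is to average the activation difference over many contrast pairs. A rough sketch, assuming the per-example activations at a chosen layer have already been extracted:

```python
# Sketch: build a behavior steering vector as the mean activation difference
# over contrast pairs (e.g. sycophantic vs. non-sycophantic completions).
# `acts_pos` / `acts_neg` are assumed pre-extracted residual-stream
# activations at one layer, shape [n_pairs, hidden_dim].
import torch

def contrastive_steering_vector(acts_pos: torch.Tensor,
                                acts_neg: torch.Tensor) -> torch.Tensor:
    """Mean difference direction, normalized to unit length."""
    direction = (acts_pos - acts_neg).mean(dim=0)
    return direction / direction.norm()

# At inference time the vector is added to the same layer's residual stream
# with a chosen sign and coefficient, e.g. subtracted to reduce sycophancy:
#   resid = resid - coeff * steering_vector
```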

Insight into model representations

Most productive research in this area will tell us something about how neural networks work, which seems like a net positive unless the capabilities advancements it enables offset the benefits to safety [LW · GW]. This is the same dilemma we face in the case of mechanistic interpretability [LW · GW].

Activation engineering could specifically be used as a tool for top-down interpretability [LW(p) · GW(p)], in a similar way to how activation patching/ablation is used for mechanistic interpretability. 

This then would bring us to retargeting the search [AF · GW] and back, and we might iteratively improve on both in a constant "interpretability ↔ steering" loop. This could lead to a new technique that builds upon activation vectors but is a more powerful alignment tool.
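As an illustration of the top-down interpretability use above, a steering direction can double as a probe: project activations onto it to read how active a concept is, or ablate that component to test whether it is causally relevant. A minimal sketch with illustrative shapes:

```python
# Sketch: use a steering direction for top-down interpretability.
# `direction` is assumed to be a unit-norm concept direction of shape [d];
# `resid` is residual-stream activations of shape [seq, d].
import torch

def concept_score(resid: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Per-token projection onto the direction: how strongly the concept fires."""
    return resid @ direction

def ablate_direction(resid: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of the activations along the direction."""
    return resid - torch.outer(resid @ direction, direction)
```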

Some recent safety techniques inspired by representation engineering include Representation Misdirection for Unlearning (RMU) and Short Circuiting for adversarial robustness.

Defense against malicious inputs at inference time

Steering vectors offer a last line of defense against AI misuse risk by giving us control over model behavior at the last possible step: during inference. 

Real-time corrections could prevent harmful or unintended outputs, even in the face of malicious attempts to manipulate the model, like prompt injections or jailbreaks.
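A minimal sketch of what such an inference-time guardrail might look like, assuming we already have a direction associated with harmful requests (the layer, threshold, and variable names are hypothetical):

```python
# Sketch of an inference-time guardrail: at each forward pass, check how
# strongly the residual stream aligns with a (hypothetical) "harmful request"
# direction, flag the request if it exceeds a threshold, and steer away by
# removing that component. Threshold and names are illustrative.
import torch

THRESHOLD = 5.0  # hypothetical trigger level

def make_guard_hook(direction: torch.Tensor, flag: dict):
    direction = direction / direction.norm()
    def hook(module, inputs):
        resid = inputs[0]                       # [batch, seq, d]
        score = resid[:, -1, :] @ direction     # alignment at the last token
        if score.abs().max() > THRESHOLD:
            flag["triggered"] = True
        proj = (resid @ direction).unsqueeze(-1) * direction
        return (resid - proj,) + inputs[1:]     # ablate the harmful component
    return hook

# Hypothetical usage, matching the GPT-2 sketch above:
#   flag = {"triggered": False}
#   handle = model.transformer.h[LAYER].register_forward_pre_hook(
#       make_guard_hook(harm_direction, flag))
```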

Razor-sharp control 

Similarly, Turner claimed that the advantage of activation engineering over techniques like RLHF is that it could help us avoid the failure modes of optimization based on human feedback [LW(p) · GW(p)].

I like the "single-bit edits" analogy [LW(p) · GW(p)] provided by @mishajw [LW · GW]. Traditional methods like pre-training or fine-tuning change many parts of the program at once, making it hard to predict how the behavior will be affected. Steering vectors, on the other hand, allow us to isolate and modify specific aspects, potentially making it safer and more predictable. 

This way, we avoid further training that might result in new circuits being learned.

Good enough control for AARs

Some speculate [LW(p) · GW(p)] that more advanced AIs (those robust to distribution shifts) should converge towards almost the same causal world models, which should be reflected in linear structures inside the network. Therefore we might expect linear activation/representation engineering methods to work just as well, or even better, in those more powerful models. But activation engineering does not have to live up to this expectation and be a silver-bullet remedy. 

However, it might be a sufficient alignment technique for ~human-level automated alignment researchers [LW(p) · GW(p)] (AARs). This could lead to a virtuous cycle where human-AI research teams [LW(p) · GW(p)] become better at aligning bigger models.

For that purpose, steering vectors may not need to be exceptionally robust if combined with other alignment techniques in a Swiss cheese model approach to improve overall safety.

1 comment

Comments sorted by top scores.

comment by Jan Wehner · 2024-07-19T09:08:45.385Z · LW(p) · GW(p)

Thanks for writing this, I think it's great to spell out the ToI behind this research direction!

You touch on this, but I wanted to make it explicit: Activation Engineering can also be used for detecting when a system is "thinking" about some dangerous concept. If you have a steering vector for e.g. honesty, you can measure the similarity with the activations during a forward pass to find out whether the system is being dishonest or not.
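A minimal sketch of that detection idea (the honesty vector, shapes, and any threshold are illustrative assumptions):

```python
# Sketch: compare last-token activations against a (hypothetical) honesty
# steering vector via cosine similarity; a strongly negative or low score
# would serve as a dishonesty signal. Threshold tuning is left open.
import torch
import torch.nn.functional as F

def honesty_score(resid_last_token: torch.Tensor,
                  honesty_vector: torch.Tensor) -> float:
    """Cosine similarity between activations [d] and the honesty vector [d]."""
    return F.cosine_similarity(resid_last_token, honesty_vector, dim=0).item()
```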

You might also be interested in my (less thorough) summary [AF · GW] of the ToIs for Activation Engineering.