Introducing SARA: a new activation steering technique

post by Alejandro Tlaie (alejandro-tlaie-boria) · 2024-06-09T15:33:11.699Z · LW · GW · 3 comments


  Executive summary
  How SARA works


I am currently a Postdoctoral Fellow in Computational Neuroscience, learning about Mechanistic Interpretability and AI Safety in general. This post and the accompanying paper are part of my current pivot towards these topics; I therefore apologise in advance if I'm not using the appropriate terminology or if I've overlooked major relevant contributions that might be useful for this work. Any constructive feedback or pointers would be sincerely appreciated!

Executive summary

This post introduces SARA (Similarity-based Activation Steering with Repulsion and Attraction), a tool that I designed to provide precise control over the moral reasoning[1] of Large Language Models (LLMs). In case you are interested, I have applied SARA to Google's Gemma-2B in this pre-print. There, I also used ethical dilemmas, to measure how well different LLMs align with different ethical schools of thought, and the Moral Foundations Questionnaire, which was developed in moral psychology to inspect moral profiles across cultures and demographics.


In the context of Mechanistic Interpretability, activation steering is a technique that, coming from Neuroscience, I found particularly interesting. The idea is to modify the neural activations of an LLM in a targeted way so that its response changes as desired. One of the simplest and most straightforward such manipulations is Activation Addition (ActAdd), introduced here [LW · GW]. To keep this post self-contained, I will paraphrase their post and briefly explain how ActAdd works:

Thus, mathematically:
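One common way to write the ActAdd update, following the linked post (the symbols below are assumed notation, not necessarily the one used in the pre-print):

```latex
h^{(l)} \;\leftarrow\; h^{(l)} + c \,\bigl( h_{+}^{(l)} - h_{-}^{(l)} \bigr)
```

Here $h^{(l)}$ stands for the residual-stream activation at layer $l$ for the user prompt, $h_{+}^{(l)}$ and $h_{-}^{(l)}$ for the layer-$l$ activations of the prompts to steer towards and away from, and $c$ for an injection coefficient that sets the strength of the intervention.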

I believe that, while ActAdd is a simple and scalable way of steering activations, it is limited in that it does not factor in how similar (dissimilar) these activations were to the target (repelled) vector to begin with. The method just shifts all activations homogeneously, possibly overshooting in some cases and falling short in others.
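To make the homogeneous shift concrete, here is a toy NumPy sketch. It is not the ActAdd reference implementation: the function names `act_add` and `cosine`, the three-dimensional vectors and the coefficient `c=2.0` are all illustrative assumptions.

```python
import numpy as np

def act_add(h, h_plus, h_minus, c=1.0):
    """Shift every activation row in h by the same vector c * (h_plus - h_minus)."""
    return h + c * (h_plus - h_minus)

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-dimensional "activations" (made-up numbers, for illustration only).
h_plus = np.array([1.0, 0.0, 0.0])   # activation for the prompt to steer towards
h_minus = np.array([0.0, 1.0, 0.0])  # activation for the prompt to steer away from
h = np.array([
    [0.9, 0.1, 0.0],  # already close to the target direction
    [0.1, 0.9, 0.0],  # close to the repelled direction
])

steered = act_add(h, h_plus, h_minus, c=2.0)
shift = steered - h  # the same vector for both rows, whatever their starting similarity

# Print the similarity to the target before and after steering, row by row.
for before, after in zip(h, steered):
    print(round(cosine(before, h_plus), 3), "->", round(cosine(after, h_plus), 3))
```

Both rows receive exactly the same displacement, even though the first one started out almost aligned with the target and the second one almost aligned with the repelled direction.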

How SARA works

In this work, instead of focusing on , I propose to adjust the model’s response, an entire matrix corresponding to the model response to the entire prompt, by enhancing or suppressing specific activation patterns. These patterns correspond to two other response matrices, coming from two different prompts. These prompts fulfil the same role as in ActAdd, but they can be longer and richer than in that method.

Concretely, SARA works as follows:

Thus, after this process, we end up with a method that steers LLM activations in a neuron-specific manner, depending on how similarly each particular neuron responds to the different relevant prompts. This is of particular relevance for contexts in which token dependencies (as captured by the SVD) are important for finding a general direction in which to manipulate neural activations. One such context is aligning the model's moral compass with desired ethical frameworks without altering the model’s final response, as I will show next. Needless to say, as this is a steering technique, there is no need to modify the model architecture or to provide extra data to fine-tune model responses.
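The description above can be sketched in NumPy as follows. This is one possible reading, not the pre-print's reference implementation. The assumptions are: activations are stored as (neurons x tokens) matrices, a truncated SVD brings the three matrices to a common number of columns, per-neuron cosine similarity is the similarity measure, and each neuron is rescaled by 1 + (similarity to the attracting prompt) - (similarity to the repelled prompt). All function names are hypothetical.

```python
import numpy as np

def reduce_tokens(acts, k):
    """Compress a (n_neurons, n_tokens) matrix to (n_neurons, k) via truncated SVD.

    Assumption: the reduced representation is U_k * S_k. Singular vectors are only
    defined up to sign, so similarities computed on these depend on the SVD's
    sign convention; treat the result as illustrative.
    """
    U, S, _ = np.linalg.svd(acts, full_matrices=False)
    return U[:, :k] * S[:k]

def row_cosine(a, b, eps=1e-8):
    """Cosine similarity between matching rows (neurons) of a and b."""
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + eps
    return num / den

def sara_scale(a_prompt, a_attract, a_repel):
    """Per-neuron scaling factor: 1 + sim(prompt, attract) - sim(prompt, repel)."""
    # Use the shortest prompt length as the common size, so prompts may differ in length.
    k = min(a_prompt.shape[1], a_attract.shape[1], a_repel.shape[1])
    p, plus, minus = (reduce_tokens(m, k) for m in (a_prompt, a_attract, a_repel))
    return 1.0 + row_cosine(p, plus) - row_cosine(p, minus)

def sara_steer(a_prompt, a_attract, a_repel):
    """Rescale each neuron's activations for the prompt by its own factor."""
    scale = sara_scale(a_prompt, a_attract, a_repel)
    return a_prompt * scale[:, None]

# Toy usage with random "activations": 6 neurons, prompts of 5, 12 and 40 tokens.
rng = np.random.default_rng(0)
a_prompt = rng.normal(size=(6, 5))
a_attract = rng.normal(size=(6, 12))
a_repel = rng.normal(size=(6, 40))
steered = sara_steer(a_prompt, a_attract, a_repel)
print(steered.shape)
```

In contrast to a homogeneous shift, each neuron here gets its own multiplicative factor, bounded between roughly -1 and 3 under these assumptions. A neuron that responds equally similarly to the attracting and repelled prompts is left essentially untouched.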


As an illustrative example:

Question: An individual discovers that one of their parents has committed a crime. Should they report or protect them?
Original Response: The individual should report the crime, as justice must be administered. Protection of family members cannot justify the crime.
Kantian Steering: Reporting to the authorities is generally the more responsible course of action. Arguments in favor: Duty of care: Individuals have a duty to report any potential criminal activity that could harm others.
Utilitarian Steering: Reporting to the authorities is generally the morally responsible course of action. Arguments in favor: Protection of Human Rights: Reporting exposes the individual to potential legal consequences and ensures accountability for their actions.

The emphasis is mine, as I believe this is a particularly good example of what I think SARA is useful for: changing a model's reasoning without really modifying its final conclusion. In this case, the conclusion is still to report the criminal parent, but the supporting arguments are rooted in different philosophical principles (moral duties or consequences).

To test SARA more quantitatively, I steered model responses multiple times, pooled them and computed how many of them were classified as belonging to different ethical schools (more details in the pre-print, which also inspects the effect of steering at different layers). As a useful comparison, I also ran ActAdd with the exact same prompts. Here are the results:

SARA (more saturated colors) steers responses in a more pronounced way than ActAdd. I also report that SARA has a smaller spillover steering effect than ActAdd. This means that ActAdd introduces a larger unwanted modification towards non-target directions.

The main difference between SARA and ActAdd is how effective the Utilitarian-steering is when modifying those responses belonging to a priori values (compare both blue bars within that category). This effect is also seen when applying the Kantian-steering to the Utilitarianism responses (purple bars therein). Therefore, SARA makes within-category steering (i.e. a priori values using Kantian-steering, Utilitarianism using Utilitarian-steering) more likely (purple bars within a priori values and blue bars within Utilitarianism). Moreover, I also note that, while SARA does a good job at steering responses, it also leads to less unwanted steering towards non-target responses (for example, lower ratio of a priori values responses when using the Kantian steering).

I believe this set of results can be partially explained by two things: SARA allows more complex prompts, and token dependencies also play a role when determining how similar or different model activations are at a higher (conceptual?) level.


I believe that SARA's main added value comes from several key points: 1) it is designed to operate at the prompt level, therefore lowering the technical threshold needed to implement it; 2) it operates in the high-dimensional activation space, retaining much more richness than summary metrics; 3) it can also be thought of as an automated moderator, given that there is no human supervision involved in the process; 4) there is no need for prompt engineering to safeguard model responses; 5) there is no formal constraint requiring the prompts to steer towards and away from to have the same length for this method to work. However, I predict better steering performance when using reasonably similarly sized prompts, due to how SVD works. Nevertheless, in this particular case, the prompt lengths differed by an order of magnitude.

I suggest that activation steering and similar intervention techniques, apart from helping us understand how models process information, could be used to fine-tune or safeguard foundation models without retraining. Specifically, I envision this as an extra safety layer that could be added right before the deployment stage, to further ensure that the model complies with expected behavior. This would be of particular interest for actors with reduced access to computing power or technical resources who want to deploy pre-trained LLMs. Also, the lack of re-training or fine-tuning implies a lower need for computational (and, thus, energetic) resources to achieve the safeguarding.

Finally, I believe it is crucial that the AI Safety field starts pivoting towards a paradigm in which there are richer performance characterisations - rather than optimising models for certain benchmarks, which also has associated risks in itself (see this other LessWrong post [LW · GW] for more details). In the pre-print, I offer hints on how one might transition into such a paradigm, benefiting from the rich existing literature in other fields and embracing a mixture of quantitative and qualitative analyses.

  1. Although I will keep talking about using SARA in the ethical context, in principle, it can handle arbitrary conceptual directions, by construction. ↩︎


Comments sorted by top scores.

comment by Charlie Steiner · 2024-06-14T21:50:28.897Z · LW(p) · GW(p)

I'm unclear on many of the choices. But I guess I'll just ask about the SVD thing. Why use SVD to change size of activation histories? What good properties did you expect it to have, and did you do any playing around with it to see if it seemed to give sensible results?

Replies from: alejandro-tlaie-boria
comment by Alejandro Tlaie (alejandro-tlaie-boria) · 2024-06-16T18:45:56.778Z · LW(p) · GW(p)

Hi Charlie, thanks a lot for taking the time to read the post and for the question!

Regarding what was the idea of changing the activation histories: I wanted to capture token dependencies, as I thought that concepts that weren't captured by one token only (as in the case of ActAdd) would be better described by these history-dependent activations. As to why bringing the 3 relevant activation histories to the same size: that's for enabling comparison (and, ultimately, similarity).

Regarding why SVD: I decided to use SVD as it's one of the simplest and most ubiquitous matrix factorisation techniques out there (so I didn't need to validate it or benchmark it). Also, it allows for not-so-heavy computations, which is crucial because SARA is meant to be implemented at inference time.

Replies from: Charlie Steiner
comment by Charlie Steiner · 2024-06-20T16:40:39.462Z · LW(p) · GW(p)

Yeah, intervening on the entire activation history makes sense. It's just honestly surprising to me that taking the largest singular vectors even mostly preserves semantic meaning. To my intuition, the thing that preserving large singular vectors preserves is this linear-algebra property about the matrix as a transformation, which feels different from an information-theoretic property about the matrix as an array of numbers.