An Introduction to Representation Engineering - an activation-based paradigm for controlling LLMs
post by Jan Wehner · 2024-07-14T10:37:21.544Z · LW · GW · 5 comments
Representation Engineering (aka Activation Steering/Engineering) is a new paradigm for understanding and controlling the behaviour of LLMs. Instead of changing the prompt or weights of the LLM, it does this by directly intervening on the activations of the network during a forward pass. Furthermore, it improves our ability to interpret representations within networks and to detect the formation and use of concepts during inference. This post serves as an introduction to Representation Engineering (RE). We explain the core techniques, survey the literature, contrast RE to related techniques, hypothesise why it works, argue how it’s helpful for AI Safety and lay out some research frontiers.
Disclaimer: I am no expert in the area; claims are based on a ~3-week deep dive into the topic.
What is Representation Engineering?
Goals
Representation Engineering is a set of methods to understand and control the behaviour of LLMs. This is done by first identifying a linear direction in the activations that is related to a specific concept [3], a type of behaviour, or a function, which we call the concept vector. During the forward pass, the similarity of activations to the concept vector can help to detect the presence or absence of this concept. Furthermore, the concept vector can be added to the activations during the forward pass to steer the behaviour of the LLM towards the concept direction. In the following, I refer to concepts and concept vectors, but this can also refer to behaviours or functions that we want to steer.
This presents a new approach for interpreting NNs on the level of internal representations, instead of studying outputs or mechanistically analysing the network. This top-down frame of analysis might pose a solution to problems such as detecting deceptive behaviour or identifying harmful representations without a need to mechanistically understand the model in terms of low-level circuits [4, 24]. For example, RE has been used as a lie detector [6] and for detecting jailbreak attacks [11]. Furthermore, it offers a novel way to control the behaviour of LLMs. Whereas current approaches for aligning LLM behaviour control the weights (fine-tuning) or the inputs (prompting), Representation Engineering directly intervenes on the activations during the forward pass allowing for efficient and fine-grained control. This is broadly applicable for example for reducing sycophancy [17] or aligning LLMs with human preferences [19].
This method operates at the level of representations. This refers to the vectors in the activations of an LLM that are associated with a concept, behaviour or task. Golechha and Dao [24] as well as Zou et al. [4] argue that interpreting representations is a more effective paradigm for understanding and aligning LLMs than the circuit-level analysis popular in Mechanistic Interpretability (MI). This is because MI might not be scalable for understanding large, complex systems, while RE allows the study of emergent structures in LLMs that can be distributed.
Method
Methods for Representation Engineering have two important parts: Reading and Steering. Representation Reading derives a vector from the activations that captures how the model represents human-aligned concepts like honesty, and Representation Steering changes the activations with that vector to suppress or promote that concept in the outputs.
For Representation Reading one needs to design inputs, read out the activations and derive the vector representing a concept of interest from those activations. Firstly, one devises inputs that contrast each other with respect to the concept. For example, the prompts might encourage the model to be honest or dishonest, or they might be examples of harmful or harmless responses. Often these prompts are provided in contrastive pairs of inputs that only differ with regard to the concept. Secondly, one feeds these inputs into the network and records the activations. Some papers focus on the activations of specific tokens, while others take the activations across all tokens. Lastly, one finds a linear vector that describes the difference between activations for inputs of different classes. Methods differ in how they derive this vector and from which part of the architecture it is taken.
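The three steps can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method of any specific paper: the activations are synthetic stand-ins for what one would record from an LLM, the concept is planted on a single axis, and the vector is derived with a plain difference of means. The names read_concept_vector, pos_acts, neg_acts and true_direction are hypothetical.

```python
import numpy as np

def read_concept_vector(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Difference-of-means concept vector from contrastive activations.

    pos_acts, neg_acts: arrays of shape (n_pairs, hidden_dim); row i of each
    array comes from the two prompts of contrastive pair i.
    """
    diff = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

# Synthetic stand-in for recorded activations (assumption: not from a real LLM).
rng = np.random.default_rng(0)
hidden_dim, n_pairs = 16, 200
true_direction = np.zeros(hidden_dim)
true_direction[3] = 1.0  # pretend the concept lives on axis 3

# Content shared by both prompts of a pair, plus a concept-dependent shift.
shared = rng.normal(size=(n_pairs, hidden_dim))
pos_acts = shared + 2.0 * true_direction + 0.1 * rng.normal(size=(n_pairs, hidden_dim))
neg_acts = shared - 2.0 * true_direction + 0.1 * rng.normal(size=(n_pairs, hidden_dim))

concept_vector = read_concept_vector(pos_acts, neg_acts)
# Cosine similarity between the recovered vector and the planted direction.
print(np.round(concept_vector @ true_direction, 3))
```

Because the two prompts of a pair share everything except the concept, the shared content cancels in the difference and mostly the concept-related shift remains. Other ways of deriving the vector (for example PCA over the pairwise differences) would slot into the same place.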
For Representation Steering the concept vector is used to change the activations of the LLM during a forward pass. This can be done by simply adding the vector to the activations or projecting the vector onto the activations. Again methods differ in which part of the architecture and at which token position the vector is inserted.
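A minimal sketch of both kinds of intervention, on a made-up array of per-token activations. The function names, the steering strength alpha and the data are assumptions for illustration. The projection variant shown here removes the component of the activations along the concept vector, which is only one possible reading of a projection-based intervention.

```python
import numpy as np

def steer_by_addition(acts, concept_vector, alpha):
    """Add alpha * concept_vector to every token's activation."""
    return acts + alpha * concept_vector

def remove_concept(acts, concept_vector):
    """Project out the component of each activation along concept_vector."""
    unit = concept_vector / np.linalg.norm(concept_vector)
    return acts - np.outer(acts @ unit, unit)

rng = np.random.default_rng(0)
acts = rng.normal(size=(5, 8))  # (n_tokens, hidden_dim), hypothetical layer output
concept_vector = rng.normal(size=8)
unit = concept_vector / np.linalg.norm(concept_vector)

steered = steer_by_addition(acts, unit, alpha=4.0)
ablated = remove_concept(acts, concept_vector)

# Every token moves by alpha along the concept direction.
print(np.round(steered @ unit - acts @ unit, 3))
# After projecting out, nothing is left along the concept direction.
print(np.round(ablated @ unit, 3))
```

In a real model the same operation would be applied to the chosen layer and token positions during the forward pass, for example through a hook on that layer.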
Lastly, some papers also use the concept vector to detect the occurrence of a concept from the activations during a forward pass. This is done by comparing how similar these activations are to the previously derived activation vector based on the concept of interest.
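Detection can be sketched as a similarity score followed by a threshold. This is an illustration under assumptions: cosine similarity is one possible similarity measure, the threshold of 0.5 is arbitrary, and the activations are hand-written toy values rather than model outputs.

```python
import numpy as np

def concept_scores(acts, concept_vector):
    """Cosine similarity between each token activation and the concept vector."""
    unit = concept_vector / np.linalg.norm(concept_vector)
    norms = np.linalg.norm(acts, axis=-1)
    return (acts @ unit) / np.clip(norms, 1e-12, None)

def detect(acts, concept_vector, threshold=0.5):
    """Flag a forward pass when the mean token score exceeds the threshold."""
    return bool(concept_scores(acts, concept_vector).mean() > threshold)

concept_vector = np.array([1.0, 0.0, 0.0, 0.0])
# Toy activations: the first set points along the concept, the second does not.
with_concept = np.array([[3.0, 0.1, 0.0, 0.2], [2.0, -0.1, 0.1, 0.0]])
without_concept = np.array([[0.0, 1.0, 0.5, 0.2], [-0.1, 0.3, 2.0, 0.0]])

print(detect(with_concept, concept_vector), detect(without_concept, concept_vector))
```

In practice the threshold would have to be calibrated on held-out examples, and scores could be inspected per token instead of averaged.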
While this post focuses on Representation Engineering methods as described above, other variants of Activation Engineering are not targeted towards specific concepts, are fine-tuned towards specific outputs or are derived from Sparse Autoencoders.
What has been done so far?
Note: this is not a comprehensive survey
Initial work
While there had been a few previous works that identified where concepts were represented and attempted to manipulate them, this was largely motivated by validating interpretability findings and not intended for controlling high-level behaviour [1,2]. Furthermore, previous work found steering vectors targeted towards changes in model outputs, for example to cause a model to output an exact desired sentence [15] or to find and edit factual associations in an LLM [16]. Lastly, there has been work on engineering the activations of GANs [37], whereas we focus on LLMs.
Turner et al. [3] were the first to explicitly extract vectors associated with concepts from the activations and use them to steer the behaviour of an LLM. A seminal follow-up comes from Zou et al. [4], who propose Linear Artificial Tomography for Representation Reading and Representation Control for steering concepts, and demonstrate their efficacy on a larger range of high-level concepts. Among others, they use it to build a lie detector, decrease power-seeking behaviour and study the representation of ethics in LLMs.
Improved techniques
Improved techniques: Further works have refined the techniques for RE. Qiu et al. [5] develop spectral editing for activations, which projects representations to have high covariance with positive demonstrations and low covariance with negative demonstrations. This makes RE more efficient and applicable to non-linear activations. Von Rütte et al. [12] compare multiple approaches for reading the concept vector from representations and extend it to more concepts, finding that some concepts are not steerable. Jorgensen [13] found that mean-centering as a method of reading the concept vector from representations results in more effective steering vectors. Rimsky et al. [17] investigate Contrastive Activation Addition and find that it provides benefits on top of safety fine-tuning for various alignment-relevant concepts. Stickland et al. [29] reduce RE’s impact on model performance by only applying the steering vector in necessary situations and reducing KL divergence between a steered and unsteered model. Luo et al. [30] build a large dictionary of concept vectors and decompose activations in a forward pass as a sparse linear combination of concept vectors.
Miscellaneous: Multiple papers make use of concept vectors for fine-tuning a model. For example, Yin et al. [14] use RE to identify which parts of the network are relevant to a concept to then fine-tune only this part, and Ackerman [28] fine-tunes representations to be closer to a read-out concept vector.
Trained steering vectors: There exists a separate class of methods for deriving steering vectors, where the vector is optimised to cause certain output behaviours instead of being read out from contrastive inputs. For example, Cao et al. [23] find steering vectors that make preferred behaviours more likely than dispreferred behaviours, which presents an efficient and personalisable method for preference alignment. Rahn et al. [20] identify an activation vector that increases the entropy in the distribution of actions an LLM Agent selects, thus controlling its certainty about the optimal action.
Applications:
Alignment: RE generally gives us more control over models and has thus been used to improve the alignment of LLMs towards human preferences and ethics. In the first work to use RE for AI Alignment, Turner et al. [40] succeeded in changing the goal pursued by an RL agent by clamping a single activation. Liu et al. [19] use RE to identify the differences in activations between preferred and dispreferred outputs and use the derived concept vector to steer the model according to a human’s preferences. Tlaie [21] steers model activations to influence the reasoning of a model towards utilitarian principles or virtue ethics.
ML Security: Work in ML security frames RE as a white-box attack which can be used to elicit harmful behaviours in open-weight LLMs for jailbreaking [8, 9] or red-teaming [10] by reading and steering a concept vector of harmful representations. These works indicate that RE is a very hard-to-defend attack against open-source models, especially when the system prompt can be removed. Further work finds that different jailbreaking methods stimulate the same harmful vector, which indicates that different jailbreaks operate via similar internal mechanisms [26]. Arditi et al. [27] identify a vector that successfully steers the refusal rate to harmful and harmless requests. Zou et al. [22] employ RE to improve the robustness of alignment to adversarial attacks by rerouting harmful representations into circuit breakers.
Honesty & Truthfulness: Multiple projects employ RE to increase honesty or truthfulness in models. These works reduce the rate of hallucinations [7] and improve performance on the TruthfulQA benchmark [4,6]. Zou et al. [4] use the similarity of an honesty vector with the activations during a forward pass to detect whether the model is lying. Marks & Tegmark [39] use linear probes to find a truth direction that can steer an LLM to treat false statements as true and vice versa. Furthermore, RE has been used to control the level of sycophancy of a model [17].
In-context learning: Todd et al. [32] identify function vectors that capture the task carried out by in-context examples from their activations. Similarly, Liu et al. [18] derive an in-context vector from the representations of in-context examples which can be condensed to achieve similar learning behaviour. This indicates that ICL partially works by shifting the representations to the correct task. Li et al. [31] further extend this by improving the derived in-context vector via inner and momentum optimisation.
Interpretability: Many RE papers claim to offer some insight into the model’s representation of the concepts they steer since the fact that RE is able to steer the concept implies that it is represented linearly. For example, RE has been used to identify societal biases in an LLM and validate these findings by suppressing and promoting bias [11]. Similarly, it was found that different jailbreak methods stimulate the same activation vector that corresponds to the refusal concept and effectively steers refusal behaviour [26].
The diversity of applications for RE shows how powerful and general it is as a paradigm for studying LLM representations and steering LLM behaviour.
Related areas
Representation Engineering is related to several areas in Machine Learning. We first discuss methods related to representation reading before comparing representation steering to some alternatives.
Probing
Linear Probing attempts to learn a linear classifier that predicts the presence of a concept based on the activations of the model [33]. It is similar to representation reading in that it also learns a linear direction in activation space related to the concept.
However, Probing requires labels about the presence of a concept, while RE can work in a largely unsupervised fashion by relying on the LM’s own understanding of the concept. While some implementations of RE make use of labelled datasets, others construct contrastive inputs by simply changing the pre-prompt to an input. This saves the effort of human labelling and avoids limiting applications to concepts that humans can effectively label. Furthermore, because RE relies on the LM’s own understanding of the concept, it avoids biasing the search towards the human’s understanding of the concept. Lastly, RE can verify that the identified vector does represent the concept by suppressing/promoting it to achieve the desired behaviour change.
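To make the contrast concrete, here is a minimal sketch of a linear probe: a logistic-regression classifier trained on labelled activations with plain gradient descent. The data is synthetic, with the concept planted on one axis, and the names train_linear_probe, acts and labels are hypothetical. The last lines compare the probe's weight vector with a difference-of-means direction computed from the same data.

```python
import numpy as np

def train_linear_probe(acts, labels, lr=0.1, steps=500):
    """Logistic-regression probe trained with plain gradient descent."""
    w = np.zeros(acts.shape[1])
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(acts @ w + b)))
        grad = probs - labels  # per-example gradient of the log-loss w.r.t. the logit
        w -= lr * acts.T @ grad / len(labels)
        b -= lr * grad.mean()
    return w, b

# Synthetic labelled activations (assumption: not from a real LLM).
rng = np.random.default_rng(0)
hidden_dim, n = 8, 400
labels = rng.integers(0, 2, size=n).astype(float)
direction = np.zeros(hidden_dim)
direction[0] = 1.0
acts = rng.normal(size=(n, hidden_dim)) + np.outer(2.0 * labels - 1.0, 1.5 * direction)

w, b = train_linear_probe(acts, labels)
accuracy = ((acts @ w + b > 0) == (labels == 1)).mean()

# Direction obtained without training a classifier: difference of class means.
diff_of_means = acts[labels == 1].mean(axis=0) - acts[labels == 0].mean(axis=0)
cosine = (w @ diff_of_means) / (np.linalg.norm(w) * np.linalg.norm(diff_of_means))
print(round(float(accuracy), 3), round(float(cosine), 3))
```

Note that the probe cannot be trained without the labels array. In the RE setting the two classes come from how the contrastive prompts were constructed, so no per-example human labelling is needed.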
Activation Patching
Activation patching is a technique where the activations of a specific layer or neuron during a forward pass are replaced by activations from another input, by the mean activation over many inputs or by zero [34].
The most salient difference is that RE doesn’t completely change the activations, but merely adds to them. Thus, RE aims to find how a certain concept is represented, while activation patching serves as an ablation of the function of a specific layer or neuron.
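The difference between replacing and adding can be shown on a toy two-layer network. Everything here is an assumption for illustration: the network, its random weights, the inputs and the steering vector are made up, and forward with its edit_hidden argument is a hypothetical helper, not a library API.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy 2-layer MLP with random weights (hypothetical, stands in for one LLM layer).
W1, W2 = rng.normal(size=(4, 6)), rng.normal(size=(6, 3))

def forward(x, edit_hidden=None):
    """Run the toy network, optionally editing the hidden activations."""
    hidden = np.tanh(x @ W1)
    if edit_hidden is not None:
        hidden = edit_hidden(hidden)
    return hidden @ W2

x_clean, x_other = rng.normal(size=4), rng.normal(size=4)
hidden_other = np.tanh(x_other @ W1)
steering_vector = 0.5 * rng.normal(size=6)

# Activation patching: overwrite the hidden state with the one from another input.
patched = forward(x_clean, edit_hidden=lambda h: hidden_other)
# Steering: keep the hidden state and add a vector to it.
steered = forward(x_clean, edit_hidden=lambda h: h + steering_vector)

# Patching the whole layer reproduces the other run's output.
print(np.allclose(patched, forward(x_other)))
# Steering shifts the clean output by the vector's image under the next layer.
print(np.allclose(steered - forward(x_clean), steering_vector @ W2))
```

Patching discards what the clean input wrote into the layer, whereas steering keeps it and shifts it. In a real model one would usually patch only a single component or position, not the whole layer as in this toy case.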
Model Editing
Model Editing attempts to modify the weights of an NN without training to change its behaviour. Some works attempt to Edit Factual Associations [35], while others perform Task Arithmetic to steer LLMs to carry out certain tasks by adding a task vector to the model's weights [38]. Similarly, RE has been used to edit facts [4] or steer models towards carrying out tasks [18]. However, the approaches differ since Model Editing modifies the model’s weights, while RE directly intervenes on the activations during the forward pass.
Prompting
Changing the (system-)prompt given to an LLM is the most straightforward and commonly used way to steer the behaviour of a model. However, it does so without any direct intervention on weights or activations and does not provide any interpretability benefits.
Safety fine-tuning
Fine-tuning methods like RLHF steer the behaviour of a model by changing the weights according to examples of desired or undesired behaviour. Comparatively, RE has better sample efficiency, is more specific towards concepts, and provides higher usability and robustness.
RE reduces the noise in the alignment signal. RLHF compares two different texts, which might differ on many things aside from the desired alignment property. However, when RE measures the difference in activations, these are taken from prompts that only differ with respect to the alignment property. This reduced noise will help sample efficiency and can remove spurious correlations in the human feedback.
RE is more usable in several ways. Firstly, RE is largely unsupervised, which makes it much cheaper than RLHF, which requires a large number of human labellers. Secondly, the end-product of RE is a simple vector that can be added during the forward pass, while RLHF produces a whole new model. This makes it cheaper to store many different adaptations and also to switch them on or off at will. Thirdly, steering vectors correspond to human-understandable concepts in natural language. Lastly, fine-tuning largely affects later layers [41], while RE can also be applied at earlier layers. Thus RE can lead to deeper and more robust shifts in the model’s reasoning.
Importantly, evidence shows that Representation Engineering adds benefits when a system prompt and safety fine-tuning have already been applied. Thus it can be wise to use all three of these approaches along with other methods like input- and output-filters, building a model safety-in-depth.
Why does it work?
It is somewhat surprising that simply adding a vector to the activations does effectively steer high-level behaviour while not destroying capabilities. Some reasons for this could be:
The Linear Representation Hypothesis: There is evidence that concepts are represented as linear directions in the activation space of Neural Networks [25]. The fact that adding a linear direction correlated with a concept does actually stimulate that concept is evidence for that hypothesis.
Resilience to activation manipulation: Neural Networks are able to remain functional after their activations are manipulated. This resilience might be attributable to dropout during training. Furthermore, the concept vector is derived from previous activations of the same network and is thus in-distribution. This makes it possible to retain general capabilities despite the changes.
LLMs already represent human-understandable concepts: LLMs seem to represent some human-understandable concepts like refusal or honesty. This makes it possible to tap into those representations and change the behaviour. When steering the concepts don’t need to be learned anew but simply identified and promoted.
Direct control over model state: The activations during the forward pass carry the state that the network is in and determine the final prediction. Activation space is also where concepts are expressed. Thus intervening on the activations is a more direct approach for influencing model outputs than changing model weights or inputs.
How could this help with AI Safety?
Opinions are divided on how RE can help us make AI safer. Some see it as a more sample-efficient alternative to alignment fine-tuning that is also easy to turn on and off, thus making control of networks more convenient. Under this view, RE is a useful prosaic alignment technique that can be stacked on top of existing safety methods in a “Swiss cheese” model of safety. A more conceptual advantage of RE over RLHF is that it does not require training from human feedback and thus avoids failure modes like overoptimization.
Others emphasise the interpretability advantages of RE. Aside from generally giving us a better understanding of how NNs work, RE could serve as an alternative to activation patching and improve upon probes by validating the impact of vectors on behaviour. RE might find application in Eliciting Latent Knowledge, like identifying what a model internally believes to be true.
Lastly, others speculate that, after interpretability tools have found an optimiser and optimisation target represented in an NN, RE can be used to retarget the optimisation towards a desired representation.
Personal opinions
Fundamentally I believe that having more methods that give us control over the behaviour of LLMs is a good thing. RE might turn out to just be another useful tool in the box of prosaic alignment. It might end up having some advantages over RLHF like increased sample efficiency, easier on-and-off switching and being more specific towards concepts (see the earlier section “Safety fine-tuning”). Overall RE could present a more direct path to steering models away from certain behaviours like power-seeking or deception.
RE also makes it possible to steer behaviour without having to label it. Labelling is hard because (1) things might be too complicated for humans to judge and (2) some behaviours might be hard to elicit. While RE does require a dataset of inputs related to the concept, this could be significantly easier to generate than correct labels for the dataset, e.g. when evaluating the truthfulness of statements. In this way, RE could be an effective method to elicit latent knowledge and behaviours from LLMs.
RE might present a way to put new interpretability results into practice. For example, one could use insights about the inner workings of a network as a prior for more effective representation reading. This would also make it possible to test the validity of these insights by their effect on the downstream behaviour after representation control.
Lastly, I want to provide concrete examples of how RE might end up being applied to aid in solving important safety problems. RE might be able to detect dishonest behaviour from the activations of a model and thus serve as a warning about deceptive behaviour. RE might be able to manipulate some internal drives and goals of an AI [36] like reducing its desire for power-seeking behaviour. RE might provide robust defences against jailbreak attacks. RE might open up the possibility to personalise the behaviour of models without needing to personalise the model weights, thus enabling efficient personalised alignment.
Open Problems and Limitations
Representation Engineering is a relatively new method. There are multiple flaws and open questions that need to be addressed to make this a reliable tool for researchers and developers.
Concept vectors might not be specific: Steering with a vector for one concept might also steer other concepts as a side effect. For example, Zou et al. [4] find that a happier model refuses fewer harmful requests. This can arise since some concepts are entangled in the representations of the model and since the inputs for some concepts might be correlated. Finding vectors that only represent a single concept would be crucial for more precise steerability. This could be achieved through more sparse interventions and active efforts to minimise side effects.
Concept vectors might not be complete: A concept vector might not catch the concept in all of its aspects. For example, one might want to steer Fairness, but find a vector which only represents fairness of outcomes, but not fairness of opportunity. This might arise because this aspect of the concept is not elicited from the inputs or because the model does not represent that aspect. Progress could be made through a thorough evaluation of concept vectors and careful construction of the set of inputs.
Reliance on model representations: RE is inherently limited to the model's representation of a concept. If a model does not represent a concept or does not represent some aspect of a concept, RE cannot steer it in that way. This limits the kind of concepts that are steerable via RE.
Negative results are inconclusive: We cannot guarantee that a derived concept vector is optimal for steering a concept. As a result, if a method fails to find a linear direction for a concept we cannot claim that that concept is not represented linearly in the model, since the RE method might have simply failed to identify the vector.
Extending to richer notions of representations: Currently RE is limited to linear representations that are statically applied to the activations. Future work might extend RE to be able to read and steer non-linear concept vectors or manifolds. It could attempt to dynamically adapt the concept vector to the inputs through state-spaces of representations. Furthermore, researchers might want to look at activations over time by considering trajectories of activations.
Combining multiple concepts: Practically we will want to steer the model towards more than one concept simultaneously. For example, we might want it to be harmless, helpful and honest. But different vectors might interfere with each other. This is because adding one concept vector changes the distribution of activations compared to the one that another concept vector was derived from. Furthermore, if concepts are not crisp they can have side-effects on other concepts. Steering several concepts at once could be enabled by making concept vectors more specific, sparse and adaptable.
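A toy illustration of this interference, with hand-picked three-dimensional vectors standing in for concept vectors. The vectors, their names and the orthogonalisation step are assumptions for illustration: removing the overlap with one Gram-Schmidt step is only one conceivable mitigation and may itself change what the second vector represents.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def orthogonalise(v, against):
    """Remove from v its component along `against` (one Gram-Schmidt step)."""
    unit = against / np.linalg.norm(against)
    return v - (v @ unit) * unit

# Hypothetical concept vectors that partially overlap.
harmless = np.array([1.0, 0.0, 0.0])
honest = np.array([0.6, 0.8, 0.0])

# Overlap between the two directions before and after orthogonalisation.
print(cosine(harmless, honest))
honest_only = orthogonalise(honest, harmless)
print(cosine(harmless, honest_only))

# Naively adding both vectors with strength 2.0 each overshoots along 'harmless'.
acts = np.zeros(3)
naive = acts + 2.0 * harmless + 2.0 * honest
print(naive @ harmless)
```

When two vectors overlap, steering with both pushes further along the shared direction than either strength alone suggests, which is one simple way such side effects can show up.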
Thorough evaluation of concept vectors: Currently concept vectors are not being thoroughly evaluated according to agreed-upon quality criteria. Defining and implementing such criteria would give us crucial insights into the shortcomings of these methods, increase our confidence in employing RE and spur development in the field. Criteria might include specificity, completeness and generalisability of the concept vectors, the effectiveness of steering, suppression and detection using the concept vector, as well as sensitivity and robustness of the method to relevant and irrelevant changes in the inputs.
Prompt construction is not principled: Currently, the prompts used to elicit concept representations in an LLM are constructed ad-hoc and not in a principled way. This is a problem since LLMs can be sensitive to small changes in the prompt. Thus we cannot guarantee that a used prompt leads to the optimal concept vector. Researchers should study the sensitivity of RE to prompts, apply methods for automatic prompt discovery and identify principles for prompt construction for this task.
Understanding concept vectors: It is unclear what exactly a concept vector represents and how it influences the model's activations. A mechanistic analysis of concept vectors could shed light on specific circuits or model components that are being influenced. This could lead to methodological improvements or justified trust in RE methods. Conversely, RE could also point researchers to parts of the model that are interesting to analyse mechanistically.
Steering can reduce general model performance: Although RE usually doesn’t deteriorate model capabilities by much, these impacts still need to be minimised to make RE viable for use in production. This includes developing less invasive interventions and comprehensively measuring the impact on capabilities.
Steering can induce computational overhead: While adding an activation vector is not very expensive it can still slow down an LLM in production. Reducing this computational overhead is crucial for the scalable adoption of this method.
Applying RE: RE is a very versatile method that has been used to interpret and steer a range of concepts and behaviours. However, there are many more problems to which RE could be applied. For example, one could attempt to:
- Steer goals in LLM Agents
- Steer the pattern/algorithm applied to in-context learning
- Identify, elicit and reduce deceptive behaviour
- Steer cooperation in multi-agent scenarios
- Identify specific ethical values and steer them for personalised alignment
- Elicit harmful behaviours for red-teaming
- Identify inputs that the model has or hasn’t seen during training
Conclusion
Representation Engineering is an emerging research direction. It has been shown to be competitive with or superior to fine-tuning or prompting at controlling model behaviour on many problems. Although it is conceptually quite simple, it should not be underrated as an effective tool for prosaic AI Safety research that can be used today to make models safer, especially since it seems to stack in effectiveness with other methods. However, the methods face some practical and fundamental limitations that need to be overcome to make it a more reliable and effective tool.
Thanks for comments and feedback to Domenic Rosati and Simon Marshall.
References
[28] Ackerman, C. (2024). Representation tuning. AI Alignment Forum. https://www.alignmentforum.org/posts/T9i9gX58ZckHx6syw/representation-tuning
[40] Turner, A., peligrietzer, Mini, U., Monte M, Udell, D. (2023). Understanding and controlling a maze-solving policy network. AI Alignment Forum. https://www.alignmentforum.org/posts/cAC4AXiNC5ig6jQnc/understanding-and-controlling-a-maze-solving-policy-network
5 comments
Comments sorted by top scores.
comment by Logan Riggs (elriggs) · 2024-07-14T19:19:22.653Z · LW(p) · GW(p)
Strong upvoted to get from -6 to 0 karma. Would be great if someone who downvotes could explain?
My read of the paper is this is a topic Jan is researching, and they wrote up their own lit review for their own sake and other's if they're interested in the topic which isn't negative karma-worthy.
Replies from: stephen-mcaleese
↑ comment by Stephen McAleese (stephen-mcaleese) · 2024-11-13T09:28:27.013Z · LW(p) · GW(p)
I strong upvoted as well. This post is thorough and unbiased and seems like one of the best resources for learning about representation engineering.
comment by Dan H (dan-hendrycks) · 2024-07-18T05:49:04.029Z · LW(p) · GW(p)
It's worth noting that activations are one thing you can modify, but many of the most performant methods (e.g., LoRRA) modify the weights. (Representations = {weights, activations}, hence "representation" engineering.)
comment by Logan Riggs (elriggs) · 2024-07-14T19:39:14.291Z · LW(p) · GW(p)
On activation patching:
The most salient difference is that RE doesn’t completely change the activations, but merely adds to them. Thus, RE aims to find how a certain concept is represented, while activation patching serves as an ablation of the function of a specific layer or neuron. Thus activation patching doesn’t directly tell you where to find the representation of a concept.
I'm pretty sure both methods give you some approximate location of the representation. RE is typically done on many layers & then picks the best layer. Activation patching ablates each layer & shows you which one is most important for the counterfactual. I would trust the result of patching more than RE for location of a representation (maybe grounded out as what a SAE would find?) due to being more in-distribution.
Replies from: Jan Wehner
↑ comment by Jan Wehner · 2024-07-15T13:25:23.884Z · LW(p) · GW(p)
Thanks, I agree that Activation Patching can also be used for localizing representations (and I edited the mistake in the post).