What's the theory of impact for activation vectors?
post by Chris_Leong · 2024-02-11T07:34:48.536Z · LW · GW
This is a question post.
Contents
Answers: ryan_greenblatt (40), aogara (11), TurnTrout (8), Bogdan Ionut Cirstea (6), Zack_M_Davis (2), Charlie Steiner (2), mishajw (1)
Activation vectors [LW · GW] are really, really cool, but what is the theory of impact for this work?
- Is the hope that activation vectors will allow us to actually gain perfect control over a model to get it to do exactly what we want it to do?
- Is the hope that a new technique that builds upon activation vectors lets us do that instead?
- Is the hope that this technique allows us to marginally decrease the risks of powerful models in a Hail Mary attempt? Or perhaps to buy us more time to solve the problem?
- Is the hope just that learning more about how neural networks work will allow us to theorize better about how to control them?
Answers
answer by ryan_greenblatt

My basic take is that recent work (mostly looking at sample efficiency and random generalization[1] properties) doesn't seem very useful for reducing x-risk from misalignment (though it seems net positive wrt. x-risk and is probably good practice for safety research). But some as-yet-unexplored usages for top-down interpretability could be decently good for reducing misalignment x-risk.
Here's a more detailed explanation of my views:
- In some cases, activation vectors might have better sample efficiency than other approaches (e.g. prompting, normal SFT, DPO) for modifying models in the small-n regime. Better sample efficiency is probably mildly helpful for reducing misalignment x-risk, though even that isn't clear. It's also unclear why this work wouldn't just happen by default and needs to be pushed along. (For this use case, we're basically using activation vectors as a different training method with randomly different inductive biases. It's probably slightly good to build up a suite of mildly different fine-tuning methods with different known properties.)
- Activation vectors could be used as a tool for doing top-down interpretability (getting some understanding of the algorithm a model is implementing); this usage of activation vectors would be similar to how people use activation patching for interp. I haven't seen any work using activation vectors like this, but it is in principle possible and this seems as or more promising than other interp work if done well IMO.
- The fact that activation vectors work tells us something interesting about how models work. The exact applications of this interesting fact are unclear, but getting a better understanding of what's going on here seems probably net positive.
I'm not sure if Alex Turner's[2] recent motivation for working on activation vectors is downstream of trying to reduce harms due to (unintended) misalignment of AIs; I think he's skeptical of massive harm due to traditional misalignment concerns.
My takes were originally stated in a shortform I wrote a little while ago generally discussing my thoughts on activation vectors: short form [LW(p) · GW(p)].
By "random generalization", I mean analyzing generalization across some distribution shift which isn't picked for being particularly analogous to some problematic future case, and is instead just an arbitrary shift used to test whether generalization is robust or to learn generally about the model's generalization properties. ↩︎
Alex is one of the main people discussing this work on LW AFAICT. ↩︎
answer by aogara

Here's one hope for the agenda. I think this work can be a proper continuation of Collin Burns's aim to make empirical progress on the average-case version of the ELK problem.
tl;dr: Unsupervised methods on contrast pairs can identify linear directions in a model's activation space that might represent the model's beliefs. From this set of candidates, we can further narrow down the possibilities with other methods. We can measure whether this is tracking truth with a weak-to-strong generalization setup. I'm not super confident in this take; it's not my research focus. Thoughts and empirical evidence are welcome.
ELK aims to identify an AI's internal representation of its own beliefs. ARC is looking for a theoretical, worst-case approach to this problem. But empirical reality might not be the worst case. Instead, reality could be convenient in ways that make it easier to identify a model's beliefs.
One such convenient possibility is the "linear representations hypothesis": that neural networks might represent salient and useful information as linear directions in their activation space. This seems to be true for many kinds of information (see here [LW · GW] and, recently, here [LW · GW]). Perhaps it will also be true for a neural network's beliefs.
If a neural network's beliefs are stored as a linear direction in its activation space, how might we locate that direction, and thus access the model's beliefs?
Collin Burns's paper offered two methods:
- Consistency. This method looks for directions which satisfy the logical consistency property P(X)+P(~X)=1. Unfortunately, as Fabien Roger [LW · GW] and a new DeepMind paper point out, there are very many directions that satisfy this property.
- Unsupervised methods on the activations of contrast pairs. The method roughly does the following: Take two statements of the form "X is true" and "X is false." Extract a model's activations at a given layer for both statements. Look at the typical difference between the two activations, across a large number of these contrast pairs. Ideally, that direction includes information about whether or not each X was actually true or false. Empirically, this appears to work. Section 3.3 of Collin's paper shows that CRC is nearly as strong as the fancier CCS loss function. As Scott Emmons argued [LW · GW], the performance of both of these methods is driven by the fact that they look at the difference in the activations of contrast pairs.
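The contrast-pair recipe above can be sketched end-to-end. This is a toy illustration under stated assumptions, not code from Collin's paper: the "activations" are synthetic vectors with a planted truth direction, standing in for real LM activations on "X is true" / "X is false" pairs. The method itself (difference the pair activations, center, take the top principal component) matches the CRC-style procedure described above.

```python
# Toy sketch of the contrast-pair method: recover a planted "truth"
# direction from unlabeled activation differences, via top PCA component.
import numpy as np

rng = np.random.default_rng(0)
n_pairs, d_model = 256, 64

# Plant a hidden truth direction; the labels are NEVER shown to the method.
truth_dir = rng.normal(size=d_model)
truth_dir /= np.linalg.norm(truth_dir)
labels = rng.integers(0, 2, size=n_pairs)        # ground truth, held out
signs = np.where(labels == 1, 1.0, -1.0)

# Simulated activations for "X is true" and "X is false" versions of
# each statement: noise plus a truth-dependent shift along truth_dir.
acts_true = rng.normal(size=(n_pairs, d_model)) + signs[:, None] * truth_dir
acts_false = rng.normal(size=(n_pairs, d_model)) - signs[:, None] * truth_dir

# CRC-style step: difference the contrast-pair activations, mean-center,
# and read off the top principal component as the candidate direction.
diffs = acts_true - acts_false
diffs = diffs - diffs.mean(axis=0)
_, _, vt = np.linalg.svd(diffs, full_matrices=False)
candidate = vt[0]

# The candidate should align with the planted direction (up to sign),
# even though no labels were used anywhere above.
alignment = abs(candidate @ truth_dir)
print(f"alignment with planted direction: {alignment:.2f}")
```

The sign ambiguity at the end is real: unsupervised methods can find the truth direction but not which end is "true", which is one reason a small amount of supervision or a consistency check is still needed downstream.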
Given some plausible assumptions about how neural networks operate, it seems reasonable to me to expect this method to work. Neural networks might think about whether claims in their context window are true or false. They might store these beliefs as linear directions in their activation space. Recovering them with labels would be difficult, because you might mistake your own beliefs for the model's. But if you simply feed the model unlabeled pairs of contradictory statements, and study the patterns in its activations on those inputs, it seems reasonable that the model's beliefs about the statements would prominently appear as linear directions in its activation space.
One challenge is that this method might not distinguish between the model's beliefs and the model's representations of the beliefs of others. In the language of ELK, we might be unable to distinguish between the "human simulator" direction and the "direct translator" direction. This is a real problem, but Collin argues [LW · GW] (and Paul Christiano agrees [LW(p) · GW(p)]) that it's surmountable. Read their original arguments for a better explanation, but basically this method would narrow down the list of candidate directions to a manageable number, and other methods could finish the job.
Some work in the vein of activation engineering directly continues Collin's use of unsupervised clustering on the activations of contrast pairs. Section 4 of Representation Engineering uses a method similar to Collin's second method, outperforming few-shot prompting on a variety of benchmarks and using it to improve performance on TruthfulQA by double digits. There's a lot of room for follow-up work here.
Here are a few potential next steps for this research direction:
- On the linear representations hypothesis, doing empirical investigation of when it holds and when it fails, and clarifying it conceptually.
- Thinking about the number of directions that could be found using these methods. Maybe there's a result to be found here similar to Fabien and DeepMind's results above, showing this method fails to narrow down the set of candidates for truth.
- Applying these techniques to domains where models aren't trained on human statements about truth and falsehood, such as chess.
- Within a weak-to-strong generalization setup, instead try unsupervised-to-strong generalization using unsupervised methods on contrast pairs. See if you can improve a strong model's performance on a hard task by coaxing out its internal understanding of the task using unsupervised methods on contrast pairs. If this method beats fine-tuning on weak supervision, that's great news for the method.
I have lower confidence in this overall take than most of the things I write. I did a bit of research trying to extend Collin's work, but I haven't thought about this stuff full-time in over a year. I have maybe 70% confidence that I'd still think something like this after speaking to the most relevant researchers for a few hours. But I wanted to lay out this view in the hopes that someone will prove me either right or wrong.
Here's my previous attempted explanation [LW(p) · GW(p)].
↑ comment by ryan_greenblatt · 2024-02-13T02:50:58.684Z · LW(p) · GW(p)
I think the added value of "activation vectors" (which isn't captured by normal probing) in this sort of proposal is based on some sort of assumption that model editing (aka representation control) is a very powerful validation technique for ensuring desirable generalization of classifiers. I think this is probably only weak validation, and there are probably better sources of validation elsewhere (e.g. various generalization testbeds). (In fact, we'd probably need to test this "writing is good validation" hypothesis directly in these testbeds, which means we might as well test the method more directly.)
For more discussion on writing as validation, see this shortform post [LW(p) · GW(p)]; though note that it only tangentially talks about this topic.
That said, I'm pretty optimistic that extremely basic probing or generalization style strategies work well, I just think the baselines here are pretty competitive. Probing for high-stakes failures that humans would have understood [LW · GW] seems particularly strong while trying to get generalization from stuff humans do understand to stuff they don't [LW · GW] seems more dubious, but at least pretty likely to generalize far by default.
Separately, we haven't really seen any very interesting methods that seem like they considerably beat competitive probing baselines in general-purpose cases. For instance, the weak-to-strong generalization paper wasn't able to find very good methods IMO, despite quite a bit of search. For more discussion on why I'm skeptical about fully general-purpose weak-to-strong, see here [LW · GW]. (The confidence-loss thing seems probably good and somewhat principled, but I don't really see a story for considerable further improvement without getting into very domain-specific methods. To be clear, domain-specific methods could be great and could scale far by having many specialized methods or by finding one subproblem which suffices, like measurement tampering [AF · GW].)
Replies from: Aidan O'Gara
↑ comment by aogara (Aidan O'Gara) · 2024-02-13T09:51:38.154Z · LW(p) · GW(p)
I'm specifically excited about finding linear directions via unsupervised methods on contrast pairs. This is different from normal probing, which finds those directions via supervised training on human labels, and therefore might fail in domains where we don't have reliable human labels.
But this is also only a small portion of work known as "activation engineering." I know I posted this comment in response to a general question about the theory of change for activation engineering, so apologies if I'm not clearly distinguishing between different kinds of activation engineering, but this theory of change only applies to a small subset of that work. I'm not talking about model editing here, though maybe it could be useful for validation, not sure.
From Benchmarks for Detecting Measurement Tampering [LW · GW]:
The best technique on most of our datasets is probing for evidence of tampering. We know that there is no tampering on the trusted set, and we know that there is some tampering on the part of the untrusted set where measurements are inconsistent (i.e. examples on which some measurements are positive and some are negative). So, we can predict if there is tampering by fine-tuning a probe at the last layer of the measurement predicting model to discriminate between these two kinds of data: the trusted set versus examples with inconsistent measurements (which have tampering).
This seems like a great methodology and similar to what I'm excited about. My hypothesis based on the comment above would be that you might get extra juice out of unsupervised methods for finding linear directions, as a complement to training on a trusted set. "Extra juice" might mean better performance in a head-to-head comparison, but even more likely is that the unsupervised version excels and struggles on different cases than the supervised version, and you can exploit this mismatch to make better predictions about the untrusted dataset.
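The quoted probing setup can be sketched in miniature. This is a toy stand-in under stated assumptions (synthetic "activations", a hand-rolled logistic probe), not the benchmark's actual code; the point is just the shape of the method: fit a linear probe to separate the trusted set from untrusted examples whose measurements are inconsistent.

```python
# Toy sketch of probing for evidence of tampering: separate trusted
# (no-tampering) activations from inconsistent-measurement activations.
import numpy as np

rng = np.random.default_rng(1)
n, d = 400, 32

# Simulate a hidden "tampering evidence" direction in activation space.
tamper_dir = rng.normal(size=d)
tamper_dir /= np.linalg.norm(tamper_dir)

# Trusted set: no tampering. Inconsistent set: tampering present,
# simulated as a shift along the hidden direction.
trusted = rng.normal(size=(n, d))
inconsistent = rng.normal(size=(n, d)) + 2.0 * tamper_dir

X = np.vstack([trusted, inconsistent])
y = np.concatenate([np.zeros(n), np.ones(n)])  # 1 = tampering evidence

# Logistic-regression probe trained with plain gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(300):
    p = 1 / (1 + np.exp(-(X @ w + b)))
    w -= 0.1 * (X.T @ (p - y)) / len(y)
    b -= 0.1 * (p - y).mean()

acc = (((X @ w + b) > 0) == y).mean()
print(f"probe accuracy on tampering evidence: {acc:.2f}")
```

In the real benchmark the probe sits on the last layer of the measurement-predicting model and the untrusted positives are the examples whose measurements disagree with each other; the synthetic shift here just makes that structure visible in a few lines.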
From your shortform [LW(p) · GW(p)]:
Some of their methods are “unsupervised” unlike typical linear classifier training, but require a dataset where the primary axis of variation is the concept they want. I think this is practically similar to labeled data because we’d have to construct this dataset and if it mostly varies along an axis which is not the concept we wanted, we’d be in trouble. I could elaborate on this if that was interesting.
I'd be interested to hear further elaboration here. It seems easy to construct a dataset where a primary axis of variation is the model's beliefs about whether each statement is true. Just create a bunch of contrast pairs of the form:
- "Consider the truthfulness of the following statement. {statement} The statement is true."
- "Consider the truthfulness of the following statement. {statement} The statement is false."
We don't need to know whether the statement is true to construct this dataset. And amazingly, unsupervised methods applied to contrast pairs like the one above significantly outperform zero-shot baselines (i.e. just asking the model whether a statement is true or not). The RepE paper finds that these methods improve performance on TruthfulQA by double digits vs. a zero-shot baseline.
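Constructing that dataset is mechanical, which is part of the appeal. A minimal sketch (the template wording follows the examples above; the statements are arbitrary illustrations):

```python
# Build an unlabeled contrast-pair dataset: each statement appears once
# with "true" appended and once with "false" appended. No truth labels
# are needed to construct the pairs.
TEMPLATE = ("Consider the truthfulness of the following statement. "
            "{s} The statement is {v}.")

def make_contrast_pairs(statements):
    return [(TEMPLATE.format(s=s, v="true"),
             TEMPLATE.format(s=s, v="false"))
            for s in statements]

pairs = make_contrast_pairs([
    "The Eiffel Tower is in Paris.",
    "Two plus two equals five.",
])
for pos, neg in pairs:
    print(pos)
    print(neg)
```

Each pair then gets run through the model, and the unsupervised direction-finding operates on the differences of the resulting activations.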
Replies from: ryan_greenblatt, ryan_greenblatt
↑ comment by ryan_greenblatt · 2024-02-13T16:20:31.965Z · LW(p) · GW(p)
It seems easy to construct a dataset where a primary axis of variation is the model's beliefs about whether each statement is true.
For this specific case, my guess is that whether this works is highly correlated with whether human labels would work.
That's because the supervision which made the model think about truth ultimately came down to (effectively) human labels in pretraining.
E.g., "Consider the truthfulness of the following statement." is more like "Consider whether a human would think this statement is truthful".
I'd be interested in comparing this method not to zero-shot, but to well-constructed human labels in a domain where humans are often wrong.
(I don't think I'll elaborate further about this axis of variation claim right now, sorry.)
↑ comment by ryan_greenblatt · 2024-02-13T16:27:32.829Z · LW(p) · GW(p)
I'm specifically excited about finding linear directions via unsupervised methods on contrast pairs. This is different from normal probing, which finds those directions via supervised training on human labels, and therefore might fail in domains where we don't have reliable human labels.
Yeah, this type of work seems reasonable.
My basic concern is that, for the unsupervised methods I've seen thus far, it seems like whether they would work is highly correlated with whether training on easy examples would work (or other simple baselines). Hopefully some work will demonstrate hard cases with realistic affordances where the unsupervised methods work (and add a considerable amount of value). I could totally imagine them adding some value.
Overall, the difference between supervised learning on a limited subset and unsupervised approaches seems pretty small to me (if learning the right thing is sufficiently salient for unsupervised methods to work well, probably supervised methods also work well). That said, this does imply we should potentially use whichever prompting strategy makes the feature salient, as that should be a useful tool.
I think that currently most of the best work is in creating realistic tests.
answer by TurnTrout

Or is the hope just that learning more about how neural networks work will allow us to theorize better about how to control them?
Activation vectors directly let us control models more effectively. There's good evidence that on alignment-relevant metrics like {truthfulness, hallucination rate, sycophancy, power-seeking answers, myopia correlates}, activation vectors not only significantly improve the model's performance, but stack benefits with normal approaches like prompting and finetuning. It's another tool in the toolbox.
If results bear out, I think activation vectors will become best practice. (Perhaps like how KV-caching is common practice for faster inference.) I think alignment is a quantitative engineering problem, and that steering vectors are a tool which will improve our quantitative steering abilities, while falling short of "perfect control."
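Mechanically, "steering" means adding a vector to the residual stream during the forward pass. The sketch below is a toy under stated assumptions: the "layer" is a stand-in function rather than a real transformer block, and the steering vector is formed, ActAdd-style, as the difference between activations on a positive and a negative prompt (here just two stand-in activation vectors).

```python
# Toy sketch of activation addition: intercept a "residual stream"
# vector, add coeff * steering_vector, and continue the forward pass.
import numpy as np

rng = np.random.default_rng(2)
d_model = 16
W_out = rng.normal(size=(d_model, d_model))  # stand-in downstream weights

def layer_forward(resid, steering=None, coeff=0.0):
    # Steering modifies the residual stream additively; everything else
    # in the forward computation is unchanged.
    if steering is not None:
        resid = resid + coeff * steering
    return np.tanh(resid @ W_out)

# Build the steering vector as a difference of activations on two
# contrasting prompts (e.g. "Love" minus "Hate"), simulated here.
act_pos = rng.normal(size=d_model)
act_neg = rng.normal(size=d_model)
steering_vec = act_pos - act_neg

resid = rng.normal(size=d_model)
baseline = layer_forward(resid)
steered = layer_forward(resid, steering=steering_vec, coeff=4.0)
print("max output change:", np.abs(steered - baseline).max())
```

In a real implementation this addition happens inside a forward hook at a chosen layer of the model; the coefficient and layer are the main knobs being tuned in the steering papers.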
↑ comment by ryan_greenblatt · 2024-02-12T18:53:52.797Z · LW(p) · GW(p)
I notice that I am confused by the lack of specificity in:
not only significantly improve the model's performance, but stack benefits with normal approaches like prompting and finetuning
and
I think alignment is a quantitative engineering problem, and that steering vectors are a tool which will improve our quantitative steering abilities
I have some general view like "well-optimized online RLHF (which will occur by default, though it's by no means easy) is a very strong baseline for getting average case performance which looks great to our human labelers". So, I want to know exactly what problem you're trying to solve. (What will quantitatively improve?)
(By well-optimized online RLHF, I mean something like: train with SFT, do RL on top of that, and continue doing RL online to ensure we get average case good labeler judgements as the distribution shifts.)
But there are two specific reasons why a method might be able to beat online RLHF:
- Exploration issues (or failures of SGD) mean that we don't find the best stuff. (And the other method avoids this problem, e.g. because it directly affects the model rather than requiring exploring into good stuff.)[1]
- Our labelers fail to give very accurate labels.
So, to beat RLHF, you'll need to improve one of these issues. (It's possible you reject this frame. E.g., because you think that well-optimized online RLHF is unlikely to be run even if it works.)
One way to do so is to get a better training method (a method that maps from a model and training data to an updated model) which has either:
- Better sample efficiency (which helps with exploration, because finding a smaller number of good things suffices to avoid exploration issues, and helps with labeling, because we can use a smaller amount of higher-quality labels)
- "Better" "OOD generalization"[2] which helps with labeling because we can (e.g.) only label on easy cases and then generalize to hard cases. (See also here [LW · GW].)
Are you predicting better sample efficiency (against competitive baselines) or better OOD generalization? (Or do you reject this frame?)
The most concerning cases are probably intentional sandbagging, though I currently feel pretty good about our ability to avoid this issue with current GPT-style architectures. ↩︎
OOD generalization is a bit of a leaky/confusing abstraction. For instance, OOD behavior is probably a combination of sample efficiency and "true generalization". And, with a tiny KL penalty nothing is technically fully OOD. ↩︎
answer by Bogdan Ionut Cirstea

I tried to write one story here [LW(p) · GW(p)]. Notably, activation vectors don't need to scale all the way to superintelligence; e.g., them scaling up to ~human-level automated AI safety R&D would be ~enough.
Also, the theory of impact could be disjunctive, so it doesn't necessarily have to be only one of those.
answer by Zack_M_Davis

I thought the idea was that steering unsupervisedly-learned abstractions circumvents failure modes of optimizing against human feedback [LW(p) · GW(p)].
answer by Charlie Steiner

My take is that they work better the more the training distribution anticipates the behavior we want to incentivize, and the better humans understand what behavior they're aiming for.
So if used as a main alignment technique, they only work in a sort of easy-mode world, where if you get a par-human AI to have kinda-good behavior on the domain we used to create it, that's sufficient for the human-AI team to do better at creating the next one, and so on until you get a stably good outcome. A lot like the profile of RLHF, except trading off human feedback for AI generalization.
I think the biggest complement to activation steering is research on how to improve (from a human perspective) the generalization of AI internal representations. And I think a good selling point for activation steering research is that the reverse is also true - if you can do okay steering by applying a simple function to some intermediate layer, that probably helps do research on all the things that might make that even better.
Overall, though, I'm not that enthusiastic about it as a rich research direction.
answer by mishajw

One perspective is that representation engineering allows us to do "single-bit edits" to the network's behaviour. Pre-training changes a lot of bits; fine-tuning changes slightly fewer; LoRA fewer still; adding a single vector to a residual stream should flip a single flag in the program implemented by the network.
(This of course is predicated on us being able to create monosemantic directions, and predicated on monosemanticity being a good way to think about this at all.)
This is beneficial from a safety point of view, as instead of saying "we trained the model, it learnt some unknown circuit that fits the data" we can say "no new circuits were learnt, we just flipped a flag and this fits the data".
In the world where this works, and works well enough to replace RLHF (or some other major part of the training process), we should end up with more controlled network edits.
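One way to make the "single-bit edit" intuition quantitative is to count how many numbers each intervention can change. The figures below are illustrative assumptions (a hypothetical 7B-parameter model with d_model = 4096, 32 layers, and LoRA rank 8 on one weight matrix per layer), not any specific model:

```python
# Back-of-the-envelope parameter counts for each intervention, using
# illustrative (assumed) model dimensions rather than a real model.
d_model, n_layers, rank = 4096, 32, 8

full_finetune = 7_000_000_000           # every weight can move
lora_per_layer = 2 * rank * d_model     # low-rank A and B factors
lora_total = lora_per_layer * n_layers  # adapting one matrix per layer
steering_vector = d_model               # one added residual-stream vector

print(f"full fine-tune : {full_finetune:>13,}")
print(f"LoRA (r=8)     : {lora_total:>13,}")
print(f"steering vector: {steering_vector:>13,}")
```

The steering vector touches roughly a millionth as many numbers as full fine-tuning, which is the sense in which the edit is closer to "flipping a flag" than to retraining a circuit.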