Why is CE loss >= 5.0 everywhere? Looking briefly at GELU-1L over 128 positions (a short sequence length!) I see our models get 4.3 CE loss. 5.0 seems really high?
Ah, I see your section on this, but I doubt that bad data explains all of this. Are you using a very small sequence length, or an odd dataset?
From my perspective this term appeared around 2021 and became basically ubiquitous by 2022
I don't think this is correct. To add to Steven's answer, in the "GPT-1" paper from 2018 the abstract discusses
...generative pre-training of a language model
on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each
specific task
and the assumption at the time was that the finetuning step was necessary for the models to be good at a given task. This assumption persisted for a long time, with academics finetuning BERT on tasks that GPT-3 would eventually significantly outperform them on. You can tell this from how cautious the GPT-1 authors are about claiming the base model could do anything, and they sound very quaint:
> We’d like to better understand why language model pre-training of transformers is effective. A hypothesis is that the underlying generative model learns to perform many of the tasks we evaluate on in order to improve its language modeling capability
The fact that Pythia generalizes to longer sequences but GPT-2 doesn't isn't very surprising to me -- getting long context generalization to work is a key motivation for rotary, e.g. the original paper https://arxiv.org/abs/2104.09864
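To gesture at why: RoPE rotates each pair of query/key dimensions by a position-dependent angle, so attention scores depend only on relative position, which is the property that helps length generalization. A minimal numpy sketch (the half-split pairing convention here is an assumption, not necessarily the layout any particular model uses):

```python
import numpy as np

def rope(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Rotary embedding sketch: rotate dimension pairs (x1[i], x2[i]) of a
    query/key vector by the angle pos * freqs[i]."""
    half = x.shape[-1] // 2
    freqs = base ** (-2.0 * np.arange(half) / x.shape[-1])  # one frequency per pair
    theta = pos * freqs
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * np.cos(theta) - x2 * np.sin(theta),
                           x1 * np.sin(theta) + x2 * np.cos(theta)], axis=-1)

# Query-key dot products depend only on relative position (3 - 5 == 10 - 12):
q, k = np.random.default_rng(0).normal(size=(2, 8))
assert np.isclose(rope(q, 3) @ rope(k, 5), rope(q, 10) @ rope(k, 12))
```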
Do you apply LR warmup immediately after doing resampling (i.e. immediately reducing the LR, and then slowly increasing it back to the normal value)? In my GELU-1L blog post I found this pretty helpful (in addition to doing LR warmup at the start of training)
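Concretely, the schedule I found helpful looks something like the sketch below (the step counts are made-up illustrative values, not the ones from the blog post):

```python
def lr_scale(step: int, warmup_steps: int = 1000,
             resample_steps=(25000, 50000), resample_warmup: int = 1000) -> float:
    """Multiplier on the base learning rate: linear warmup at the start of
    training, plus a fresh linear warmup immediately after each resampling."""
    scale = min(1.0, step / warmup_steps)  # warmup at the start of training
    for r in resample_steps:               # re-warm after each resampling step
        if r <= step < r + resample_warmup:
            scale = min(scale, (step - r) / resample_warmup)
    return scale
```

The point is that freshly resampled features see small gradients at first, instead of being immediately shredded by a full-size learning rate.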
(This reply is less important than my other)
> The network itself doesn't have a million different algorithms to perform a million different narrow subtasks
For what it's worth, this sort of thinking is really not obvious to me at all. It seems very plausible that frontier models only have their amazing capabilities through the aggregation of a huge number of dumb heuristics (as an aside, I think if true this is net positive for alignment). This is consistent with findings that e.g. grokking and phase changes are much less common in LLMs than in toy models.
(Two objections to these claims are that plausibly current frontier models are importantly limited, and also that it's really hard to prove either me or you correct on this point since it's all hand-wavy)
Thanks for the first sentence -- I appreciate clearly stating a position.
measured over a single token the network layers will have representation rank 1
I don't follow this. Are you saying that the residual stream at position 0 in a transformer is a function of the first token only, or something like this?
If so, I agree -- but I don't see how this applies to much SAE[1] or mech interp[2] work. Where do we disagree?
1. ^ E.g. in this post here we show in detail how an "inside a question beginning with which" SAE feature is computed from which and predicts question marks (I helped with this project but didn't personally find this feature)
2. ^ More generally, in narrow distribution mech interp work such as the IOI paper, I don't think it makes sense to reduce the explanation to single-token perfect accuracy probes, since our explanation generalises fairly well (e.g. the "Adversarial examples" Alexandre found in Section 4.4)
Neel and I recently tried to interpret a language model circuit by attaching SAEs to the model. We found that using an L0=50 SAE while only keeping the top 10 features by activation value per prompt (and zero ablating the others) was better than an L0=10 SAE, both by our task-specific metric and by subjective interpretability. I can check how far this generalizes.
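The filtering step is simple to implement; a sketch (the function name is mine, and it assumes non-negative post-ReLU feature activations of shape [batch, n_features]):

```python
import torch

def topk_filter_sae_features(feature_acts: torch.Tensor, k: int = 10) -> torch.Tensor:
    """Keep only the top-k SAE features by activation value per prompt,
    zero-ablating the rest before the decoder reconstructs the activation."""
    topk = torch.topk(feature_acts, k=k, dim=-1)
    mask = torch.zeros_like(feature_acts)
    mask.scatter_(-1, topk.indices, 1.0)
    return feature_acts * mask
```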
If you have <98% perf explained (on webtext relative to unigram or bigram baseline), then you degrade from GPT4 perf to GPT3.5 perf
Two quick thoughts on why this isn't as concerning to me as this dialogue emphasized.
1. If we evaluate SAEs by the quality of their explanations on specific narrow tasks, full distribution performance doesn't matter
2. Plausibly the safety-relevant capabilities of GPT (N+1) are a phase change from GPT N, meaning that even the much larger loss increase from attaching SAEs to GPT (N+1) could still leave it competitive with GPT N (ht Tom for this one)
Is the drop of eval loss when attaching SAEs a crux for the SAE research direction to you? I agree it's not ideal, but to me the comparison of eval loss to smaller models only makes sense if the goal of the SAE direction is making a full-distribution competitive model. Explaining narrow tasks, or just using SAEs for monitoring/steering/lie detection/etc. doesn't require competitive eval loss. (Note that I have varying excitement about all these goals, e.g. some pessimism about steering)
> 0.3 CE Loss increase seems quite substantial? A 0.3 CE loss increase on the pile is roughly the difference between Pythia 410M and Pythia 2.8B
My personal guess is that something like this is probably true. However, since we're comparing OpenWebText and the Pile with different tokenizers, we can't really compare the two loss numbers, and further there is no GPT-2 extra small model, so currently we can't compare these SAEs to smaller models. But yeah, in future we will probably compare GPT-2 Medium and GPT-2 Large with SAEs attached to the smaller models in the same family, and there will probably be similar degradation, at least until we have more SAE advances.
It's very impressive that this technique could be used alongside existing finetuning tools.
> According to our data, this technique stacks additively with both finetuning
To check my understanding, the evidence for this claim in the paper is Figure 13, where your method stacks with finetuning to increase sycophancy. But there are not currently results on decreasing sycophancy (or any other bad capability), where you show your method stacks with finetuning, right?
(AFAICT currently Figure 13 shows some evidence that activation addition to reduce sycophancy outcompetes finetuning, but you're unsure about the statistical significance due to the low percentages involved)
I previously thought that L1 penalties were just exactly what you wanted for sparse reconstruction.
Thinking about your undershooting claim, I came up with a toy example that made it obvious to me that the Anthropic loss function is not optimal: suppose you are role-playing a single-feature SAE reconstructing the number 2, and are given loss equal to the squared error of your guess plus the absolute value (L1 norm) of your guess. Then over guesses x > 0 the loss (x - 2)^2 + x is minimized at x = 3/2, not 2
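A quick numerical check of the toy example:

```python
import numpy as np

# Single-feature "SAE" reconstructing the target 2, with loss =
# squared reconstruction error + L1 penalty on the guess.
def toy_loss(x: np.ndarray, target: float = 2.0) -> np.ndarray:
    return (x - target) ** 2 + np.abs(x)

xs = np.linspace(0.0, 3.0, 30001)
best_guess = xs[np.argmin(toy_loss(xs))]  # minimized at 3/2, not at the target 2
```

Analytically, for x > 0 the gradient 2(x - 2) + 1 vanishes at x = 3/2, so the L1 penalty systematically shrinks the reconstruction below the true value.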
Note that this behavior generalizes far beyond GPT-2 Small head 9.1. We wrote a paper and an easier-to-digest tweet thread
Thanks, fixed
The number of "features" scales at most linearly with the parameter count (for information theoretic reasons)
Why is this true? Do you have a resource on this?
Thanks!
In general after the Copy Suppression paper (https://arxiv.org/pdf/2310.04625.pdf) I'm hesitant to call this a Waluigi component -- in that work we found that "Negative IOI Heads" and "Anti-Induction Heads" are not specifically about IOI or Induction at all, they're just doing meta-processing to calibrate outputs.
Similarly, it seems possible the Waluigi components are just making the forbidden tokens appear with prob 10^{-3} rather than 10^{-5} or something like that, and would be incapable of actually making the harmful completion likely
What is a Waluigi component? A component that always says a fact, even when there is training to refuse facts in certain cases...?
I really appreciated this retrospective, this changed my mind about the sparsity penalty, thanks!
Related: "To what extent does Veit et al. still hold on transformer LMs?" feels to me like a super tractable and pretty helpful paper someone could work on (Anthropic do some experiments, but not many). People discuss this paper a lot with regard to NNs having a simplicity bias, as well as how the paper implies NNs are really stacks of many narrow heuristics rather than deep paths. Obviously empirical work won't provide crystal-clear answers to these questions, but this doesn't mean working on this sort of thing isn't valuable.
Yeah I agree with the intuition and hadn't made the explicit connection to the shallow paths paper, thanks!
I would say that Edge Attribution Patching is the extreme form of this https://arxiv.org/abs/2310.10348 , where we just ignored almost all subgraphs H except for G \ {e_1} (removing one edge only), and still got reasonable results; this also agrees with some upcoming results.
Why do you think that the sentiment will not be linearly separable?
I would guess that something like multiplying residual stream states by (i.e. the logit difference under the Logit Lens) would be reasonable (possibly with hacks like the tuned lens)
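To spell out the sort of thing I mean (a sketch; the function name is mine, and I'm assuming the TransformerLens [d_model, d_vocab] layout for W_U):

```python
import torch

def logit_lens_logit_diff(resid: torch.Tensor, W_U: torch.Tensor,
                          pos_token: int, neg_token: int) -> torch.Tensor:
    """Project residual stream states through the unembedding and take the
    logit difference between a positive- and a negative-sentiment token."""
    direction = W_U[:, pos_token] - W_U[:, neg_token]  # [d_model]
    return resid @ direction  # one sentiment score per residual state
```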
mechanistic anomaly detection hopes to do better by looking at the agent's internals as a black box
Do you mean "black box" in the sense that MAD does not assume interpretability of the agent? If so, this is kinda confusing, as "black box" is often used in contrast to "white box", i.e. "black box" means you have no access to model internals, just inputs+outputs (which wouldn't make sense in your context)
Thanks, fixed
Thanks for checking this! I mostly agree with all your original comment now (except the first part suggesting it was point blank, but we're quibbling over definitions at this point), this does seem like a case of intentionally not discussing risk
I think your interpretation is fairly uncharitable. If you have further examples of this deceptive pattern from those sympathetic to AI risk I would change my perspective but the speculation in the post plus this example weren't compelling:
I watched the video, and firstly Senator Peters seems to trail off after the quoted part and ends his question by saying "What's your assessment of how fast this is going and when do you think we may be faced with those more challenging issues?". So straightforwardly his question is about timelines, not about risk as you frame it. Indeed Matheny (after two minutes) literally responds "it's a really difficult question. I think whether AGI is nearer or farther than thought ..." (emphasis different to yours), which makes it likely to me that Matheny is expressing uncertainty about timelines, not risk.
Overall I agree that this was an opportunity for Matheny to discuss AI x-risk and plausibly it wasn't the best use of time to discuss the uncertainty of the situation. But saying this is dishonesty doesn't seem well supported
I just saw this post and cannot parse it at all. You first say that you have removed the 9s of confidence. Then the next paragraph talks about a 99.9… figure. Then there are edit and quote paragraphs, and I do not know whether these are your views or others', or whether you endorse them.
The RepEng paper claims SOTA on TruthfulQA by 18%. Is this MC1 from here https://paperswithcode.com/sota/question-answering-on-truthfulqa ? Where is this number coming from? And why is the only main-text evaluation against any other method a single table evaluation against ActAdd (what about ITI? And surely there are other methods outside the LW sphere?)?
I’m glad there’s work trying to use model internals for useful things but the evidence didn’t seem that strong besides single prompts that don’t provide me with that much signal
By "crisp internal representations" I mean that
- The latent variables the model uses to perform tasks are latent variables that humans use. Contra ideas that language models reason in completely alien ways. I agree that the two cited works are not particularly strong evidence :(
- The model uses these latent variables with a bias towards shorter algorithms (e.g. shallower paths in transformers). This is important as it's possible that even when performing really narrow tasks, models could use a very large number of (individually understandable) latent variables in long algorithms such that no human could feasibly understand what's going on.
I'm not sure what the end product of automated interpretability is, I think it would be pure speculation to make claims here.
I think that your comment on (1) is too pessimistic about the possibility of stopping deployment of a misaligned model. That would be a pretty massive result! I think this would have cultural or policy benefits that are pretty diverse, so I don't think I agree that this is always a losing strategy -- after revealing misalignment to relevant actors and preventing deployment, the space of options grows a lot.
I'm not sure anything I wrote in (2) is close to understanding all the important safety properties of an AI. For example, the grokking work doesn't explain all the inductive biases of transformers/Adam, but it has helped better reasoning about transformers/Adam. Is there something I'm missing?
On (3) I think that rewarding the correct mechanisms in models is basically an extension of process-based feedback. This may be infeasible or only be possible while applying lots of optimization pressure on a model's cognition (which would be worrying for the various list of lethalities reasons). Are these the reasons you're pessimistic about this, or something else?
I like your writing, but AFAIK you haven't written your thoughts on interp prerequisites for useful alignment - do you have any writing on this?
Kudos for providing concrete metrics for frontier systems, receiving pretty negative feedback on one of those metrics (dataset size), and then updating the metrics.
It would be nice if both the edit about the dataset size restriction was highlighted more clearly (in both your posts and critic comments).
1. I would have thought that VNM utility has invariance with alpha>0 not alpha>=0, is this correct?
2. Is there any alternative to dropping convex-linearity (perhaps other than changing to convexity, as you mention)? Would the space of possible optimisation functions be too large in this case, or is this an exciting direction?
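On 1, the check I have in mind: by linearity of expectation, a positive affine transform preserves the expected-utility ordering over lotteries,

```latex
\mathbb{E}[u'(L)] = \mathbb{E}[\alpha\, u(L) + \beta] = \alpha\,\mathbb{E}[u(L)] + \beta,
\qquad
\alpha > 0 \implies
\big(\mathbb{E}[u'(L_1)] > \mathbb{E}[u'(L_2)] \iff \mathbb{E}[u(L_1)] > \mathbb{E}[u(L_2)]\big),
```

whereas alpha = 0 makes u' constant and ranks all lotteries as indifferent, so it seems the invariance should indeed require alpha > 0.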
Doesn't Figure 7, top left from the arXiv paper provide evidence against the "network is moving through dozens of different basins or more" picture?
Oops, I was wrong in my initial hunch as I assumed centering writing did something extra. I’ve edited my top level comment, thanks for pointing out my oversight!
No this isn’t about center_unembed, it’s about center_writing_weights as explained here: https://github.com/neelnanda-io/TransformerLens/blob/main/further_comments.md#centering-writing-weights-center_writing_weight
This is turned on by default in TL, so okay, I think there must be something else weird about models, rather than just a naive bias, that causes you to need to do the difference thing
> Can we just add in times the activations for "Love" to another forward pass and reap the sweet benefits of more loving outputs? Not quite. We found that it works better to pair two activation additions.
Do you have evidence for this? It's totally unsurprising to me that you need to do this on HuggingFace models, as the residual stream is very likely to have a constant bias term which you will not want to add to. I saw you used TransformerLens for some part of the project, and TL removes the mean from all additions to the residual stream, which I would have guessed would solve the problem here. EDIT: see reply.
I even tested this:
Empirically in TransformerLens the 5*Love and 5*(Love-Hate) additions were basically identical from a blind trial on myself (I found 5*Love more loving 15 times compared to 5*(Love-Hate) more loving 12 times, and I independently rated which generations were more coherent, and both additions were more coherent 13 times. There were several trials where performance on either loving-ness or coherence seemed identical to me).
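For reference, the centering I mean is something like the sketch below (`steering_act` is a stand-in for the cached "Love" activation; the helper name is mine):

```python
import torch

def centered_steering_addition(steering_act: torch.Tensor, coeff: float = 5.0) -> torch.Tensor:
    """Remove the mean over the d_model dimension before scaling, mimicking
    what TransformerLens's center_writing_weights does to residual stream
    writes, so the addition has no component along the constant bias direction."""
    return coeff * (steering_act - steering_act.mean(dim=-1, keepdim=True))
```

After this centering, 5*Love and 5*(Love-Hate) differ only by the (also centered) Hate component, which may explain why the blind trial found them nearly identical.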
(Personal opinion as an ACDC author) I would agree that causal scrubbing aims mostly to test hypotheses and ACDC does a general pass over model components to hopefully develop a hypothesis (see also the causal scrubbing post's related work and ACDC paper section 2). These generally feel like different steps in a project to me (though some posts in the causal scrubbing sequence show that the method can generate hypotheses, and also ACDC was inspired by causal scrubbing). On the practical side, the considerations in How to Think About Activation Patching may be a helpful field guide for projects where one of several tools might be best for a given use case. I think both causal scrubbing and ACDC are currently somewhat inefficient for different reasons; causal scrubbing is slow because causal diagrams have lots of paths in them, and ACDC is slow because computational graphs have lots of edges in them (follow-up work to ACDC will do gradient descent over the edges).
I think that this critique is a bit overstated.
i) I would guess that human eval is in general better than most benchmarks. This is because it's a mystery how much benchmark performance is explained by prompt leakage and benchmarks being poorly designed (e.g. crowd-sourcing benchmarks has issues with incorrect or not useful tests, and adversarially filtered benchmarks like TruthfulQA have selection effects on their content which make interpreting their results harder, in my opinion)
ii) GPT-4 is the best model we have access to. Any competition with GPT-4 is competition with the SOTA available model! This is a much harder reference class to compare to than models trained with the same compute, models trained without fine-tuning etc.
Note that this seems fairly easy for GPT-4 + a bit of nudging
This is particularly impressive since ChatGPT isn't capable of code-switching (though GPT-4 seems to be from a quick try)
Nice toy circuit - interesting that it can describe both attention-head and MLP key-value stores. Do you have any intuition about what the model is doing mechanistically on the negatively reinforced text? E.g. is the toy model intended to suggest that there is some feature represented as +1 or -1 in the residual stream? If so, do you expect this to be a direction or non-linear?
Am I missing something or is GPT-4 able to do Length 20 Dynamic Programming using a solution it described itself very easily?
https://chat.openai.com/share/8d0d38c0-e8de-49f3-8326-6ab06884df90
We have 100k context models and several OOMs more FLOPs to throw at models; I couldn't see a reason why autoregressive models were limited in a substantial way given the evidence in the paper
Oh huh - those eyes, webs and scales in Slide 43 of your work are really impressive, especially given the difficulty extending these methods to transformers. Is there any write-up of this work?
I am a bit confused by your operationalization of "Dangerous". On one hand
I posit that interpretability work is "dangerous" when it enhances the overall capabilities of an AI system, without making that system more aligned with human goals
is a definition I broadly agree with, especially since you want it to track the alignment-capabilities trade-off (see also this post). However, your examples suggest a more deontological approach:
This suggests a few concrete rules-of-thumb, which a researcher can apply to their interpretability project P: ...
If P makes it easier/more efficient to train powerful AI models, then P is dangerous.
Do you buy the alignment-capabilities trade-off model, or are you trying to establish principles for interpretability research? (or if both, please clarify what definition we're using here)
This was a nice description, thanks!
However, regarding
comprehensively interpreting networks [... aims to] identify all representations or circuits in a network or summarize the full computational graph of a neural network (whatever that might mean)
I think this is an incredibly optimistic hope that needs to be challenged more.
On my model GPT-N has a mixture of a) crisp representations, b) fuzzy heuristics that are made crisp in GPT-(N+1), and c) noise and misgeneralizations. Unless we're discussing models that perfectly fit their training distribution, I expect comprehensively interpreting networks involves untangling many competing fuzzy heuristics which are all imperfectly implemented. Perhaps you expect this to be possible? However, I'm pretty skeptical this is tractable and expect the best interpretability work to not confront these completeness guarantees.
Related (I consider "mechanistic interpretability essentially solved" to be similar to your "comprehensive interpreting" goal)
I liked that you found a common thread in several different arguments.
However, I don't think the views are all believed or all disagreed with in practice, though I do think Yann LeCun would agree with all the points and Eliezer Yudkowsky would disagree with all of them (except perhaps the last point).
For example, I agree with 1 and 5, agree with the first half but not the second half of 2, disagree with 3, and have mixed feelings about 4.
Why? At a high level, I think the extent to which individual researchers, large organizations and LLMs/AIs need empirical feedback to improve are all quite different.
I agree that some level of public awareness would not have been reached without accessible demos of SOTA models.
However, I don’t agree with the argument that AI capabilities should be released to increase our ability to ‘rein it in’ (I assume you are making an argument against a capabilities ‘overhang’ which has been made on LW before). This is because text-davinci-002 (and then 3) were publicly available but not accessible to the average citizen. Safety researchers knew these models existed and were doing good work on them before ChatGPT’s release. Releasing ChatGPT results in shorter timelines and hence less time for safety researchers to do good work.
To caveat this: I agree ChatGPT does help alignment research, but it doesn’t seem like researchers are doing things THAT differently based on its existence. And secondly I am aware that OAI did not realise how large the hype and investment would be from ChatGPT, but nevertheless this hype and investment is downstream of a liberal publishing culture which is something that can be blamed.
I agree that ChatGPT was positive for AI-risk awareness. However from my perspective being very happy about OpenAI's impact on x-risk does not follow from this. Releasing powerful AI models does have a counterfactual effect on the awareness of risks, but also a lot of counterfactual hype and funding (such as the vast current VC investment in AI) which is mostly pointed at general capabilities rather than safety, which from my perspective is net negative.
Given past statements I expect all lab leaders to speak on AI risk soon. However, I bring up the FLI letter not because it is an AI risk letter, but because it is explicitly about slowing AI progress, which OAI and Anthropic have not shown that much support for
Thanks for writing this. As far as I can tell most anger about OpenAI is because i) being a top lab and pushing SOTA in a world with imperfect coordination shortens timelines and ii) a large number of safety-focused employees left (mostly for Anthropic) and had likely signed NDAs. I want to highlight i) and ii) in a point about evaluating the sign of the impact of OpenAI and Anthropic.
Since Anthropic's competition seems to me to be exacerbating race dynamics currently (and I will note that very few OpenAI and zero Anthropic employees signed the FLI letter) it seems to me that Anthropic is making i) worse due to coordination being more difficult and race dynamics. At this point, believing Anthropic is better on net than OpenAI has to go through believing *something* about the reasons individuals had for leaving OpenAI (ii)), and that these reasons outweigh the coordination and race dynamic considerations. This is possible, but there's little public evidence for the strength of these reasons from my perspective. I'd be curious if I've missed something from my point.
This occurs across different architectures and datasets (https://arxiv.org/abs/2203.16634)
[from a quick skim this video+blog post doesn't mention this]