Comments
By "crisp internal representations" I mean that
- The latent variables the model uses to perform tasks are latent variables that humans use, contra ideas that language models reason in completely alien ways. I agree that the two cited works are not particularly strong evidence : (
- The model uses these latent variables with a bias towards shorter algorithms (e.g. shallower paths in transformers). This is important as it's possible that even when performing really narrow tasks, models could use a very large number of (individually understandable) latent variables in long algorithms, such that no human could feasibly understand what's going on.
I'm not sure what the end product of automated interpretability is; I think it would be pure speculation to make claims here.
I think that your comment on (1) is too pessimistic about the possibility of stopping deployment of a misaligned model. That would be a pretty massive result! I think this would have fairly diverse cultural and policy benefits, so I don't agree that this is always a losing strategy -- after revealing misalignment to relevant actors and preventing deployment, the space of options grows a lot.
I'm not sure anything I wrote in (2) comes close to understanding all the important safety properties of an AI. For example, the grokking work doesn't explain all the inductive biases of transformers/Adam, but it has helped people reason better about transformers/Adam. Is there something I'm missing?
On (3) I think that rewarding the correct mechanisms in models is basically an extension of process-based feedback. This may be infeasible, or only possible while applying lots of optimization pressure to a model's cognition (which would be worrying for the various List of Lethalities reasons). Are these the reasons you're pessimistic about this, or something else?
I like your writing, but AFAIK you haven't written your thoughts on interp prerequisites for useful alignment - do you have any writing on this?
Kudos for providing concrete metrics for frontier systems, receiving pretty negative feedback on one of those metrics (dataset size), and then updating the metrics.
It would be nice if the edit about the dataset size restriction were highlighted more clearly (in both your posts and the critics' comments).
1. I would have thought that VNM utility has invariance with $\alpha > 0$, not $\alpha \ge 0$; is this correct? (see the note after this list)
2. Is there any alternative to dropping convex-linearity (perhaps other than changing to convexity, as you mention)? Would the space of possible optimisation functions be too large in this case, or is this an exciting direction?
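For reference, the standard statement I have in mind for point 1 (the textbook form, not taken from the post):

```latex
% VNM utility functions are unique only up to positive affine transformation:
% u'(x) = \alpha u(x) + \beta represents the same preferences iff \alpha > 0.
% Allowing \alpha = 0 would give the constant u'(x) = \beta, which no longer
% represents any non-trivial preference ordering.
u'(x) = \alpha\, u(x) + \beta, \qquad \alpha > 0,\ \beta \in \mathbb{R}
```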
Doesn't Figure 7, top left from the arXiv paper provide evidence against the "network is moving through dozens of different basins or more" picture?
Oops, I was wrong in my initial hunch as I assumed centering writing did something extra. I’ve edited my top level comment, thanks for pointing out my oversight!
No this isn’t about center_unembed, it’s about center_writing_weights as explained here: https://github.com/neelnanda-io/TransformerLens/blob/main/further_comments.md#centering-writing-weights-center_writing_weight
This is turned on by default in TransformerLens, so I think there must be something else weird about the models, rather than just a naive bias term, that causes you to need to take the difference.
> Can we just add in times the activations for "Love" to another forward pass and reap the sweet benefits of more loving outputs? Not quite. We found that it works better to pair two activation additions.
Do you have evidence for this? It's totally unsurprising to me that you need to do this on HuggingFace models, as the residual stream is very likely to have a constant bias term which you will not want to add to. I saw you used TransformerLens for some part of the project, and TL removes the mean from all additions to the residual stream, which I would have guessed would solve the problem here. EDIT: see reply.
I even tested this:
Empirically, in TransformerLens the 5*Love and 5*(Love-Hate) additions were basically identical in a blind trial on myself: I found 5*Love more loving 15 times and 5*(Love-Hate) more loving 12 times, and when I independently rated which generations were more coherent, each addition was more coherent 13 times. There were several trials where performance on either loving-ness or coherence seemed identical to me.
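For concreteness, roughly the setup I used (a minimal sketch rather than the exact notebook; it assumes GPT-2 XL, layer 6, and coefficient 5, and adds the steering activations to the prompt's front positions in the spirit of the post's method):

```python
# Minimal sketch of the blind comparison described above (assumed details:
# GPT-2 XL, layer 6, coefficient 5; not the exact notebook that was used).
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-xl")
LAYER, COEFF = 6, 5.0
HOOK = f"blocks.{LAYER}.hook_resid_pre"

def resid_at_layer(prompt: str) -> torch.Tensor:
    """Residual-stream activations for `prompt` at the chosen layer."""
    _, cache = model.run_with_cache(model.to_tokens(prompt))
    return cache[HOOK]  # shape [1, n_prompt_tokens, d_model]

love = resid_at_layer(" Love")
hate = resid_at_layer(" Hate")

def steering_hook(steer: torch.Tensor):
    def hook(resid, hook):
        n = steer.shape[1]
        # Only add on passes that include the prompt's front positions
        # (with KV caching, later passes see a single new position).
        if resid.shape[1] >= n:
            resid[:, :n, :] += COEFF * steer
        return resid
    return hook

prompt = model.to_tokens("I hate you because")
for name, steer in [("5*Love", love), ("5*(Love-Hate)", love - hate)]:
    model.reset_hooks()
    model.add_hook(HOOK, steering_hook(steer))
    out = model.generate(prompt, max_new_tokens=40)
    print(name, model.to_string(out[0]))
model.reset_hooks()
```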
(Personal opinion as an ACDC author) I would agree that causal scrubbing aims mostly to test hypotheses, while ACDC does a general pass over model components to hopefully develop a hypothesis (see also the causal scrubbing post's related work and ACDC paper section 2). These generally feel like different steps in a project to me (though some posts in the causal scrubbing sequence show that the method can generate hypotheses, and also ACDC was inspired by causal scrubbing). On the practical side, the considerations in How to Think About Activation Patching may be a helpful field guide for projects where one of several tools might be best for a given use case. I think both causal scrubbing and ACDC are currently somewhat inefficient, for different reasons: causal scrubbing is slow because causal diagrams have lots of paths in them, and ACDC is slow because computational graphs have lots of edges in them (follow-up work to ACDC will do gradient descent over the edges).
I think that this critique is a bit overstated.
i) I would guess that human eval is in general better than most benchmarks. This is because it's a mystery how much benchmark performance is explained by prompt leakage and by benchmarks being poorly designed (e.g. crowd-sourced benchmarks have issues with incorrect or unhelpful test items, and adversarially filtered benchmarks like TruthfulQA have selection effects on their content which make interpreting their results harder, in my opinion).
ii) GPT-4 is the best model we have access to. Any competition with GPT-4 is competition with the SOTA available model! This is a much harder reference class to compare to than models trained with the same compute, models trained without fine-tuning etc.
Note that this seems fairly easy for GPT-4 + a bit of nudging
This is particularly impressive since ChatGPT isn't capable of code-switching (though GPT-4 seems to be from a quick try)
Nice toy circuit - interesting that it can describe both attention-head and MLP key-value stores. Do you have any intuition about what the model is doing mechanistically on the negatively reinforced text? E.g. is the toy model intended to suggest that there is some feature represented as +1 or -1 in the residual stream? If so, do you expect this to be a direction or something non-linear?
Am I missing something, or is GPT-4 able to do Length 20 Dynamic Programming quite easily, using a solution it described itself?
https://chat.openai.com/share/8d0d38c0-e8de-49f3-8326-6ab06884df90
We have 100k-context models and several OOMs more FLOPs to throw at models; I couldn't see a reason why autoregressive models were limited in a substantial way given the evidence in the paper.
Oh huh - those eyes, webs and scales in Slide 43 of your work are really impressive, especially given the difficulty extending these methods to transformers. Is there any write-up of this work?
I am a bit confused by your operationalization of "Dangerous". On one hand
I posit that interpretability work is "dangerous" when it enhances the overall capabilities of an AI system, without making that system more aligned with human goals
is a definition I broadly agree with, especially since you want it to track the alignment-capabilities trade-off (see also this post). However, your examples suggest a more deontological approach:
This suggests a few concrete rules-of-thumb, which a researcher can apply to their interpretability project P: ...
If P makes it easier/more efficient to train powerful AI models, then P is dangerous.
Do you buy the alignment-capabilities trade-off model, or are you trying to establish principles for interpretability research? (or if both, please clarify what definition we're using here)
This was a nice description, thanks!
However, regarding
comprehensively interpreting networks [... aims to] identify all representations or circuits in a network or summarize the full computational graph of a neural network (whatever that might mean)
I think this is an incredibly optimistic hope that needs to be challenged more.
On my model, GPT-N has a mixture of a) crisp representations, b) fuzzy heuristics that are made crisp in GPT-(N+1), and c) noise and misgeneralizations. Unless we're discussing models that perfectly fit their training distribution, I expect comprehensively interpreting networks involves untangling many competing fuzzy heuristics which are all imperfectly implemented. Perhaps you expect this to be possible? However, I'm pretty skeptical this is tractable and expect the best interpretability work to not confront these completeness guarantees.
Related (I consider "mechanistic interpretability essentially solved" to be similar to your "comprehensive interpreting" goal)
I liked that you found a common thread in several different arguments.
However, I don't think the views are all believed together or all rejected together in practice (though I do think Yann LeCun would agree with all the points and Eliezer Yudkowsky would disagree with all of them, except perhaps the last).
For example, I agree with 1 and 5, agree with the first half but not the second half of 2, disagree with 3, and have mixed feelings about 4.
Why? At a high level, I think the extent to which individual researchers, large organizations, and LLMs/AIs need empirical feedback to improve is quite different in each case.
I agree that some level of public awareness would not have been reached without accessible demos of SOTA models.
However, I don’t agree with the argument that AI capabilities should be released to increase our ability to ‘rein it in’ (I assume you are making an argument against a capabilities ‘overhang’ which has been made on LW before). This is because text-davinci-002 (and then 3) were publicly available but not accessible to the average citizen. Safety researchers knew these models existed and were doing good work on them before ChatGPT’s release. Releasing ChatGPT results in shorter timelines and hence less time for safety researchers to do good work.
To caveat this: I agree ChatGPT does help alignment research, but it doesn’t seem like researchers are doing things THAT differently based on its existence. And secondly I am aware that OAI did not realise how large the hype and investment would be from ChatGPT, but nevertheless this hype and investment is downstream of a liberal publishing culture which is something that can be blamed.
I agree that ChatGPT was positive for AI-risk awareness. However from my perspective being very happy about OpenAI's impact on x-risk does not follow from this. Releasing powerful AI models does have a counterfactual effect on the awareness of risks, but also a lot of counterfactual hype and funding (such as the vast current VC investment in AI) which is mostly pointed at general capabilities rather than safety, which from my perspective is net negative.
Given past statements I expect all lab leaders to speak on AI risk soon. However, I bring up the FLI letter not because it is an AI risk letter, but because it is explicitly about slowing AI progress, which OAI and Anthropic have not shown that much support for.
Thanks for writing this. As far as I can tell most anger about OpenAI is because i) being a top lab and pushing SOTA in a world with imperfect coordination shortens timelines and ii) a large number of safety-focused employees left (mostly for Anthropic) and had likely signed NDAs. I want to highlight i) and ii) in a point about evaluating the sign of the impact of OpenAI and Anthropic.
Since Anthropic's competition seems to be exacerbating race dynamics currently (and I will note that very few OpenAI and zero Anthropic employees signed the FLI letter), it seems to me that Anthropic is making i) worse, by making coordination more difficult and worsening race dynamics. At this point, believing Anthropic is better on net than OpenAI has to go through believing *something* about the reasons individuals had for leaving OpenAI (ii)), and that these reasons outweigh the coordination and race-dynamic considerations. This is possible, but there's little public evidence for the strength of these reasons from my perspective. I'd be curious if I've missed something here.
This occurs across different architectures and datasets (https://arxiv.org/abs/2203.16634)
[from a quick skim this video+blog post doesn't mention this]
Thank you for providing this valuable synthesis! In reference to:
4. It is a form of movement building.
Based on my personal interactions with more experienced academics, many seem to view the objectives of mechanistic interpretability as overly ambitious (see Footnote 3 in https://distill.pub/2020/circuits/zoom-in/ as an example). This perception may deter them from engaging in interpretability research. In general, it appears that advancements in capabilities are easier to achieve than alignment improvements. Together with the emphasis on researcher productivity in the ML field, as measured by factors like h-index, academics are incentivized to select more promising research areas.
By publishing mechanistic interpretability work, I think the perception of the field's difficulty can be changed for the better, thereby increasing the safety/capabilities ratio of the ML community's output. As the original post acknowledges, this approach could have negative consequences for various reasons.
Which sorts of works are you referring to on Chris Olah's blog? I see mostly vision interpretability work (which has not helped with vision capabilities), RNN stuff (which essentially does not help capabilities because of transformers) and one article on back-prop, which is more engineering-adjacent but probably replaceable (I've seen pretty similar explanations in at least one publicly available Stanford course).
Is bullet point one true, or is there a condition that I'm not assuming? E.g. if $V$ is the constant $0$ random variable and $X$ is $N(0, 1)$, then the limit result holds, but a Gaussian is neither heavy- nor long-tailed.
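For reference, the definition of heavy-tailed I'm using (the standard one, not taken from the post):

```latex
% A distribution is heavy-tailed if its right tail decays slower than every exponential:
% for all t > 0, e^{tx} \Pr(X > x) \to \infty as x \to \infty.
% The Gaussian fails this: \Pr(X > x) \le e^{-x^2/2} for X \sim N(0,1), so
% e^{tx} \Pr(X > x) \to 0 for every fixed t. Hence V + X = X in the example
% above is neither heavy- nor long-tailed.
\lim_{x \to \infty} e^{t x}\, \Pr(X > x) = \infty \quad \text{for all } t > 0
```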
Are you familiar with the Sparks of AGI folks' work on a practical example of GPT-4 asserting early errors in CoT are typos (timestamped link)? Pretty good prediction if not.
That's true, I think the pretraining gradients training choice probably has more effect on the end model than the overfitting SFT model they start PPO with.
Huh, but Mysteries of mode collapse (and the update) were published before td-003 was released? How would you have ended up reading a post claiming td-002 was RLHF-trained when td-003 existed?
Meta note: it's plausibly net positive that all the training details of these models have been obfuscated, but it's frustrating how much energy has been sunk into speculation on The Way Things Work Inside OpenAI.
I wasn't trying to say the mode collapse results were wrong! I collected these results before finding crisper examples of mode collapse that I could build a useful interpretability project on. I also agree with the remarks made about the difficulty of measuring this phenomenon. I indeed tried to use the OpenAI embeddings model to encode the various completions and then hopefully have the Euclidean distance be informative, but it seemed to predict large distances for similar completions so I gave up. I also made a consistent color scheme and compared code-davinci, thanks for those suggestions.
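Roughly what I tried (a minimal sketch, assuming the current openai Python client and the text-embedding-ada-002 model; not the exact script):

```python
# Minimal sketch of the embedding-distance idea mentioned above (assumptions:
# the openai Python client and the text-embedding-ada-002 model).
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def pairwise_euclidean(completions: list[str]) -> np.ndarray:
    """Embed each completion and return the matrix of pairwise Euclidean distances."""
    resp = client.embeddings.create(model="text-embedding-ada-002", input=completions)
    vecs = np.array([d.embedding for d in resp.data])
    diffs = vecs[:, None, :] - vecs[None, :, :]
    return np.linalg.norm(diffs, axis=-1)

completions = [
    "I am a helpful assistant.",
    "I'm a helpful assistant!",
    "Bananas are yellow.",
]
print(pairwise_euclidean(completions))
```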
I don't get the impression that RLHF needs hacks to prevent mode collapse: the InstructGPT paper reports overfitting leading to better human-rater feedback, and the Anthropic HH paper mentions in passing that the KL penalty may be wholly irrelevant (!).
I'm not sure how to interpret the evidence from your first paragraph. You suggest that td-003 mode collapses where td-002 is perfectly capable. So you believe that both td-002 and td-003 mode collapse, in disjoint cases (given the examples from the original mode collapse post)?
Thanks - this doesn't seem to change observations much, except there doesn't seem to be a case where this model has starkly the lowest entropy, as we found with davinci
EDIT: I added code-davinci-002 as the main focus of the post, thanks!
its top singular vector encodes what we think are the *least* frequent tokens.
I spot "GoldMagikarp" and "Skydragon" - we now know these are indeed very infrequent tokens! This was good evidence for SolidGoldMagikarp lurking in plain sight : )
I think this point was really overstated. I get the impression the rejected papers were basically converted into arXiv format as fast as possible, and so it was easy for the mods to tell this. However, I've seen submissions to cs.LG like this and this that are clearly from the alignment community. These posts are also not stellar by the standards of preprint formatting, and were not rejected, apparently.
I had the same reaction, that the statement of this effect was a bizarre framing. @afspies's comment was helpful - I don't think the claim is as bizarre now.
(though overall I don't think this post is a useful contribution because it is more likely to confuse than to shed light on LMs)
I meant your first point.
Regarding the claim that finetuning on data with property $P$ will lead models to 'understand' (scare quotes omitted from now on...) both $P$ and not-$P$ better: thanks, I see better where the post is coming from.
However, I don't necessarily think that we get the easier elicitation of not-$P$. There are reasons to believe finetuning is simply re-steering the base model and not changing its understanding at all. For example, there are far more training steps in pretraining vs. finetuning. Even if finetuning is shaping a model's understanding of $P$, in an RLHF setup you're generally seeing two responses, one with less $P$ and one with more $P$, and I'm not sure I buy that the model's inclination to output not-$P$ responses can increase given there are no gradients from not-$P$ cases. There are such gradients in red-teaming setups though, and I think the author should register predictions in advance and then blind-test various base models and finetuned models for the Waluigi Effect.
It is open sourced here and there is material from REMIX to get used to the codebase here
The Waluigi Effect: After you train an LLM to satisfy a desirable property $P$, then it's easier to elicit the chatbot into satisfying the exact opposite of property $P$.
I've tried several times to engage with this claim, but it remains dubious to me and I didn't find the croissant example enlightening.
Firstly, I think there is only weak evidence that training on properties makes opposite behavior easier to elicit. I believe this claim is largely based on the Bing Chat story, which may have these properties due to bad finetuning rather than because these finetuning methods cause the Waluigi effect. I think ChatGPT is an example of finetuning making these models more robust to prompt attacks (example).
Secondly (and relatedly) I don't think this article does enough to disentangle the effect of capability gains from the Waluigi effect. As models become more capable both in pretraining (understanding subtleties in language better) and in finetuning (lowering the barrier of entry for the prompting required to get useful outputs), they will get better at being jailbroken by stranger prompts.
Does the “ground truth” show the correct label function on 100% of the training and test data? If so, what's the relevance of the transformer which imperfectly implements the label function?
This is because of tokenization. A tutorial about BPE (which OpenAI uses) is here. Specifically in this case:
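A quick way to inspect the splits yourself (a minimal sketch using the tiktoken library; the strings below are illustrative, not the exact ones from this thread):

```python
# Minimal sketch for inspecting BPE token splits (illustrative strings only;
# GPT-3-era models use the "r50k_base" encoding in the tiktoken library).
import tiktoken

enc = tiktoken.get_encoding("r50k_base")

def show_tokens(text: str) -> None:
    """Print the BPE pieces that `text` is split into."""
    ids = enc.encode(text)
    print(f"{text!r} -> {[enc.decode([i]) for i in ids]}")

show_tokens(" the")                            # common word: usually a single token
show_tokens(" antidisestablishmentarianism")   # rare word: several sub-word pieces
```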

It seems like a cached speech from him; he echoed the same words at the Oxford Union earlier this month. I'm unsure how much to update on this. He constantly pauses and is occasionally inflammatory, so my impression was that he was measuring his words carefully for the audience.
You can observe "double descent" of test loss curves in the grokking setting, and there is "grokking" of test set performance as model dimension is increased, as this paper points out
This post implies that we may be able to extrapolate the log-prob likelihood of deception, situational awareness, or power-seeking behavior in future models.
Do you expect that models' stated desires for these behaviors are the metrics that we should measure? Or that when models become more powerful, the metric to use will be obvious? I generally agree with the observation that capabilities tend to improve more smoothly than early alignment work predicted, but I am skeptical that we will know what is relevant to measure.
The criticism here implies that one of the most important factors in modelling the end of Moore's law is whether we're running out of ideas (which the poster thinks is true). Do you think your models capture the availability of new ideas?
On one hand, Wikipedia suggests Jewish astronomers saw the three tail stars as cubs. But at the same time, it suggests several ancient civilizations independently saw Ursa Major as a bear. I'm also confused.
I would recommend reading the top comment on https://forum.effectivealtruism.org/posts/WPEwFS5KR3LSqXd5Z/why-do-we-post-our-ai-safety-plans-on-the-internet
Yann LeCun contra instrumental convergence
EDIT: oops!
How is "The object is" -> " a" or " an" a case where models may show non-myopic behavior? Loss will depend on the prediction of " a" or " an". It will also depend on the completion of "The object is an" or "The object is a", depending on which appears in the current training sample. AFAICT the model will just optimize next token predictions, in both cases...?
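For reference, the loss I have in mind here is just the usual per-position sum (the standard autoregressive objective, not quoting the post):

```latex
% Standard autoregressive objective: one next-token cross-entropy term per position.
% The " a"/" an" prediction and each token of the subsequent completion contribute
% their own separate next-token terms.
\mathcal{L}(\theta) = -\sum_{t} \log p_\theta(x_{t+1} \mid x_{\le t})
```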
I think work that compares base language models to their fine-tuned or RLHF-trained successors seems likely to be very valuable, because i) this post highlights some concrete things that change during training in these models and ii) some believe that a lot of the risk from language models come from these further training steps.
If anyone is interested, I think surveying the various fine-tuned and base models here seems the best open-source resource, at least before CarperAI release some RLHF models.
Not sure if you're aware, but yes the model has a hidden prompt that says it is ChatGPT, and browsing is disabled.
I think the situation I'm considering in the quoted part is something like this: research is done on SGD training dynamics, and researcher X finds a new way of looking at model component Y, showing that only certain parts of it are important for performance. So they remove that part, scale the model more, and the model is better. This to me meets the definition of "why SGD works" (the model uses the Y components to achieve low loss).
I think interpretability that finds ways models represent information (especially across models) is valuable, but this feels different from "why SGD works".
To me, the label "Science of DL" is far more broad than interpretability. However, I was claiming that the general goal of Science of DL is not neglected (see my middle paragraph).