It would be really great to have human baselines, but it’s very hard to do in practice. For a human to do one of these tasks would take several hours.
My guess is it's <1 hour per task assuming just Copilot access, and much less if you're allowed to use e.g. o1 + Cursor in agent mode. That being said, I think you'd want to limit humans to comparable amounts of compute for the numbers to be comparable, which seems a bit trickier to make happen.
I don’t really have any funding for this project, but I might find someone that wants to do one task for fun, or do my best effort myself on a fresh task when I make one.
Is the reason you can't do one of the existing tasks, just to get a sense of the difficulty?
Makes sense, thanks!
For compute I'm using hardware we have locally with my employer, so I have not tracked what the equivalent cost of renting it would be, but I guess it would be of the same order of magnitude as the API costs, or a factor of a few larger.
It's hard to say because I'm not even sure you can rent Titan Vs at this point,[1] and I don't know what your GPU utilization looks like, but I suspect API costs will dominate.
An H100 box is approximately $2/hour/GPU and A100 boxes are a fair bit under $1/hour (see e.g. pricing on Vast AI or Shadeform). And even A100s are ridiculously better than a Titan V, in that they have 40 or 80 GB of memory and are (pulling a number out of thin air) 4-5x faster.
So if o1 costs $2 per task and it's 15 minutes per task, compute will be an order of magnitude cheaper. (Though as for all similar evals, the main cost will be engineering effort from humans.)
- ^
I failed to find an option to rent them online, and I suspect the best way I can acquire them is by going to UC Berkeley and digging around in old compute hardware.
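The back-of-the-envelope above can be written out explicitly. All numbers are the assumptions stated in the thread (per-task API cost, per-task time, rental rates), not measurements:

```python
# Back-of-the-envelope from the thread (assumed figures, not measurements).
api_cost_per_task = 2.00   # $ per task in o1 API calls (assumed)
task_hours = 15 / 60       # 15 minutes of GPU time per task (assumed)

# Per-task GPU cost and API-to-GPU cost ratio at two rental rates.
for label, gpu_rate in [("H100", 2.00), ("A100", 1.00)]:
    gpu_cost = gpu_rate * task_hours
    print(label, gpu_cost, api_cost_per_task / gpu_cost)
# H100 0.5 4.0
# A100 0.25 8.0
```

So under these assumptions compute is roughly 4-8x cheaper than the API calls per task.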
This is really impressive -- could I ask how long this project took, how long each eval takes to run on average, and what you spent on compute/API credits?
(Also, I found the preliminary BoK vs 5-iteration results especially interesting, especially the speculation on reasoning models.)
(Disclaimer: have not read the piece in full)
If “reasoning models” count as a breakthrough of the relevant size, then I argue that there’s been quite a few of these in the last 10 years: skip connections/residual stream (2015-ish), transformers instead of RNNs (2017), RLHF/modern policy gradient methods (2017ish), scaling hypothesis (2016-20 depending on the person and which paper), Chain of Thought (2022), massive MLP MoEs (2023-4), and now Reasoning RL training (2024).
I think the title greatly undersells the importance of these statements/beliefs. (I would've preferred either part of your quote or a call to action.)
I'm glad that Sam is putting in writing what many people talk about. People should read these statements and take them seriously.
Nit:
> OpenAI presented o3 on the Friday before Thanksgiving, at the tail end of the 12 Days of Shipmas.
Should this say Christmas?
I think writing this post was helpful to me in thinking through my career options. I've also been told by others that the post was quite valuable to them as an example of someone thinking through their career options.
Interestingly, I left METR (then ARC Evals) about a month and a half after this post was published. (I continued to be involved with the LTFF.) I then rejoined METR in August 2024. In between, I worked on ambitious mech interp and did some late stage project management and paper writing (including some for METR). I also organized a mech interp workshop at ICML 2024, which, if you squint, counts as "onboarding senior academics".
I think leaving METR was a mistake ex post, even if it made sense ex ante. I think my ideas around mech interp when I wrote this post weren't that great, even if I thought the projects I ended up working on were interesting (see e.g. Compact Proofs and Computation in Superposition). While the mech interp workshop was very well attended (e.g. the room was so crowded that people couldn't get in due to fire code) and pretty well received, I'm not sure how much value it ended up producing for AIS. Also, I think I was undervaluing the resources available to METR as well as how much I could do at METR.
If I were to make a list for myself in 2023 using what I know now, I'd probably have replaced "onboarding senior academics" with "get involved in AI policy via the AISIs", and instead of "writing blog posts or takes in general", I'd have the option of "build common knowledge in AIS via pedagogical posts". Though realistically, knowing what I know now, I'd have told my past self to try to better leverage my position at METR (and provided him with a list of projects to do at METR) instead of leaving.
Also, I regret both that I called it "ambitious mech interp", and that this post became the primary reference for what this term meant. I should've used a more value-neutral name such as "rigorous model internals" and written up a separate post describing it.
I think this post made an important point that's still relevant to this day.
If anything, this post is more relevant in late 2024 than in early 2023, as the pace of AI makes ever more people want to be involved, while more and more mentors have moved towards doing object-level work. Due to the relative reduction of capacity for evaluating new AIS researchers, there's more reliance on systems or heuristics to evaluate people now than in early 2023.
Also, I find it amusing that without the parenthetical, the title of the post makes another important point: "evals are noisy".
I think this post was useful in the context it was written in and has held up relatively well. However, I wouldn't actively recommend it to anyone as of Dec 2024 -- both because the ethos of the AIS community has shifted, making posts like this less necessary, and because many other "how to do research" posts were written that contain the same advice.
Background
This post was inspired by conversations I had in mid-late 2022 with MATS mentees, REMIX participants, and various bright young people who were coming to the Bay to work on AIS (collectively, "kiddos"). The median kiddo I spoke with had read a small number of ML papers and a medium amount of LW/AF content, and was trying to string together an ambitious research project from several research ideas they recently learned about. (Or, sometimes they were assigned such a project by their mentors in MATS or REMIX.)
Unfortunately, I don't think modern machine learning is the kind of field where research consistently works out of the box. Many high-level claims even in published research papers are just... wrong, it can be challenging to reproduce results even when they are right, and even techniques that work reliably may not work for the reasons people think they do.
Hence, this post.
What do I think of the content of the post?
I think the core idea of this post held up pretty well with time. I continue to think that making contact with reality is very important, and I think the concrete suggestions for how to make contact with reality are still pretty good.
If I were to write it today, I'd probably add a fifth major reason for why it's important to make quick contact with reality: mental health/motivation. That is, producing concrete research outputs, even small ones, feels pretty essential to maintaining motivation for the vast majority of researchers. My guess is I missed this factor because I focused on the content of research projects, as opposed to the people doing the research.
Where do I feel the post stands now?
Over the past two years, and especially in 2024, the ethos of the AIS community has shifted substantially toward empirical work.
The biggest part of this is because of the pace of AI. When this post was written, ChatGPT was a month old, and GPT-4 was still more than 2 months away. People both had longer timelines and thought of AIS in more conceptual terms. Many conceptual research projects of 2022 have fallen into the realm of the empirical as of late 2024.
Part of this is due to the rise of (dangerous capability) evals as a major AIS focus in 2023, which is both substantially more empirical than the median 2022 AIS research topic, and an area where making contact with reality can be as simple as "pasting a prompt into claude.ai".
Part of this is due to Anthropic's rise to being the central place for AIS researchers. "Being able to quickly produce ML results" is a major part of what it takes to get hired there as a junior researcher, and people know this.
Finally, there have been a decent number of posts or write-ups giving the same advice, e.g. Neel's written advice for his MATS scholars and a recent Alignment Forum post by Ethan Perez.
As a result, this post feels much less necessary or relevant in late December 2024 than in December 2022.
Evan joined Anthropic in late 2022 no? (Eg his post announcing it was Jan 2023 https://www.alignmentforum.org/posts/7jn5aDadcMH6sFeJe/why-i-m-joining-anthropic)
I think you’re correct on the timeline. I remember Jade/Jan proposing DC evals in April 2022 (which was novel to me at the time), and Beth starting METR in June 2022, and I don’t remember there being such teams actually doing work (at least not publicly known) when she pitched me on joining in August 2022.
It seems plausible that Anthropic’s scaling laws project was already underway before then (and this is what they’re referring to), but proliferating QA datasets feels qualitatively different from DC evals. Also, they were definitely doing other red teaming, just none that seems to be DC evals.
Otherwise, we could easily in the future release a model that is actually (without loss of generality) High in Cybersecurity or Model Autonomy, or much stronger at assisting with AI R&D, with only modest adjustments, without realizing that we are doing this. That could be a large or even fatal mistake, especially if circumstances would not allow the mistake to be taken back. We need to fix this.
[..]
This is a lower bound, not an upper bound. But what you need, when determining whether a model is safe, is an upper bound! So what do we do?
Part of the problem is the classic problem with model evaluations: elicitation efforts, by default, only ever provide existence proofs and rarely if ever provide completeness proofs. A prompt that causes the model to achieve a task provides strong evidence of model capability, but the space of reasonable prompts is far too vast to search exhaustively to truly demonstrate model incapability. Model incapability arguments generally rely on an implicit "we've tried as hard at elicitation as would be feasible post deployment", but this is almost certainly not going to be the case, given the scale of pre-deployment evaluations vs post-deployment use cases.
The way you get a reasonable upper bound pre-deployment is by providing pre-deployment evaluators with some advantage over end-users, for example by using a model that's not refusal trained or by allowing for small amounts of finetuning. OpenAI did do this in their original preparedness team bio evals; specifically, they provided experts with non-refusal fine-tuned models. But it's quite rare to see substantial advantages given to pre-deployment evaluators for a variety of practical and economic reasons, and in-house usage likely predates pre-deployment capability/safety evaluations anyway.
Re: the METR evaluations on o1.
We'll be releasing more details of our evaluations of the o1 model we evaluated, in the same style as our blog posts for o1-preview and Claude 3.5 Sonnet (Old). This includes both more details on the general autonomy capability evaluations as well as AI R&D results on RE-Bench.
Whereas the METR evaluation, presumably using final o1, was rather scary.
[..]
From the performance they got, I assume they were working with the full o1, but from the wording it is unclear that they got access to o1 pro?
Our evaluations were not on the released o1 (nor o1-pro); instead, we were provided with an earlier checkpoint of o1 (this is in the system card as well). You're correct that we were working with a variant of o1 and not o1-mini, though.
If 70% of all observed failures are essentially spurious, then removing even some of those would be a big leap – and if you don’t even know how the tool-use formats work and that’s causing the failures, then that’s super easy to fix.
While I agree with the overall point (o1 is a very capable model whose capabilities are hard to upper-bound), our criterion for "spurious" is rather broad, and includes many issues that we don't expect to be super easy to fix with only small scaffolding changes. In experiments with previous models, I'd say 50% of issues we classify as spurious are fixable with small amounts of effort.
Which is all to say, this may look disappointing, but it is still a rather big jump after even minor tuning
Worth noting that this was similar to our experiences w/ o1-preview, where we saw substantial improvements on agentic performance with only a few days of human effort.
I am worried that issues of this type will cause systematic underestimates of the agent capabilities of new models that are tested, potentially quite large underestimates.
Broadly agree with this point -- while we haven't seen groundbreaking advancements due to better scaffolding, there have been substantial improvements to o1-preview's coding abilities post-release via agent scaffolds such as AIDE. I (personally) expect to see comparable increases for o1 and o1-pro in the coming months.
This is really good, thanks so much for writing it!
I've never heard of Whisper or Eleven labs until today, and I'm excited to try them out.
Yeah, this has been my experience using Grammarly pro as well.
I’m not disputing that they were trained with next token prediction log loss (if you read the tech reports they claim to do exactly this) — I’m just disputing the “on the internet” part, due to the use of synthetic data and private instruction following examples.
I mean, we don't know all the details, but Qwen2 was explicitly trained on synthetic data from Qwen1.5 + "high-quality multi-task instruction data". I wouldn't be surprised if the same were true of Qwen 1.5.
From the Qwen2 report:
Quality Enhancement: The filtering algorithm has been refined with additional heuristic and model-based methods, including the use of the Qwen models to filter out low-quality data. Moreover, these models are utilized to synthesize high-quality pre-training data. (Page 5)
[...]
Similar to previous Qwen models, high-quality multi-task instruction data is integrated into the Qwen2 pre-training process to enhance in-context learning and instruction-following abilities.
Similarly, Gemma 2 had its pretraining corpus filtered to remove "unwanted or unsafe utterances". From the Gemma 2 tech report:
We use the same data filtering techniques as Gemma 1. Specifically, we filter the pretraining dataset to reduce the risk of unwanted or unsafe utterances, filter out certain personal information or other sensitive data, decontaminate evaluation sets from our pre-training data mixture, and reduce the risk of recitation by minimizing the proliferation of sensitive outputs. (Page 3)
[...]
We undertook considerable safety filtering of our pre-training data to reduce the likelihood of our pre-trained and fine-tuned checkpoints producing harmful content. (Page 10)
After thinking about it more, I think the LLaMA 1 refusals strongly suggest that this is an artefact of training data. So I've unendorsed the comment above.
It's still worth noting that modern models generally have filtered pre-training datasets (if not wholly synthetic or explicitly instruction-following datasets), and it's plausible to me that this (on top of ChatGPT contamination) is a large part of why we see much better instruction following/more eloquent refusals in modern base models.
It's worth noting that there are reasons to expect the "base models" of both Gemma 2 and Qwen 1.5 to demonstrate refusals -- neither is trained on unfiltered webtext.
We don't know what 1.5 was trained on, but we do know that Qwen2's pretraining data both contains synthetic data generated by Qwen1.5, and was filtered using Qwen1.5 models. Notably, its pretraining data explicitly includes "high-quality multi-task instruction data"! From the Qwen2 report:
Quality Enhancement: The filtering algorithm has been refined with additional heuristic and model-based methods, including the use of the Qwen models to filter out low-quality data. Moreover, these models are utilized to synthesize high-quality pre-training data. (Page 5)
[...]
Similar to previous Qwen models, high-quality multi-task instruction data is integrated into the Qwen2 pre-training process to enhance in-context learning and instruction-following abilities.
I think this had a huge effect on Qwen2: Qwen2 is able to reliably follow both the Qwen1.5 chat template (as you note) as well as the "User: {Prompt}\n\nAssistant: " template. This is also reflected in their high standardized benchmark scores -- the "base" models do comparably to the instruction-finetuned ones! In other words, Qwen2 "base" models are pretty far from traditional base models a la GPT-2 or Pythia, as a result of explicit choices made when generating their pretraining data, and this explains their propensity for refusals. I wouldn't be surprised if the same were true of the 1.5 models.
I think the Gemma 2 base models were not trained on synthetic data from larger models but its pretraining dataset was also filtered to remove "unwanted or unsafe utterances". From the Gemma 2 tech report:
We use the same data filtering techniques as Gemma 1. Specifically, we filter the pretraining dataset to reduce the risk of unwanted or unsafe utterances, filter out certain personal information or other sensitive data, decontaminate evaluation sets from our pre-training data mixture, and reduce the risk of recitation by minimizing the proliferation of sensitive outputs. (Page 3)
[...]
We undertook considerable safety filtering of our pre-training data to reduce the likelihood of our pre-trained and fine-tuned checkpoints producing harmful content. (Page 10)
My guess is this filtering explains why the model refuses, moreso than (and in addition to?) ChatGPT contamination: once you remove all the "unsafe completions", refusal-style responses are much of what remains for the model to imitate on unsafe prompts.
I don't know what's going on with LLaMA 1, though.
I'm down.
Ah, you're correct, it's from the original instructGPT release in Jan 2022:
https://openai.com/index/instruction-following/
(The Anthropic paper I cited predates ChatGPT by 7 months)
Pretty sure Anthropic's early assistant stuff used the word this way too: See e.g. Bai et al https://arxiv.org/abs/2204.05862
But yes, people complained about it a lot at the time
Thanks for the summaries, I found them quite useful, and they've made me likely to read some of these books soon. The following ones are both new to me and seem worth thinking more about:
- You should judge a person's performance based on the performance of the ideal person that would hold their position
- Document every task you do more than once, as soon as you do it the second time.
- Fun is important. (yes, really)
- People should know the purpose of the organization (specifically, being able to recite a clear mission statement)
- "I’m giving you these comments because I have very high expectations and I know that you can reach them"
A question I had while reading your notes -- it seems like people fail at implementing many best practices not because they don't think the practices are good, but because of a lack of capacity. For example, there's an entire cluster of practices that basically boil down to "people do better with fast feedback":
- After a task is delegated, make sure that it's progressing as intended.
- After a task is completed (or failed), keep people accountable.
- Make sure to check in on goals in regular time intervals.
- Provide positive reinforcement immediately.
- Provide negative feedback immediately.
These require that managers be very attentive to the goings-on and constantly on top of the state of things -- but when there are other priorities, this might be pushed back. Do the books also talk about what not to do, such that you'll have the slack to implement best practices?
Also, a typo:
- Use OKRs (objectives and key results) and check if you're meeting them regularly. Switch them up often to avoid goodhearting.
goodhearting -> Goodharting
Thanks for writing this!
I think that phased testing should be used during frontier model training runs. By this, I mean a testing approach which starts off with extremely low surface area tests, and gradually increases surface area. This makes it easy to notice sudden capability gains while decreasing the likelihood that the model takes over.
I actually think the proposal is more general than just for preventing AI escapes during diverse evals -- you want to start with low surface area tests because they're cheaper anyways, and you can use the performance on low surface area tests to decide if you want to do more thorough evals.
I imagine a proper approach is something like:
- Very cheap qualification evals, such that if the model can't pass them (and isn't sandbagging) it is very unlikely to have had a capabilities jump.
- Very low-surface area safety evals -- do the model's capabilities imply that it's not safe to do the thorough evals?
- Very thorough evals for measurement -- what it says on the tin.
You'd want to run 1 + 2 all the time, and begin running 3 once the model passes 1 until it passes 2.
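A minimal sketch of the gating logic described above. All function names, score fields, and thresholds are hypothetical illustrations of the control flow, not a real eval harness:

```python
# Sketch of the three-phase gating described above. All names and score
# fields are hypothetical -- this shows the control flow, not a real harness.

def passes_qualification_evals(model):
    # Phase 1: very cheap evals run continuously; a score above the
    # (hypothetical) threshold suggests a possible capabilities jump.
    return model["qualification_score"] > 0.5

def safe_to_run_thorough_evals(model):
    # Phase 2: low-surface-area safety evals -- do the model's capabilities
    # imply it's unsafe to run the thorough evals?
    return model["safety_score"] > 0.5

def run_phased_evals(model):
    if not passes_qualification_evals(model):
        return "no capability jump suspected; skip thorough evals"
    if not safe_to_run_thorough_evals(model):
        return "capability jump suspected, but thorough evals judged unsafe"
    return "run thorough measurement evals"

print(run_phased_evals({"qualification_score": 0.2, "safety_score": 0.9}))
print(run_phased_evals({"qualification_score": 0.9, "safety_score": 0.9}))
```

The point of the ordering is that the cheap evals bound how often you pay for the expensive ones.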
Very cool work; I'm glad it was done.
That being said, I agree with Fabien that the title is a bit overstated, insofar as it's about your results in particular:
Thus, fine-tuned performance provides very little information about the best performance that would be achieved by a large number of actors fine-tuning models with random prompting schemes in parallel.
It's a general fact of ML that small changes in finetuning setup can greatly affect performance if you're not careful. In particular, it seems likely to me that the empirical details that Fabien asks for may affect your results. But this has little to do with formatting, and much more to do with the intrinsic difficulty of finetuning LLMs properly.
As shown in Fabien's password experiments, there are many ways to mess up on finetuning (including by having a bad seed), and different finetuning techniques are likely to lead to different levels of performance. (And the problem gets worse as you start using RL and not just SFT.) So it's worth being very careful about claiming that the results of any particular finetuning run upper-bound model capabilities. But it's still plausible that trying very hard on finetuning elicits capabilities more efficiently than trying very hard on prompting, for example, which I think is closer to what people mean when they say that finetuning is an upper bound on model capabilities.
Good work, I'm glad that people are exploring this empirically.
That being said, I'm not sure that these results tell us very much about whether or not the MCIS theory is correct. In fact, something like your results should hold as long as the following facts are true (even without superposition):
- Correct behavior: The model behavior is correct on distribution, and the correct behavior isn't super sensitive to many small variations to the input.
- Linear feature representations: The model encodes information along particular directions, and "reads-off" the information along these directions when deciding what to do.
If these are true, then I think the results you get follow:
- Activation plateaus: If the model's behavior changes a lot for actual on-distribution examples, then it's probably wrong, because there are lots of similar-seeming examples (which won't lead to exactly the same activation, but will lead to similar ones) where the model should behave similarly. For example, given a fixed MMLU problem and a few different sets of 5-shot examples, the activations will likely be close but won't be the same (as the inputs are similar and the relevant information for locating the task should be the same). But if the model uses the 5-shot examples to get the correct answer, its logits can't change too much as a function of the inputs.
In general, we'd expect to see plateaus around any real examples, because the correct behavior doesn't change that much as a function of small variations to the input, and the model performs well. In contrast, for activations that are very off distribution for the model, there is no real reason for the model to remain consistent across small perturbations.
- Sensitive directions: Most directions in high-dimensional space are near-orthogonal, so by default random small perturbations don't change the read-off along any particular direction by very much. But if you perturb the activation along some of the read-off directions, then this will indeed change the magnitude along each of these directions a lot!
- Local optima in sensitivity: Same explanation as with sensitive directions.
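The near-orthogonality point can be checked with a toy numpy calculation (the dimension and threshold are illustrative, not from the post): a unit perturbation in a random direction moves the read-off by only ~1/sqrt(d), while a perturbation along the read-off direction itself moves it by the full step size.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4096  # illustrative residual-stream dimension

readoff = rng.standard_normal(d)
readoff /= np.linalg.norm(readoff)  # a fixed unit "read-off" direction

# A random unit perturbation barely moves the read-off (typical size ~1/sqrt(d)).
rand = rng.standard_normal(d)
rand /= np.linalg.norm(rand)
random_change = abs(rand @ readoff)

# A unit perturbation along the read-off direction moves it by the full step.
aligned_change = abs(readoff @ readoff)

print(random_change < 0.1)       # True: near-orthogonal, tiny effect
print(round(aligned_change, 6))  # 1.0
```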
Note that we don't need superposition to explain any of these results. So I don't think these results really support one model of superposition over the other, given they seem to follow from a combination of the model behaving correctly and the linear representation hypothesis.
Instead, I see your results as primarily a sanity-check of your techniques for measuring activation plateaus and for measuring sensitivity to directions, as opposed to weighing in on particular theories of superposition. I'd be interested in seeing the techniques applied to other tasks, such as validating the correctness of SAE features.
This also continues the trend of OAI adding highly credentialed people who notably do not have technical AI/ML knowledge to the board.
Have you tried instead 'skinny' NNs with a bias towards depth,
I haven't -- the problem with skinny NNs is stacking MLP layers quickly makes things uninterpretable, and my attempts to reproduce slingshot -> grokking were done with the hope of interpreting the model before/after the slingshots.
That being said, you're probably correct that having more layers does seem related to slingshots.
(Particularly for MLPs, which are notorious for overfitting due to their power.)
What do you mean by power here?
70b storing 6b bits of pure memorized info seems quite reasonable to me, maybe a bit high. My guess is there's a lot more structure to the world that the models exploit to "know" more things with fewer memorized bits, but this is a pretty low confidence take (and perhaps we disagree on what "memorized info" means here). That being said, SAEs as currently conceived/evaluated won't be able to find/respect a lot of the structure, so maybe 500M features is also reasonable.
I don't think SAEs will actually work at this level of sparsity though, so this is mostly beside the point.
I agree that SAEs don't work at this level of sparsity and I'm skeptical of the view myself. But from a "scale up SAEs to get all features" perspective, it sure seems pretty plausible to me that you need a lot more features than people used to look at.
I also don't think the Anthropic paper OP is talking about has come close to the Pareto frontier for size <> sparsity <> trainability.
On the surface, their strategy seems absurd. They think doom is ~99% likely, so they're going to try to shut it all down - stop AGI research entirely. They know that this probably won't work; it's just the least-doomed strategy in their world model. It's playing to the outs, or dying with dignity.
The weird thing here is that their >90% doom disagrees with almost everyone else who thinks seriously about AGI risk. You can dismiss a lot of people as not having grappled with the most serious arguments for alignment difficulty, but relative long-timers like Rohin Shah and Paul Christiano definitely have. People of that nature tend to have higher p(doom) estimates than optimists who are newer to the game and think more about current deep nets, but much lower than MIRI leadership.
For what it's worth, I don't have anywhere near ~99% P(doom), but am also in favor of a (globally enforced, hardware-inclusive) AGI scaling pause (depending on details, of course). I'm not sure about Paul or Rohin's current takes, but lots of people around me are also in favor of this, including many other people who fall squarely into the non-MIRI camp with P(doom) as low as ~10-20%.
But I was quietly surprised by how many features they were using in their sparse autoencoders (respectively 1M, 4M, or 34M). Assuming Claude Sonnet has the same architecture of GPT-3, its residual stream has dimension 12K so the feature ratios are 83x, 333x, and 2833x, respectively[1]. In contrast, my team largely used a feature ratio of 2x, and Anthropic's previous work "primarily focus[ed] on a more modest 8× expansion". It does make sense to look for a lot of features, but this seemed to be worth mentioning.
There's both theoretical work (i.e. this theory work) and empirical experiments (e.g. in memorization) demonstrating that models seem to be able to "know" O(quadratically) many things, in the size of their residual stream.[1] My guess is Sonnet is closer to Llama-70b in size (residual stream width ~8.2k), so this suggests ~67M features naively, and also that 34M is reasonable.
Also worth noting that a lot of their 34M features were dead, so the number of actual features is quite a bit lower.
- ^
You might also expect to need O(Param) params to recover the features, so for a 70B model with residual stream width 8.2k you want 8.5M (~=70B/8192) features.
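The two heuristics above reduce to simple arithmetic (the residual stream width and parameter count are the assumed Llama-70b-scale figures from the comment):

```python
# Back-of-the-envelope for the feature counts above (widths are assumptions).
d_model = 8192      # assumed residual stream width (Llama-70b scale)
n_params = 70e9     # assumed total parameter count

quadratic_features = d_model ** 2    # "knows O(d^2) things" heuristic
param_features = n_params / d_model  # O(Param) heuristic from the footnote

print(quadratic_features)     # 67108864  (~67M)
print(round(param_features))  # 8544922  (~8.5M)
```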
Worth noting that both some of Anthropic's results and Lauren Greenspan's results here (assuming I understand her results correctly) give a clear demonstration of learned (even very toy) transformers not being well-modeled as sets of skip trigrams.
I'm having a bit of difficulty understanding the exact task/set up of this post, and so I have a few questions.
Here's a summary of your post as I understand it:
- In Anthropic's Toy Model of Attention Head "Superposition",[1] they consider a task where the model needs to use interference between heads to implement multiple skip trigrams. In particular, they call this task "OV-incoherent", because the OV seems to need to use information "not present" in the V of the source token. (This was incorrect, because you can implement their task perfectly using only a copying head and a negative copying head.) They call this attention head superposition because they didn't understand the algorithm, and so mistakenly thought they needed more attention heads than you actually need to implement the task (to their credit, they point out their mistake in their July 2023 update, and give the two-head construction).
- In this work, you propose a model of "OV-coherent" superposition, where the OV still needs to use information "not present" at the attended-to location, and which also requires more skip trigrams than attention heads to implement. Namely, you consider learning sequences of the form [A] ... [B] ... [Readoff] -> [C], which cannot naturally be implemented via skip trigrams (and instead need to be implemented via what Neel calls hierarchical skip trigrams, or what I normally call just "interference").
- You construct your sequences as follows:
- There are 12 tokens for the input and 10 "output tokens". Presumably you parameterized it so that dvocab=12, and just reassigned the inputs? For the input sequence, you use one token [0] as the read-off token, 4 tokens [1-4] as signal tokens, and the rest [5-11] as noise tokens.
- In general, you don't bother training the model above all tokens except for above the read-off [0] (I think it's more likely you trained it to be uniform on other tokens, actually. But at most this just rescales the OV and QK circuits (EVOU and EQKE respectively), and so we can ignore it when analyzing the attention heads).
- Above the read-off, you train the model to minimize cross entropy loss, using the labels:
- 0 -> 1,1 present in sequence
- 1 -> 1, 2 present in sequence
- ...
- 8 -> 3, 4 present in sequence
- 9 -> 4, 4 present in sequence
- So for example, if you see the sequence [5] [1] [2] [11] [0], the model should assign a high logit to [1], if you see the sequence [7] [4] [10] [3] [0], the model should assign a high logit to [8], etc.
- You find that models can indeed learn your sequences of the form [A] ... [B] ... [Readoff]-> [C], often by implementing constructive interference "between skip-trigrams" both between the two attention heads and within each single head.
- Specifically, in your mainline model in the post, head 1 implements something like the following algorithm:
- Attend to tokens in [1-4], but attend to token [1] the most, then [4], then [3], then [2]. Call this the order of head 1.
- The head increases the logits corresponding to pairs containing the tokens it attends to, except for the pairs that contain tokens higher in the order. That is, when attending to token [1], it increases the logits for outputs [0-3] (corresponding to the logits indicating that there's a 1 present in the sequence) and decreases the logits for outputs [4-9] (all other logits). Similarly, when attending to 4, it increases the logits for outputs [6], [8], and [9] (corresponding to logits indicating that there's a 4 present but not a 1). When attending to 3, it increases logits [5] and [7] (there's a 3 but not a 1 or 4), and when attending to 2, it increases logits [2], [3], [4]. In fact, it increases the logits of the tokens it attends to less by a larger amount, which partially compensates for the smaller attention those tokens receive.
- So on the sequence [7] [4] [10] [3] [0], head 1 will increase the logits for [6], [8], [9] a lot and [5] and [7] a little, while suppressing all other logits.
- Head 0 implements the same algorithm, but attends in order [2], [3], [4], [1] (the reverse of head 1).
- That being said, it's a lot less clean in terms of what it outputs, e.g. it slightly increases logits [7-9] if it sees a 1. This is probably for error correction/calibration reasons: increasing logits [7-9] helps cancel out the strong bias of head 1 toward suppressing the logits of [5-9].
- On the sequence [7] [4] [10] [3] [0], head 0 increases the logits for [2], [7], [8] a lot and [3] and [9] a little.
- Adding together the two heads causes them to output the correct answer.
- On the sequence [7] [4] [10] [3] [0], since both heads increase logit [8] a lot, and increase the other logits only a little, the model outputs [8] (corresponding to 3, 4 being in sequence).
- You conclude that this is an example of a different kind of "attention head superposition", because this task is implemented across two attention heads, even though it takes 10 skip trigrams to naively implement this task.
Questions/comments:
- I'm not sure my understanding of the task is correct, does the description above seem right to you?
- Assuming the description above is correct, it seems that there's an easy algorithm for implementing this with one head.
- When you see a token, increase the logits corresponding to pairs containing that token. Then, attend to all tokens in [1-4] uniformly.
- You can explain this with skip-bigrams -- the model needs to implement the 16 skip bigrams mapping each of 4 tokens to the 4 logits corresponding to a pair containing the token.
- You need a slight correction to handle the case where there are two repeated tokens, so you in fact want to increase the logits non-uniformly, so as to assign slightly higher logits to the pair containing the attended to token twice.
- though, if you trained the model to be uniform on all tokens except for [0], it'll need to check for [0] when deciding to output a non-uniform logit and move this information from other tokens, so it needs to stash its "bigrams" in EVOU and not EU
- It's pretty easy to implement 16 skip-bigrams in a matrix of size 4 x 10 (you only need 16 non-zero entries out of 40 total entries). You want EVOU to look something like:
3 2 2 2 0 0 0 0 0 0
0 2 0 0 3 2 2 0 0 0
0 0 2 0 0 2 0 3 2 0
0 0 0 2 0 0 2 0 2 3
Then with EQKE uniform on [1-4] and 0 otherwise, the output of the head (attention-weighted EVOU) in cases where there are two different tokens in the input will be 4 on the true logit, 2 or 3 on the logits for pairs containing one of the tokens but not the other, and 0 on the other outputs. In cases where the same token appears twice, you get 6 on the true logit, 4 on the three other pairs containing the token once, and 0 otherwise.[2] You can then scale EVOU upwards to decrease loss until weight decay kicks in.
- In your case, you have two EVOUs of size 4 x 10 but which are constrained to be rank 5 due to d_head=5. This is part of why the model wants to split the computation evenly across both heads.
- From eyeballing, adding together the two EVOUs indeed produces something akin to the above diagram.
- Given you split the computation and the fact that EVOU being rank 5 for each head introduces non-zero bias/noise, you want the two heads to have opposite biases/noise terms such that they cancel out. This is why you see one head specializing in copying over 1, then 4, then 3, then 2, and the other 2 3 4 1.
- This also explains your observation: "We were also surprised that this problem can be solved with one head, as long as d_head >= 4. Intuitively, once a head has enough dimensions to store every "interesting" token orthogonally, its OV circuit can simply learn to map each of these basis vectors to the corresponding completions."
- It makes sense why d_head >= 4 is required here, because you definitely cannot implement anything approaching the above EVOU with a rank 3 matrix (since you can't even "tell apart" the 4 input tokens). Presumably the model can learn low-rank approximations of the above EVOU, though I don't know how to construct them by hand.
- So it seems to me that, if my understanding is correct, this is also not an example of "true" superposition, in the sense I distinguish here: https://www.lesswrong.com/posts/8EyCQKuWo6swZpagS/superposition-is-not-just-neuron-polysemanticity
- What exactly do you mean by superposition?
- It feels like you're using the term interchangeably with "polysemanticity" or "decomposability". But part of the challenge of superposition is that there are more sparse "things" the model wants to compute or store than it has "dimensions"/"components", which means there's no linear transformation of the input space that recovers all the features. This is meaningfully distinct from the case where the model wants to represent one thing across multiple components/dimensions for error correction or other computational efficiency reasons (i.e. see example 1 here), which are generally easier to handle using linear algebra techniques.
- It feels like you're claiming superposition because there are more skip trigrams than n_heads, is there a different kind of superposition I'm missing here?
- I think your example in the post is not an example of superposition in the traditional sense (again assuming that my interpretation is correct), and is in fact not even true polysemanticity. Instead of each head representing >1 feature, the low-rank nature of your heads means that each head basically has to represent 0.5 features.
- The example in the post is an example of superposition of skip trigrams, but it's pretty easy to construct toy examples where this is unavoidable -- would you consider any task that can't be represented with <= n_heads skip trigrams to be an example of superposition?
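If it helps, here's a quick numerical check of the hand-constructed EVOU above (a sketch in numpy; the output-label ordering 0 -> (1,1), 1 -> (1,2), ..., 9 -> (4,4) is assumed from my reading of the post):

```python
import numpy as np
from itertools import combinations_with_replacement

# Hand-constructed EVOU from the comment: rows = signal tokens 1-4,
# columns = the 10 output logits, ordered (1,1),(1,2),(1,3),(1,4),
# (2,2),(2,3),(2,4),(3,3),(3,4),(4,4).
EVOU = np.array([
    [3, 2, 2, 2, 0, 0, 0, 0, 0, 0],
    [0, 2, 0, 0, 3, 2, 2, 0, 0, 0],
    [0, 0, 2, 0, 0, 2, 0, 3, 2, 0],
    [0, 0, 0, 2, 0, 0, 2, 0, 2, 3],
])
pairs = list(combinations_with_replacement(range(1, 5), 2))  # label -> (a, b)

assert np.linalg.matrix_rank(EVOU) == 4  # consistent with needing d_head >= 4

# With attention uniform over the two signal positions, the head's output is
# proportional to the sum of the two attended tokens' EVOU rows.
for label, (a, b) in enumerate(pairs):
    out = EVOU[a - 1] + EVOU[b - 1]
    assert out.argmax() == label  # the true pair always gets the top logit
    if a != b:
        assert out[label] == 4    # 4 on the true logit for distinct tokens
    else:
        assert out[label] == 6    # 6 when the same token appears twice
print("construction checks out")
```

(The argmax check is the substantive one: the true pair's logit strictly dominates every competing pair's logit in all 10 cases.)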
Some nitpicks:
- What is "nan" in the EVOU figure (in the chapter "OV circuit behaviour")? I presume this is the (log-)sum(-exp) of the logits corresponding to outputs [9] and [10]?
- It's worth noting (I'm pretty sure, though I haven't sat down to write the proof) that since softmax attention is a non-polynomial function of its inputs, 1-layer transformers with an unbounded number of heads can implement arbitrary functions of the inputs. On the other hand, skip n-grams for any fixed n are obviously not universal (e.g. they can't implement XOR, as in the example of the "1-layer transformers =/= skip trigrams" post). So even theoretically (without constructing any examples), it seems unlikely that you should think of 1L transformers as only skip trigrams, though whether this occurs often in real networks is an empirical question (to which I'm pretty sure the answer is yes, because e.g. copy suppression heads are a common motif).
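As a minimal illustration of the XOR point (a sketch; the additive form is my framing, not something from the post): a sum of skip-bigram contributions is additive across tokens, f(a, b) = g(a) + h(b), and no additive model can fit XOR:

```python
import numpy as np

# Skip-bigram-style models produce outputs additive in per-token contributions:
# f(a, b) = g(a) + h(b). XOR is the classic function with no such decomposition.
# Fit the best additive model to XOR by least squares and inspect the residual.
X = np.array([  # one-hot features for g(a) (first 2 cols) and h(b) (last 2 cols)
    [1, 0, 1, 0],  # a=0, b=0
    [1, 0, 0, 1],  # a=0, b=1
    [0, 1, 1, 0],  # a=1, b=0
    [0, 1, 0, 1],  # a=1, b=1
])
y = np.array([0.0, 1.0, 1.0, 0.0])  # XOR(a, b)

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
pred = X @ coef
print(pred)                     # best additive fit is ~0.5 on every input
print(np.sum((pred - y) ** 2))  # squared error ~1.0: XOR is out of reach
```

The best additive approximation is flat at 0.5, i.e. it carries no information about the XOR output at all.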
- ^
Scare quotes are here because their example is really disanalogous to MLP superposition. IE as they point out in their second post, their task is well thought of as naturally being decomposed into two attention heads; and a model that has n >= 2 heads isn't really "placing circuits in superposition" so much as doing a natural task decomposition that they didn't think of.
In fact, it feels like that result is a cautionary tale that just because a model implements an algorithm in a non-basis aligned manner, does not mean the model is implementing an approximate algorithm that requires exploiting near-orthogonality in high-dimensional space (the traditional kind of residual stream/MLP activation superposition), nor does it mean that the algorithm is "implementing more circuits than is feasible" (i.e. the sense that they try to construct in the May 2023 update). You might just not understand the algorithm the model is implementing!
If I were to speculate more, it seems like they were screwed over by continuing to think about one-layer attention models as sets of skip trigrams, which they are not. More poetically, if your "natural" basis isn't natural, then of course your model won't use your "natural" basis.
- ^
Note that this construction isn't optimal, in part because output tokens corresponding to the same token occurring twice occur half as often as those with two different tokens, while this construction gets lower log loss in the one-token case than in the two-distinct-token case. But the qualitative analysis carries through regardless.
Yeah, it's been a bit of a meme ("where is Ilya?"). See e.g. Gwern's comment thread here.
What does a "majority of the EA community" mean here? Does it mean that people who work at OAI (even on superalignment or preparedness) are shunned from professional EA events? Does it mean that when they ask, people tell them not to join OAI? And who counts as "in the EA community"?
I don't think it's that constructive to bar people from all or even most EA events just because they work at OAI, even if there's a decent amount of consensus people should not work there. Of course, it's fine to host events (even professional ones!) that don't invite OAI people (or Anthropic people, or METR people, or FAR AI people, etc), and they do happen. But I don't feel like barring people from EAG or e.g. Constellation just because they work at OAI would help make the case (not that there's any chance of this happening in the near term), and it would most likely backfire.
I think that currently, many people (at least in the Berkeley EA/AIS community) will tell you to not join OAI if asked. I'm not sure if they form a majority in terms of absolute numbers, but they're at least a majority in some professional circles (e.g. both most people at FAR/FAR Labs and at Lightcone/Lighthaven would probably say this). I also think many people would say that on the margin, too many people are trying to join OAI rather than other important jobs. (Due to factors like OAI paying a lot more than non-scaling lab jobs/having more legible prestige.)
Empirically, it sure seems like significantly more people around here join Anthropic than OAI, despite Anthropic being a significantly smaller company.
Though I think almost none of these people would advocate for ~0 x-risk motivated people to work at OAI, only that the marginal x-risk concerned technical person should not work at OAI.
What specific actions are you hoping for here, that would cause you to say "yes, the majority of EA people say 'it's better to not work at OAI'"?
To be honest, I would've preferred if Thomas's post started from empirical evidence (e.g. it sure seems like superforecasters and markets change a lot week on week) and then explained it in terms of the random walk/Brownian motion setup. I think the specific math details (a lot of which don't affect the qualitative result of "you do lots and lots of little updates, if there exists lots of evidence that might update you a little") are a distraction from the qualitative takeaway.
A fancier way of putting it is: the math of "your belief should satisfy conservation of expected evidence" is a description of how the beliefs of an efficient and calibrated agent should look, and examples like his suggest it's quite reasonable for these agents to do a lot of updating. But the example is not by itself necessarily a prescription for what your belief updating should feel like from the inside (as a human who is far from efficient or perfectly calibrated). I find the empirical questions of "does the math seem to apply in practice" and "therefore, should you try to update more often" (e.g., what do the best forecasters seem to do?) to be larger and more interesting than the "a priori, is this a 100% correct model" question.
Technically, the probability assigned to a hypothesis over time should be a martingale (i.e. have expected change zero); this is just a restatement of conservation of expected evidence/the law of total expectation.
The random walk model that Thomas proposes is a simple model that illustrates a more general fact. For a martingale X_t (setting X_0 = 0), the variance of X_T is equal to the sum of the variances of the individual timestep changes ΔX_t = X_{t+1} - X_t: Var(X_T) = Σ_{t=0}^{T-1} Var(ΔX_t). Under this frame, insofar as small updates contribute a large amount to the variance Var(ΔX_t) of each update, the contribution of the small updates to the credences must also be large (which in turn means you need to have a lot of them in expectation[1]).
Note that this does not require any strong assumption besides that the distribution of likely updates is such that the small updates contribute substantially to the variance. If the structure of the problem you're trying to address allows for enough small updates (relative to large ones) at each timestep, then it must allow for "enough" of these small updates in the sequence, in expectation.
While the specific +1/-1 random walk he picks is probably not what most realistic credences over time actually look like, playing around with it still helps give a sense of what exactly "conservation of expected evidence" might look/feel like. (In fact, in the dath ilan of Swimmer's medical glowfics, people do use a binary random walk to illustrate how calibrated beliefs typically evolve over time.)
Now, in terms of whether it's reasonable to model beliefs as Brownian motion (in the standard mathematical sense, not in the colloquial sense): if you suppose that there are many, many tiny independent additive updates to your credence in a hypothesis, your credence over time "should" look like Brownian motion at a large enough scale (again in the standard mathematical sense), for similar reasons as to why the sum of a bunch of independent random variables converges to a Gaussian. This doesn't imply that your belief in practice should always look like Brownian motion, any more than the CLT implies that real world observables are always Gaussian. But again, the claim Thomas makes carries through.
I also make the following analogy in my head: Bernoulli : Gaussian ~= simple random walk : Brownian motion, which I found somewhat helpful. Things irl are rarely independent/time-invariant Bernoulli or Gaussian processes, but they're mathematically convenient to work with, and are often 'good enough' for deriving qualitative insights.
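The variance decomposition is easy to sanity-check numerically (a sketch; the coin-flip martingale with random step sizes is purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# A martingale: each step is +/- s_t with equal probability, where the step
# size s_t varies over time. Var(X_T) should equal the sum of the per-step
# variances, Var(+/- s_t) = s_t^2.
T, n_paths = 50, 100_000
sizes = rng.uniform(0.0, 0.1, size=T)               # per-step sizes s_t
signs = rng.choice([-1.0, 1.0], size=(n_paths, T))  # fair coin flips
X_T = (signs * sizes).sum(axis=1)                   # endpoints of paths (X_0 = 0)

print(X_T.var())           # empirical Var(X_T) ...
print((sizes ** 2).sum())  # ... matches the sum of per-step variances
```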
- ^
Note that you need to apply something like the optional stopping theorem to go from the case of fixed T to the case of a random stopping time τ, where τ is the time you reach 0 or 1 credence and the updates stop.
Huh, that's indeed somewhat surprising if the SAE features are capturing the things that matter to CLIP (in that they reduce loss) and only those things, as opposed to "salient directions of variation in the data". I'm curious exactly what "failing to work" means -- here I think the negative result (and the exact details of said result) are arguably more interesting than a positive result would be.
The general version of this statement is something like: if your beliefs satisfy the law of total expectation, the variance of the whole process should equal the variance of all the increments involved in the process.[1] In the case of the random walk where at each step, your beliefs go up or down by 1% starting from 50% until you hit 100% or 0% -- the variance of each increment is 0.01^2 = 0.0001, and the variance of the entire process is 0.5^2 = 0.25, hence you need 0.25/0.0001 = 2500 steps in expectation. If your beliefs have probability p of going up or down by 1% at each step, and 1-p of staying the same, the variance is reduced by a factor of p, and so you need 2500/p steps.
(Indeed, something like this is the standard way to derive the expected number of steps before a random walk hits an absorbing barrier.)
Similarly, you get that if you start at 20% or 80%, you need 1600 steps in expectation, and if you start at 1% or 99%, you'll need 99 steps in expectation.
One problem with your reasoning above is that, as the 1%/99% case shows, needing 99 steps in expectation does not mean you will take 99 steps with high probability -- in this case, there's a 50% chance you need only one update before you're certain (!), there's just a tail of very long sequences. In general, the expected value of a variable need not look like its typical value.
I also think you're underrating how much the math changes when your beliefs do not come in the form of uniform updates. In the most extreme case, suppose your current 50% doom number comes from imagining that doom is uniformly distributed over the next 10 years, and zero after -- then the median update size per week is only 0.5/520 ~= 0.096%/week, and the expected number of weeks with a >1% update is 0.5 (it only happens when you observe doom). Even if we buy a time-invariant random walk model of belief updating, as the expected size of your updates get larger, you also expect there to be quadratically fewer of them -- e.g. if your updates came in increments of size 0.1 instead of 0.01, you'd expect only 25 such updates!
Applying stochastic process-style reasoning to beliefs is empirically very tricky, and results can vary a lot based on seemingly reasonable assumptions. E.g. I remember Taleb making a bunch of mathematically sophisticated arguments[2] that began with "Let your beliefs take the form of a Wiener process[3]" and then ended with an absurd conclusion, such as that 538's forecasts are obviously wrong because their updates aren't Gaussian distributed or aren't around 50% until immediately before the election date. And famously, reasoning of this kind has often been an absolutely terrible idea in financial markets. So I'm pretty skeptical of claims of this kind in general.
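For what it's worth, the 2500/1600/99 expected-step numbers above can be checked by solving the hitting-time recurrence directly (a sketch in numpy):

```python
import numpy as np

# Expected number of +/-1% updates before a credence random walk is absorbed.
# States k = 0..100 are credences in percent; 0 and 100 are absorbing.
# E_k satisfies E_k = 1 + 0.5*E_{k-1} + 0.5*E_{k+1}, with E_0 = E_100 = 0.
N = 100
A = np.zeros((N - 1, N - 1))
b = np.ones(N - 1)
for i in range(N - 1):  # row i corresponds to starting credence k = i + 1
    A[i, i] = 1.0
    if i > 0:
        A[i, i - 1] = -0.5
    if i < N - 2:
        A[i, i + 1] = -0.5
E = np.linalg.solve(A, b)

print(E[49], E[19], E[0])  # ~2500, ~1600, ~99, matching k*(100-k)
```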
- ^
There's some regularity conditions here, but calibrated beliefs that things you eventually learn the truth/falsity of should satisfy these by default.
- ^
Often in an attempt to Euler people who do forecasting work but aren't super mathematical, like Philip Tetlock.
- ^
This is what happens when you take the limit of the discrete time random walk, as you allow for updates on ever smaller time increments. You get Gaussian distributed increments per unit time -- W_{t+u} - W_t ~ N(0, u) -- and since the tail of your updates is very thin, you continue to get qualitatively similar results to your discrete-time random walk model above.
And yes, it is ironic that Taleb, who correctly points out the folly of normality assumptions repeatedly, often defaults to making normality assumptions in his own work.
When I spoke to him a few weeks ago (a week after he left OAI), he had not signed an NDA at that point, so it seems likely that he hasn't.
Also, another nitpick:
Humane vs human values
I think there's a harder version of the value alignment problem, where the question looks like, "what's the right goal/task spec to put inside a sovereign AI that will take over the universe". You probably don't want this sovereign AI to adopt the values of any particular human, or even modern humanity as a whole, so you need to do some Ambitious Value Learning/moral philosophy and not just intent alignment. In this scenario, the distinction between humane and human values does matter. (In fact, you can find people like Stuart Russell emphasizing this point a bunch.) Unfortunately, it seems that ambitious value learning is really hard, and the AIs are coming really fast, and also it doesn't seem necessary to prevent x-risk, so...
Most people in AIS are trying to solve a significantly less ambitious version of this problem: just try to get an AI that will reliably try to do what a human wants it to do (i.e. intent alignment). In this case, we're explicitly punting the ambitious value learning problem down the line. Here, we're basically not talking about the problem of having an AI learn humane values, but instead the problem of having it "do what its user wants" (i.e. "human values" or "the technical alignment problem" in Nicky's dichotomy). So it's actually pretty accurate to say that a lot of alignment is trying to align AIs wrt "human values", even if a lot of the motivation is trying to eventually make AIs that have "humane values".[1] (And it's worth noting that making an AI that's robustly intent aligned sure seems to require tackling a lot of the 'intuition'-derived problems you bring up already!)
uh, that being said, I'm not sure your framing isn't just ... better anyways? Like, Stuart seems to have lots of success talking to people about assistance games, even if it doesn't faithfully represent what a majority of the field thinks is the highest priority thing to work on. So I'm not sure if me pointing this out actually helps anyone here?
- ^
Of course, you need an argument that "making AIs aligned with user intent" eventually leads to "AIs with humane values", but I think the straightforward argument goes through -- i.e. it seems that a lot of the immediate risk comes from AIs that aren't doing what their users intended, and having AIs that are aligned with user intent seems really helpful for tackling the tricky ambitious value learning problem.
Also, I added another sentence trying to clarify what I meant at the end of the paragraph, sorry for the confusion.
No, I'm saying that "adding 'logic' to AIs" doesn't (currently) look like "figure out how to integrate insights from expert systems/explicit bayesian inference into deep learning", it looks like "use deep learning to nudge the AI toward being better at explicit reasoning by making small changes to the training setup". The standard "deep learning needs to include more logic" take generally assumes that you need to add the logic/GOFAI juice in explicitly, while in practice people do a slightly different RL or supervised finetuning setup instead.
(EDITED to add: so while I do agree that "LMs are bad at the things humans do with 'logic' and good at 'intuition' is a decent heuristic, I think the distinction that we're talking about here is instead about the transparency of thought processes/"how the thing works" and not about if the thing itself is doing explicit or implicit reasoning. Do note that this is a nitpick (as the section header says) that's mainly about framing and not about the core content of the post.)
That being said, I'll still respond to your other point:
Chain of thought is a wonderful thing, it clears a space where the model will just earnestly confess its inner thoughts and plans in a way that isn't subject to training pressure, and so it, in most ways, can't learn to be deceptive about it.
I agree that models with CoT (in faithful, human-understandable English) are more interpretable than models that do all their reasoning internally. And obviously I can't really argue against CoT being helpful in practice; it's one of the clear baselines for eliciting capabilities.
But I suspect you're making a distinction about "CoT" that is actually mainly about supervised finetuning vs RL, and not a benefit of CoT in particular. If the CoT comes from pretraining or supervised fine-tuning, the ~myopic next-token-prediction objective indeed does not apply much training pressure in the relevant ways.[1] Once you start doing any outcome-based supervision (i.e. RL) without good regularization, I think the story for CoT looks less clear. And the techniques people use for improving CoT tend to involve upweighting entire trajectories based on their reward (RLHF/RLAIF with your favorite RL algorithm), which do incentivize playing the training game unless you're very careful with your fine-tuning.
(EDITED to add: Or maybe the claim is, if you do CoT on a 'secret' scratchpad (i.e. one that you never look at when evaluating or training the model), then this would by default produce more interpretable thought processes?)
- ^
I'm not sure this is true in the limit (e.g. it seems plausible to me that the Solomonoff prior is malign). But it's most likely true in the next few years and plausibly true in all practical cases that we might consider.
I think this is really quite good, and went into way more detail than I thought it would. Basically my only complaints on the intro/part 1 are some terminology and historical nitpicks. I also appreciate the fact that Nicky just wrote out her views on AIS, even if they're not always the most standard ones or other people dislike them (e.g. pointing at the various divisions within AIS, and the awkward tension between "capabilities" and "safety").
I found the inclusion of a flashcard review applet for each section super interesting. My guess is it probably won't see much use, and I feel like this is the wrong genre of post for flashcards.[1] But I'm still glad this is being tried, and I'm curious to see how useful/annoying other people find it.
I'm looking forward to parts two and three.
Nitpicks:[2]
Logic vs Intuition:
I think the "logic vs intuition" frame feels like it's pointing at a real thing, but it seems somewhat off. I would probably describe the gap as explicit vs implicit, or legible vs illegible, reasoning (I guess, if that's how you define logic and intuition, it works out?).
Mainly because I'm really skeptical of claims of the form "to make a big advance in/to make AGI from deep learning, just add some explicit reasoning". People have made claims of this form for as long as deep learning has been a thing. Not only have these claims basically never panned out historically, these days "adding logic" often means "train the model harder and include more CoT/code in its training data" or "finetune the model to use an external reasoning aide", and not "replace parts of the neural network with human-understandable algorithms". (EDIT for clarity: That is, I'm skeptical of claims that what's needed to 'fix' deep learning is by explicitly implementing your favorite GOFAI techniques, in part because successful attempts to get AIs to do more explicit reasoning look less like hard-coding in a GOFAI technique and more like other deep learning things.)
I also think this framing mixes together "problems of game theory/high-level agent modeling/outer alignment vs problems of goal misgeneralization/lack of robustness/lack of transparency" and "the kind of AI people did 20-30 years ago" vs "the kind of AI people do now".
This model of logic and intuition (as something to be "unified") is quite similar to a frame of the alignment problem that's common in academia. Namely, our AIs used to be written with known algorithms (so we can prove that the algorithm is "correct" in some sense) and performed only explicit reasoning (so we can inspect the reasoning that led to a decision, albeit often not in anything close to real time). But now it seems like most of the "oomph" comes from learned components of systems such as generative LMs or ViTs, i.e. "intuition". The "goal" is to get a provably* safe AI that can use the "oomph" from deep learning while having enough transparency/explicit enough thought processes. (Though, as in the quote from Bengio in Part 1, sometimes this also gets mixed in with capabilities, and becomes a claim that AIs without interpretable thoughts won't be competent.)
Has AI had a clean "swap" between Logic and Intuition in 2000?
To be clear, Nicky clarifies in Part 1 that this model is an oversimplification. But as a nitpick, I think if you had to pick a date, I'd probably pick 2012, when a conv net won the ImageNet 2012 competition in a dominant manner, and not 2000.
Even more of a nitpick, but the examples seem pretty cherry picked?
For example, Nicky uses Deep Blue defeating Kasparov as an example of a "logic"-based AI. But then again, almost all chess AIs are still pretty much logic based. Using Stockfish as an example: Stockfish 16's explicit alpha-beta search both uses a reasoning algorithm that we can understand, and does the reasoning "in the open". Its neural network eval function is doing (a small amount of) illegible reasoning. While part of the reasoning has become illegible, we can still examine the outputs of the alpha-beta search to understand why certain moves are good/bad. (But fair, this might be by far the most widely known non-deep-learning "AI". The only other examples I can think of are Watson and recommender systems, but those were still using statistical learning techniques. I guess if you count MYCIN or SHRDLU or ELIZA...?)
(And modern diffusion models being unable to count or spell seem like a pathology specific to that class of generative model, and not say, Claude Opus.)
FOOM vs Exponential vs Steady Takeoff
Ryan already mentioned this in his comment.
Even less important and more nitpicky nitpicks:
When did AIs get better than humans (at ImageNet)?
In footnote [3], Nicky writes:
In 1997, IBM's Deep Blue beat Garry Kasparov, the then-world chess champion. Yet, over a decade later in 2013, the best machine vision AI was only 57.5% accurate at classifying images. It was only until 2021, three years ago, that AI hit 95%+ accuracy.
But humans do not get 95% top-1 accuracy[3] on imagenet! If you consult this paper from the imagenet creators (https://arxiv.org/abs/1409.0575), they note that:
We found the task of annotating images with one of 1000 categories to be an extremely challenging task for an untrained annotator. The most common error that an untrained annotator is susceptible to is a failure to consider a relevant class as a possible label because they are unaware of its existence. (Page 31)
And even when using a human expert annotator, who did hundreds of validation images for practice, the human annotator still got a top-5 error of 5.1%, which was surpassed in 2015 by the original ResNet paper (https://arxiv.org/abs/1512.03385) at 4.49% for a single ResNet-152 (and 3.57% for an ensemble of six ResNets).
(Also, good top-1 performance on ImageNet is genuinely hard and may be unrepresentative of actually being good at vision, whatever that means. Take a look at some of the "mistakes" current models make.)
- ^
Using flashcards suggests that you want to memorize the concepts. But a lot of this piece isn't so much an explainer of AI safety, but instead an argument for the importance of AI Safety. Insofar as the reader is not here to learn a bunch of new terms, but instead to reason about whether AIS is a real issue, it feels like flashcards are more of a distraction than an aid.
- ^
I'm writing this in part because I at some point promised Nicky longform feedback on her explainer, but uh, never got around to it until now. Whoops.
- ^
Top-K accuracy = you guess K labels, and you're right if any of them is correct. Top-5 is significantly easier on ImageNet than top-1, because there are a bunch of very similar classes and many images are ambiguous.
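Concretely, top-K accuracy can be computed as below (a minimal numpy sketch; the logits and labels are made up for illustration):

```python
import numpy as np

def top_k_accuracy(logits, labels, k=5):
    """Fraction of examples whose true label is among the k highest-scoring classes."""
    # indices of the k largest logits per example (order within the top k doesn't matter)
    top_k = np.argpartition(logits, -k, axis=1)[:, -k:]
    hits = (top_k == labels[:, None]).any(axis=1)
    return hits.mean()

logits = np.array([[0.1, 0.5, 0.2, 0.9],   # true class 3 is ranked 1st
                   [0.8, 0.1, 0.6, 0.3],   # true class 2 is ranked 2nd
                   [0.9, 0.2, 0.1, 0.4]])  # true class 1 is ranked last
labels = np.array([3, 2, 1])

top_k_accuracy(logits, labels, k=1)  # 1/3: only the first example's top guess is right
top_k_accuracy(logits, labels, k=2)  # 2/3: the second example's label is in its top 2
```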
I agree with many of the points made in this post, especially the "But my ideas/insights/research is not likely to impact much!" point. I find it plausible that in some subfields, AI x-risk people are too prone to publishing due to historical precedent and norms (maybe mech interp? though little has actually come of that). I also want to point out that there are non-zero arguments to expect alignment work to help more with capabilities, relative to existing "mainstream" capabilities work, even if I don't believe this to be the case. (For example, you might believe that the field of deep learning spends too little time actually thinking about how to improve their models, and too much time just tinkering, in which case your thinking could have a disproportionate impact even after adjusting for the fact that you're not trying to do capabilities.) And I think that some of the research labeled "alignment" is basically just capabilities work, and maybe the people doing it should stop.
I also upvoted the post because I think this attitude is pervasive in these circles, and it's good to actually hash it out in public.
But as with most of the commenters, I disagree with the conclusion of the post.
I suspect the main cruxes between us are the following:
1. How much useful alignment work is actually being done?
From paragraphs such as the following:
It's very rare that any research purely helps alignment, because any alignment design is a fragile target that is just a few changes away from unaligned. There is no alignment plan which fails harmlessly if you fuck up implementing it, and people tend to fuck things up unless they try really hard not to (and often even if they do), and people don't tend to try really hard not to. This applies doubly so to work that aims to make AI understandable or helpful, rather than aligned — a helpful AI will help anyone, and the world has more people trying to build any superintelligence (let's call those "capabilities researchers") than people trying to build aligned superintelligence (let's call those "alignment researchers").
And
"But my ideas/insights/research is not likely to impact much!" — that's not particularly how it works? It needs to somehow be differenially helpful to alignment, which I think is almost never the case.
It seems that a big part of your world model is that ~no one who thinks they're doing "alignment" work is doing real alignment work, and are really just doing capabilities work. In particular, it seems that you think interp or intent alignment are basically just capabilities work, insofar as their primary effect is helping people build unsafe ASI faster. Perhaps you think that, in the case of interp, before we can understand the AI in a way that's helpful for alignment, we'll understand it in a way that allows us to improve it. I'm somewhat sympathetic to this argument. But I think making it requires arguing that interp work doesn't really contribute to alignment at all, and is thus better thought of as capabilities work (and same for intent alignment).
Perhaps you believe that all alignment work is useless, not because it's misguided and actually capabilities work, but because we're so far from building aligned ASI that ~all alignment work is useless, while we're in the intermediate regime where additional insights non-negligibly hasten the arrival of unaligned ASI. But I think you should argue for that explicitly (as, say, Eliezer did in his death with dignity post), since I imagine most of the commenters here would disagree with this take.
My guess is this is the largest crux between us; if I thought all "alignment" work did nothing for alignment, and was perhaps just capabilities work in disguise, then I would agree that people should stop. In fact, I might even argue that we should just stop all alignment work whatsoever! Insofar as I'm correct about this being a crux, I'd like to see a post explicitly arguing for the lack of alignment relevancy of existing 'alignment work', which will probably lead to a more constructive conversation than this post.
2. How many useful capabilities insights incidentally come from "alignment" work?
I think empirically, very few (if not zero) capabilities insights have come from alignment work. And a priori, you might expect that research aimed at solving topic X produces marginally more X than research on a related topic Y does. Insofar as you think that current "alignment" work is more than epsilon useful, I think you would not argue that most alignment work is differentially negative. So insofar as you think a lot of "alignment" work is real alignment work, you probably believe that many capabilities insights have come from past alignment work.
Perhaps you're reluctant to give examples, for fear of highlighting them. I think the math doesn't work out here -- having a few clear examples from you would probably be sufficient to significantly reduce the number of published insights from the community as a whole. But, if you have many examples of insights that help capabilities but are too dangerous to highlight, I'd appreciate if you would just say that (and maybe we can find a trusted third party to verify your claim, but not share the details?).
Perhaps you might say, well, the alignment community is very small, so there might not be many examples that come to mind! To make this carry through, you'd still have to believe that the alignment community also hasn't produced much good research. (Even though, naively, you might expect higher returns from alignment due to there being more unpicked low-hanging fruit due to its small size.) But then again, I'd prefer if you explicitly argued that ~all alignment is either useless or capabilities instead of gesturing at a generic phenomenon.
Perhaps you might say that capabilities insights are incredibly long tailed, and thus seeing no examples doesn't mean that the expected harm is low. But, I think you still need to make some sort of plausibility argument here, as well as a story for why the existing ML insights deserve a lot of Shapley for capabilities advances, even though most of the "insights" people had were useless if not actively misleading.
I also think that there's an obvious confounder, if you believe something along the lines of "focusing on alignment is correlated with higher rationality". Personally, I also think the average alignment(-interested) researcher is more competent at machine learning or research in general than the average generic capabilities researcher (this probably becomes false once you condition on being at OAI, Anthropic, or another scaling lab). If you just count "how many good ideas came from 'alignment' researchers per capita" and compare it to the number for 'capability' researchers, you may find that the former is higher because they're just more competent. This goes back again into crux 1., where you then need to argue that competency doesn't help at all in doing actual alignment work, and again, I suspect it's more productive to just argue about the relevance and quality of alignment work instead of arguing about incidental capabilities insights.
3. How important are insights to alignment/capabilities work?
From paragraphs such as the following:
Worse yet: if focusing on alignment is correlated with higher rationality and thus with better ability for one to figure out what they need to solve their problems, then alignment researchers are more likely to already have the ideas/insights/research they need than capabilities researchers, and thus publishing ideas/insights/research about AI is more likely to differentially help capabilities researchers. Note that this is another relative statement; I'm not saying "alignment researchers have everything they need", I'm saying "in general you should expect them to need less outside ideas/insights/research on AI than capabilities researchers".
it seems that you're working with a model of research output with two main components -- (intrinsic) rationality and (external) insights. But there's a huge component that's missing from this model: actual empirical experiments validating the insight, which is the ~bulk of actual capabilities work and a substantial fraction of alignment work. This matters both because ~no capabilities researchers will listen to you if you don't have empirical experiments, and because, if you believe that you can deduce more alignment research "on your own", you might also believe that you need to do more empirical experiments to do capabilities research (and thus that the contribution per insight is by default a lot smaller).
Even if true insights are differentially more helpful for capabilities, the fact that it seems empirically difficult to know which insights are true means that a lot of the work in getting a true insight will involve things that look a lot more like normal capabilities work -- e.g. training more capable models. But surely then, the argument would be reducible to: if you do capabilities work, don't share it on pain of accelerating ASI progress -- which seems like something your audience already agrees with!
That being said, I think I might disagree with your premise here. My guess is that alignment, by being less grounded than capabilities, probably requires more outside ideas/insights/research, just for sanity checking reasons (once you control for competence of researcher and the fact that there's probably more low-hanging fruit in alignment). After all, you can just make a change and see if your log loss on pretraining goes down, but it's a lot harder to know if your model of deceptive alignment is at all sensible. If you don't improve your model's performance on standard benchmarks, then this is evidence that your capability idea doesn't work, but there aren't even really any benchmarks for many of the problems alignment researchers think about. So it's easier to go astray, and therefore more important to get feedback from other researchers.
Finally, to answer this question:
"So where do I privately share such research?" — good question!
I suspect that the way to go is to form working groups of researchers that stick together, and that maintain a high level of trust. e.g. a research organization. Then, do and share your research internally and think about possible externalities before publishing more broadly, perhaps doing a tiered release. (This is indeed the model used by many people in alignment orgs.)
While I've softened my position on this in the last year, I want to give a big +1 to this response, especially these two points:
- It's genuinely hard to come up with ideas that help capabilities a lot. I think you are severely underestimating how hard it is, and how much insight is required. I think one issue here is that most papers on arxiv are garbage and don't actually make any progress, but those papers are not the ones that are pushing AGI forward anyways.
- [..]
- High level ideas are generally not that valuable in and of themselves. People generally learn to ignore ideas unless they have strong empirical evidence of correctness (or endorsement of highly respected researchers) because there are simply too many ideas. The valuable thing is not the idea itself, but the knowledge of which ideas are actually correct.
(emphasis added)
I think it's often challenging to just understand where the frontier is, because it's so far and so many things are secret. And if you're not at a scaling lab and then also don't keep up with the frontier of the literature, it's natural to overestimate the novelty of your insights. And then, if you're too scared to investigate your insights, you might continue to think that your ideas are better than they are. Meanwhile, as an AI Safety researcher, not only is there a lot less distance to the frontier of whatever subfield you're in, you'll probably spend most of your time doing work that keeps you on the frontier.
Random insights can be valuable, but the history of deep learning is full of random insights that were right but for arguably the wrong reasons (batch/layernorm, Adam, arguably the algorithm that would later be rebranded as PPO), as well as brilliant insights that turned out to be basically useless (e.g. consider a lot of the Bayesian neural network stuff, but there are really too many examples to list) if not harmful in the long run (e.g. lots of "clever" or not-so-clever ways of adding inductive bias). Part of the reason is that people don't get taught the history of the field, and so don't see all the oh-so-clever ideas that didn't work, or how a lot of the "insights" were invented post hoc. So if you're new to deep learning you might get the impression that insights were more causally responsible for capabilities advancements than they actually are. Insofar as good alignment requires deconfusion and rationality to generate good insights, and capabilities does not, you should expect that the insights you get from improving rationality/doing deconfusion are more impactful for alignment than capabilities.
I mean, if you actually do come up with a better initialization scheme, a trick that improves GPU utilization, or some other sort of cheap algorithmic trick to improve training AND check it's correct through some small/medium-scale empirical experiments, then sure, please reconsider publishing that. But it's hard to incidentally do that -- even if you do come up with some insight while doing say, mech interp, it feels like going out of your way to test your capability ideas should be a really obvious "you're basically doing capabilities" sign? And maybe, you should be doing the safety work you claim to want to do instead?
I don't know what the "real story" is, but let me point at some areas where I think we were confused. At the time, we had some sort of hand-wavy result in our appendix saying "something something weight norm ergo generalizing". Similarly, concurrent work from Ziming Liu and others (Omnigrok) had another claim based on the norm of generalizing and memorizing solutions, as well as a claim that representation is important.
One issue is that our picture doesn't consider learning dynamics that seem genuinely important here. For example, one mechanism that may explain why weight decay matters so much in the Omnigrok paper: fixing the weight norm to be large leads to an effectively tiny learning rate when you use Adam (which normalizes the gradients to a fixed scale), especially when there's a substantial radial gradient component (which there is when the init is too small or too big). This probably explains both why they found that training error was high when they constrained the weights to be sufficiently large in all their non-toy cases (see e.g. the mod add landscape below), and why we had difficulty using SGD+momentum (which, given our bad initialization, led to gradients that were way too big in some parts of the model, especially since we didn't sweep the learning rate very hard). [1]
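To spell out the Adam point: because Adam rescales each coordinate's gradient by its running RMS, the step size is roughly the learning rate regardless of gradient magnitude, so the relative change to a weight of norm W is about lr/W, and pinning the norm at a large value acts like a tiny effective learning rate. A minimal sketch of the (bias-corrected) first Adam step, just to illustrate the scale invariance (function name and values are mine):

```python
import numpy as np

def adam_first_step(g, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # First Adam update with bias correction: m_hat = g, v_hat = g^2,
    # so the step is ~ lr * g / |g| = lr * sign(g), independent of |g|.
    m = (1 - beta1) * g
    v = (1 - beta2) * g * g
    m_hat = m / (1 - beta1)
    v_hat = v / (1 - beta2)
    return lr * m_hat / (np.sqrt(v_hat) + eps)

adam_first_step(1e-3)  # ~1e-3
adam_first_step(1e+3)  # ~1e-3: same step despite a million-fold larger gradient
```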
There are also some theoretical results from SLT-related folks about how generalizing circuits achieve lower train loss per parameter (i.e. have higher circuit efficiency) than memorizing circuits (at least for large p), which seems to be a part of the puzzle that neither our work nor the Omnigrok paper touched on -- why is it that generalizing solutions have lower norm? IIRC one of our explanations was that weight decay "favored more distributed solutions" (somewhat false) and "it sure seems empirically true", but we didn't have anything better than that.
There was also the really basic idea of how a relu/gelu network may do multiplication (by piecewise-linear approximations of x^2, or by using the quadratic region of the gelu for x^2), which (I think) was first described in late 2022 in Ekin Akyürek's "Transformers can implement Sherman-Morrison for closed-form ridge regression" paper? (That's not the name, just the headline result.)
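For concreteness, here's one version of that trick (my own sketch, not necessarily the paper's exact construction): for small inputs, gelu(x) + gelu(-x) = x·erf(x/√2) ≈ √(2/π)·x², and squaring gives multiplication via the polarization identity xy = ((x+y)² − (x−y)²)/4.

```python
import math

def gelu(x):
    # exact (erf-based) GELU
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def approx_square(x, eps=0.01):
    # gelu(e*x) + gelu(-e*x) = e*x*erf(e*x/sqrt(2)) ~ sqrt(2/pi)*(e*x)^2 for small e*x,
    # so rescaling recovers x^2 up to O((e*x)^2) relative error.
    return (gelu(eps * x) + gelu(-eps * x)) / (eps**2 * math.sqrt(2.0 / math.pi))

def approx_mul(x, y):
    # polarization identity: xy = ((x+y)^2 - (x-y)^2) / 4
    return (approx_square(x + y) - approx_square(x - y)) / 4.0

approx_mul(1.3, -0.7)  # ~ -0.91
```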
Part of the story for grokking in general may also be related to the Tensor Programs results, which claim that with standard init the gradient on the embedding is too small relative to the gradient on other parts of the model. (Also, the embed at init is too small relative to the unembed.) Because the embed is both too small and barely updated, there's no representation learning going on, as opposed to just random-feature regression (which overfits in the same way that regression on random features overfits absent regularization).
In our case, it turns out not to be true (because our network is tiny? because our weight decay is set aggressively at lambda=1?), since the weights that directly contribute to logits (W_E, W_U, W_O, W_V, W_in, W_out) all quickly converge to the same size (weight decay encourages spreading out weight norm between things you multiply together), while the weights that do not directly contribute all converge to zero.
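The "weight decay spreads norm between things you multiply together" claim can be checked in the simplest case: for a scalar factorization w = a·b, gradient descent with weight decay shrinks a² − b² exponentially (under gradient flow, d(a² − b²)/dt = −4λ(a² − b²)), so the factors converge to equal magnitude. A toy demo with made-up target and hyperparameters:

```python
# Minimize (a*b - 1)^2 + lam*(a^2 + b^2) from an unbalanced init.
lam, lr = 0.1, 0.05
a, b = 3.0, 0.1
for _ in range(5000):
    resid = a * b - 1.0
    grad_a = 2.0 * resid * b + 2.0 * lam * a
    grad_b = 2.0 * resid * a + 2.0 * lam * b
    a -= lr * grad_a
    b -= lr * grad_b

# The factors equalize (a ~ b), while the product settles at 1 - lam = 0.9
# (the stationarity condition 2*(s^2 - 1)*s + 2*lam*s = 0 gives s^2 = 1 - lam).
```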
Bringing it back to the topic at hand: there's often a lot of "small" confusion that remains, even after doing good toy-models work. It's not clear how much any of these confusions matter (and do any of the grokking results from our paper, Ziming Liu et al, or the GDM grokking paper actually matter?).
- ^
Haven't checked, might do this later this week.
I think the key takeaway I wanted people to get is that superposition is something novel and non-trivial, and isn't just a standard polysemantic neuron thing. I wrote this post in response to two interactions where people assumed that superposition was just polysemanticity.
It turned out that a substantial fraction of the post went the other way (i.e. talking about non-superposition polysemanticity), so maybe?
Also, have you looked at the dot product of each of the SAE directions/SAE-reconstructed representations with the ImageNet labels fed through the text encoder?
Cool work!
As with Arthur, I'm pretty surprised by how much easier vision seems to be than text for interp (in line with previous results). It makes sense why feature visualization and adversarial attacks work better with continuous inputs, but if it is true that you need fewer datapoints to recover concepts of comparable complexity, I wonder if it's a statement about image datasets or about vision in general (e.g. "abstract" concepts are more useful for prediction, since the n-gram/skip n-gram/syntactical feature baseline is much weaker).
I think the most interesting result to me is your result where the loss went down (!!):
Note that the model with the SAE attains a lower loss than the original model. It is not clear to me why this is the case. In fact, the model with the SAE gets a lower loss than the original model within 40 000 training tokens.
My guess is this happens because CLIP wasn't trained on ImageNet -- but instead on a much larger dataset that comes from a different distribution. A lot of the SAE residual probably consists of features that are useful on the larger dataset, but not on ImageNet. If you extract the directions of variation on ImageNet instead of OAI's 400m image-text pair dataset, it makes sense why reconstructing inputs using only these directions leads to better performance on the dataset you found those directions on.
I'm not sure how you computed the contrastive loss here -- is it just the standard contrastive loss, but on image pairs instead of image/text pairs (using the SAE'd ViT for both representations), or did you use the contextless class label as the text input (SAE'ing only the ViT, but not the text encoder)? Either way, this might add additional distributional shift.
(And I could be misunderstanding what you did entirely -- maybe you actually looked at the contrastive loss on the original dataset somehow, in which case the explanation I gave above doesn't apply.)
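For reference, by "the standard contrastive loss" I mean the symmetric InfoNCE loss CLIP trains with, sketched below in numpy (variable names are mine; the temperature value is made up). The open question is just what the second embedding is: the SAE-reconstructed image embedding, or the text encoder's embedding of the class label.

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # Normalize, then compute the (N, N) cosine-similarity logits.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature

    def xent_diag(l):
        # cross-entropy where the correct "class" for row i is column i
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Symmetric: images classify their paired texts and vice versa.
    return 0.5 * (xent_diag(logits) + xent_diag(logits.T))
```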