Posts

Improving Dictionary Learning with Gated Sparse Autoencoders 2024-04-25T18:43:47.003Z
AtP*: An efficient and scalable method for localizing LLM behaviour to components 2024-03-18T17:28:37.513Z
Fact Finding: Do Early Layers Specialise in Local Processing? (Post 5) 2023-12-23T02:46:25.892Z
Fact Finding: How to Think About Interpreting Memorisation (Post 4) 2023-12-23T02:46:16.675Z
Fact Finding: Trying to Mechanistically Understanding Early MLPs (Post 3) 2023-12-23T02:46:05.517Z
Fact Finding: Simplifying the Circuit (Post 2) 2023-12-23T02:45:49.675Z
Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level (Post 1) 2023-12-23T02:44:24.270Z
Discussion: Challenges with Unsupervised LLM Knowledge Discovery 2023-12-18T11:58:39.379Z
Explaining grokking through circuit efficiency 2023-09-08T14:39:23.910Z
Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla 2023-07-20T10:50:58.611Z
Shah (DeepMind) and Leahy (Conjecture) Discuss Alignment Cruxes 2023-05-01T16:47:41.655Z
[Linkpost] Some high-level thoughts on the DeepMind alignment team's strategy 2023-03-07T11:55:01.131Z
Categorizing failures as “outer” or “inner” misalignment is often confused 2023-01-06T15:48:51.739Z
Definitions of “objective” should be Probable and Predictive 2023-01-06T15:40:30.813Z
Refining the Sharp Left Turn threat model, part 2: applying alignment techniques 2022-11-25T14:36:08.948Z
Threat Model Literature Review 2022-11-01T11:03:22.610Z
Clarifying AI X-risk 2022-11-01T11:03:01.144Z
More examples of goal misgeneralization 2022-10-07T14:38:00.288Z
[AN #173] Recent language model results from DeepMind 2022-07-21T02:30:02.115Z
[AN #172] Sorry for the long hiatus! 2022-07-05T06:20:03.943Z
DeepMind is hiring for the Scalable Alignment and Alignment Teams 2022-05-13T12:17:13.157Z
Learning the smooth prior 2022-04-29T21:10:18.064Z
Shah and Yudkowsky on alignment failures 2022-02-28T19:18:23.015Z
[AN #171]: Disagreements between alignment "optimists" and "pessimists" 2022-01-21T18:30:04.824Z
Conversation on technology forecasting and gradualism 2021-12-09T21:23:21.187Z
[AN #170]: Analyzing the argument for risk from power-seeking AI 2021-12-08T18:10:04.022Z
[AN #169]: Collaborating with humans without human data 2021-11-24T18:30:03.795Z
[AN #168]: Four technical topics for which Open Phil is soliciting grant proposals 2021-10-28T17:20:03.387Z
[AN #167]: Concrete ML safety problems and their relevance to x-risk 2021-10-20T17:10:03.690Z
[AN #166]: Is it crazy to claim we're in the most important century? 2021-10-08T17:30:11.819Z
[AN #165]: When large models are more likely to lie 2021-09-22T17:30:04.674Z
[AN #164]: How well can language models write code? 2021-09-15T17:20:03.850Z
[AN #163]: Using finite factored sets for causal and temporal inference 2021-09-08T17:20:04.522Z
[AN #162]: Foundation models: a paradigm shift within AI 2021-08-27T17:20:03.831Z
[AN #161]: Creating generalizable reward functions for multiple tasks by learning a model of functional similarity 2021-08-20T17:20:04.380Z
[AN #160]: Building AIs that learn and think like people 2021-08-13T17:10:04.335Z
[AN #159]: Building agents that know how to experiment, by training on procedurally generated games 2021-08-04T17:10:03.823Z
[AN #158]: Should we be optimistic about generalization? 2021-07-29T17:20:03.409Z
[AN #157]: Measuring misalignment in the technology underlying Copilot 2021-07-23T17:20:03.424Z
[AN #156]: The scaling hypothesis: a plan for building AGI 2021-07-16T17:10:05.809Z
BASALT: A Benchmark for Learning from Human Feedback 2021-07-08T17:40:35.045Z
[AN #155]: A Minecraft benchmark for algorithms that learn without reward functions 2021-07-08T17:20:02.518Z
[AN #154]: What economic growth theory has to say about transformative AI 2021-06-30T17:20:03.292Z
[AN #153]: Experiments that demonstrate failures of objective robustness 2021-06-26T17:10:02.819Z
[AN #152]: How we’ve overestimated few-shot learning capabilities 2021-06-16T17:20:04.454Z
[AN #151]: How sparsity in the final layer makes a neural net debuggable 2021-05-19T17:20:04.453Z
[AN #150]: The subtypes of Cooperative AI research 2021-05-12T17:20:27.267Z
[AN #149]: The newsletter's editorial policy 2021-05-05T17:10:03.189Z
[AN #148]: Analyzing generalization across more axes than just accuracy or loss 2021-04-28T18:30:03.066Z
FAQ: Advice for AI Alignment Researchers 2021-04-26T18:59:52.589Z

Comments

Comment by Rohin Shah (rohinmshah) on Explaining grokking through circuit efficiency · 2024-04-26T16:09:58.499Z · LW · GW

Sounds plausible, but why does this differentially impact the generalizing algorithm over the memorizing algorithm?

Perhaps under normal circumstances both are learned so fast that you just don't notice that one is slower than the other, and this slows both of them down enough that you can see the difference?

Comment by Rohin Shah (rohinmshah) on AXRP Episode 29 - Science of Deep Learning with Vikrant Varma · 2024-04-26T08:56:08.285Z · LW · GW

Daniel Filan: But I would’ve guessed that there wouldn’t be a significant complexity difference between the frequencies. I guess there’s a complexity difference in how many frequencies you use.

Vikrant Varma: Yes. That’s one of the differences: how many you use and their relative strength and so on. Yeah, I’m not really sure. I think this is a question we pick out as a thing we would like to see future work on.

My pet hypothesis here is that (a) by default, the network uses whichever frequencies were highest at initialization (for which there is significant circumstantial evidence) and (b) the amount of interference differs significantly based on which frequencies you use (which in turn changes the quality of the logits holding parameter norm fixed, and thus changes efficiency).

In principle this can be tested by randomly sampling frequency sets, simulating the level of interference you'd get, and using that to estimate the efficiency + critical dataset size for that grokking circuit. This gives you a predicted distribution over critical dataset sizes, which you could compare against the actual distribution.
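A minimal sketch of what that test could look like (purely illustrative; the idealized cosine-logit circuit and the margin-based interference proxy are my assumptions, not anything from the paper):

```python
import numpy as np

p = 113
rng = np.random.default_rng(0)

def interference_proxy(freqs: np.ndarray) -> float:
    """Average margin between the correct logit and the best incorrect logit
    for an idealized cosine-sum modular-addition circuit using `freqs`."""
    a, b = rng.integers(0, p, size=(2, 256))            # random input pairs
    c = np.arange(p)                                     # candidate answers
    # logit[i, k] = sum over frequencies f of cos(2*pi*f*(a_i + b_i - c_k) / p)
    phase = 2 * np.pi * freqs[None, None, :] * ((a + b)[:, None, None] - c[None, :, None]) / p
    logits = np.cos(phase).sum(-1)
    correct = logits[np.arange(len(a)), (a + b) % p]
    mask = np.eye(p, dtype=bool)[(a + b) % p]            # mask out the correct answer
    best_wrong = np.where(mask, -np.inf, logits).max(-1)
    return float((correct - best_wrong).mean())

# Distribution of interference over randomly sampled frequency sets; mapping this
# to predicted efficiencies / critical dataset sizes is the (omitted) harder step.
samples = [interference_proxy(rng.choice(np.arange(1, (p - 1) // 2 + 1), 5, replace=False))
           for _ in range(200)]
print(np.mean(samples), np.std(samples))
```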

Tbc there are other hypotheses too, e.g. perhaps different frequency sets are easier / harder to implement by the neural network architecture.

Comment by Rohin Shah (rohinmshah) on Improving Dictionary Learning with Gated Sparse Autoencoders · 2024-04-26T06:50:24.121Z · LW · GW

This suggestion seems less expressive than (but similar in spirit to) the "rescale & shift" baseline we compare to in Figure 9. The rescale & shift baseline is sufficient to resolve shrinkage, but it doesn't capture all the benefits of Gated SAEs.

The core point is that L1 regularization adds lots of biases, of which shrinkage is just one example, so you want to localize the effect of L1 as much as possible. In our setup L1 applies to the gate path, so you might think of the gate path as "tainted", and want to use it as little as possible. The only thing you really need L1 for is to deter the model from setting too many features active, i.e. you need it to apply to one bit per feature (whether that feature is on / off). The Heaviside step function makes sure we are extracting just that one bit, and relying on the magnitude path for everything else.
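To make the "one bit per feature" point concrete, here's a rough sketch of where the Heaviside and the L1 penalty sit (a paraphrase for illustration only; the names and exact parameterization are assumptions, see the paper for the real details):

```python
import numpy as np

def gated_encode(x, W_gate, b_gate, W_mag, b_mag):
    pi_gate = x @ W_gate + b_gate            # gate path pre-activations
    active = (pi_gate > 0).astype(x.dtype)   # Heaviside: one bit per feature (on / off)
    mags = np.maximum(x @ W_mag + b_mag, 0)  # magnitudes come from the untainted path
    f = active * mags                        # feature activations used for reconstruction
    # The sparsity penalty only ever sees the gate path, so biases like shrinkage
    # don't leak into the magnitudes that the reconstruction actually uses.
    l1_penalty = np.maximum(pi_gate, 0).sum(-1)
    return f, l1_penalty
```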

Comment by Rohin Shah (rohinmshah) on Improving Dictionary Learning with Gated Sparse Autoencoders · 2024-04-25T22:54:11.997Z · LW · GW

Thinking on this a bit more, this might actually reflect a general issue with the way we think about feature shrinkage; namely, that whenever there is a nonzero angle between two vectors of the same length, the best way to make either vector close to the other will be by shrinking it.

This was actually the key motivation for building this metric in the first place, instead of just looking at the raw norm ratio. Looking at the rescaling coefficient that would optimize the reconstruction loss ensures that we're capturing only bias from the L1 regularization, and not capturing the "inherent" need to shrink the vector given these nonzero angles. (In particular, if we computed that raw norm ratio for Gated SAEs, I expect it would be below 1.)

I think the main thing we got wrong is that we accidentally treated  as though it were . To the extent that was the main mistake, I think it explains why our results still look how we expected them to -- usually  is going to be close to 1 (and should be almost exactly 1 if shrinkage is solved), so in practice the error introduced from this mistake is going to be extremely small.

We're going to take a closer look at this tomorrow, check everything more carefully, and post an update after doing that. I think it's probably worth waiting for that -- I expect we'll provide much more detailed derivations that make everything a lot clearer.

Comment by Rohin Shah (rohinmshah) on Improving Dictionary Learning with Gated Sparse Autoencoders · 2024-04-25T21:52:45.883Z · LW · GW

Possibly I'm missing something, but if you don't have the auxiliary loss, then the only gradients to the gate encoder's weights and bias come from the sparsity penalty (the binarizing Heaviside activation function kills gradients from the reconstruction loss), and so the gate pre-activations would always be non-positive to get perfect zero sparsity loss. (That is, if you only optimize for L1 sparsity, the obvious solution is "none of the features are active".)

(You could use a smooth activation function as the gate, e.g. an element-wise sigmoid, and then you could just stick with the loss from the beginning of Section 3.2.2.)

Comment by Rohin Shah (rohinmshah) on Transformers Represent Belief State Geometry in their Residual Stream · 2024-04-19T07:59:33.163Z · LW · GW

Is it accurate to summarize the headline result as follows?

  • Train a Transformer to predict next tokens on a distribution generated from an HMM.
  • One optimal predictor for this data would be to maintain a belief over which of the three HMM states we are in, and perform Bayesian updating on each new token. That is, it maintains the posterior P(hidden state | tokens so far).
  • Key result: A linear probe on the residual stream is able to reconstruct this posterior over hidden states.

(I don't know what Computational Mechanics or MSPs are so this could be totally off.)
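For concreteness, the Bayesian updating in the second bullet looks like this (the transition and emission matrices below are made-up placeholders):

```python
import numpy as np

T = np.array([[0.8, 0.1, 0.1],   # T[i, j] = P(next state j | current state i)
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])
E = np.array([[0.9, 0.1],        # E[i, o] = P(observe token o | state i)
              [0.5, 0.5],
              [0.1, 0.9]])

def update_belief(belief, token):
    predicted = belief @ T                 # propagate the belief through the transitions
    posterior = predicted * E[:, token]    # condition on the observed token
    return posterior / posterior.sum()     # renormalize

belief = np.ones(3) / 3                    # uniform prior over the three states
for tok in [0, 1, 1, 0]:
    belief = update_belief(belief, tok)
# The claimed result: a linear probe recovers this `belief` vector from the residual stream.
```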

EDIT: Looks like yes. From this post:

Part of what this all illustrates is that the fractal shape is kinda… baked into any Bayesian-ish system tracking the hidden state of the Markov model. So in some sense, it’s not very surprising to find it linearly embedded in activations of a residual stream; all that really means is that the probabilities for each hidden state are linearly represented in the residual stream.

Comment by Rohin Shah (rohinmshah) on AI #57: All the AI News That’s Fit to Print · 2024-04-05T08:50:12.509Z · LW · GW

How certain are you that this is always true

My probability that (EDIT: for the model we evaluated) the base model outperforms the finetuned model (as I understand that statement) is so small that it is within the realm of probabilities that I am confused about how to reason about (i.e. model error clearly dominates). Intuitively (excluding things like model error), even 1 in a million feels like it could be too high.

My probability that the model sometimes stops talking about some capability without giving you an explicit refusal is much higher (depending on how you operationalize it, I might be effectively-certain that this is true, i.e. >99%) but this is not fixed by running evals on base models.

(Obviously there's a much much higher probability that I'm somehow misunderstanding what you mean. E.g. maybe you're imagining some effort to elicit capabilities with the base model (and for some reason you're not worried about the same failure mode there), maybe you allow for SFT but not RLHF, maybe you mean just avoid the safety tuning, etc)

Comment by Rohin Shah (rohinmshah) on AI #57: All the AI News That’s Fit to Print · 2024-04-04T06:48:16.335Z · LW · GW

Oh yes, sorry for the confusion, I did mean "much less capable".

Certainly RLHF can get the model to stop talking about a capability, but usually this is extremely obvious because the model gives you an explicit refusal? Certainly if we encountered that we would figure out some way to make that not happen any more.

Comment by Rohin Shah (rohinmshah) on AI #57: All the AI News That’s Fit to Print · 2024-03-29T19:39:30.362Z · LW · GW

Surely you mean something else, e.g. models without safety tuning? If you run them on base models the scores will be much worse.

Comment by Rohin Shah (rohinmshah) on AI #57: All the AI News That’s Fit to Print · 2024-03-29T19:36:56.065Z · LW · GW

(Speaking only for myself. This may not represent the views of even the other paper authors, let alone Google DeepMind as a whole.)

Did you notice that Gemini Ultra did worse than Gemini Pro at many tasks? This is even true under ‘honest mode’ where the ‘alignment’ or safety features of Ultra really should not be getting in the way. Ultra is in many ways flat out less persuasive. But clearly it is a stronger model. So what gives?

Fwiw, my sense is that a lot of the persuasion results are being driven by factors outside of the model's capabilities, so you shouldn't conclude too much from Pro outperforming Ultra.

For example, in "Click Links" one pattern we noticed was that you could get surprisingly (to us) good performance just by constantly repeating the ask (this is called "persistence" in Table 3) -- apparently this does actually make it more likely that the human does the thing (instead of making them suspicious, as I would have initially guessed). I don't think the models "knew" that persistence would pay off and "chose" that as a deliberate strategy; I'd guess they had just learned a somewhat myopic form of instruction-following where on every message they are pretty likely to try to do the thing we instructed them to do (persuade people to click on the link). My guess is that these sorts of factors varied in somewhat random ways between Pro and Ultra, e.g. maybe Ultra was better at being less myopic and more subtle in its persuasion -- leading to worse performance on Click Links.

That is driven home even more on the self-proliferation tasks, why does Pro do better on 5 out of 9 tasks?

Note that lower is better on that graph, so Pro does better on 4 tasks, not 5. All four of the tasks are very difficult tasks where both Pro and Ultra are extremely far from solving the task -- on the easier tasks Ultra outperforms Pro. For the hard tasks I wouldn't read too much into the exact numeric results, because we haven't optimized the models as much for these settings. For obvious reasons, helpfulness tuning tends to focus on tasks the models are actually capable of doing. So e.g. maybe Ultra tends to be more confident in its answers on average to make it more reliable at the easy tasks, at the expense of being more confidently wrong on the hard tasks. Also in general the methodology is hardly perfect and likely adds a bunch of noise; I think it's likely that the differences between Pro and Ultra on these hard tasks are smaller than the noise.

This is also a problem. If you only use ‘minimal’ scaffolding, you are only testing for what the model can do with minimal scaffolding. The true evaluation needs to use the same tools that it will have available when you care about the outcome. This is still vastly better than no scaffolding, and provides the groundwork (I almost said ‘scaffolding’ again) for future tests to swap in better tools.

Note that the "minimal scaffolding" comment applied specifically to the persuasion results; the other evaluations involved a decent bit of scaffolding (needed to enable the LLM to use a terminal and browser at all).

That said, capability elicitation (scaffolding, tool use, task-specific finetuning, etc) is one of the priorities for our future work in this area.

Fundamentally what is the difference between a benchmark capabilities test and a benchmark safety evaluation test like this one? They are remarkably similar. Both test what the model can do, except here we (at least somewhat) want the model to not do so well. We react differently, but it is the same tech.

Yes, this is why we say these are evaluations for dangerous capabilities, rather than calling them safety evaluations.

I'd say that the main difference is that dangerous capability evaluations are meant to evaluate plausibility of certain types of harm, whereas a standard capabilities benchmark is usually meant to help with improving models. This means that standard capabilities benchmarks often have as a desideratum that there are "signs of life" with existing models, whereas this is not a desideratum for us. For example, I'd say there are basically no signs of life on the self-modification tasks; the models sometimes complete the "easy" mode but the "easy" mode basically gives away the answer and is mostly a test of instruction-following ability.

Perhaps we should work to integrate the two approaches better? As in, we should try harder to figure out what performance on benchmarks of various desirable capabilities also indicate that the model should be capable of dangerous things as well.

Indeed this sort of understanding would be great if we could get it (in that it can save a bunch of time). My current sense is that it will be quite hard, and we'll just need to run these evaluations in addition to other capability evaluations.

Comment by Rohin Shah (rohinmshah) on Some (problematic) aesthetics of what constitutes good work in academia · 2024-03-13T13:32:53.056Z · LW · GW

It was helpfully explaining a possibly-confusing conceptual point. It would have made a nice little blog post. Alas! After the authors translated their nice little conceptual clarification into academic-ese, including thorough literature reviews, formalizations, and so on, it came out to 22 pages.

Fwiw I don't think the main paper would have been much shorter if we'd aimed to write a blog post instead, unless we changed our intended audience. It's a sufficiently nuanced conceptual point that you do need most of the content that is in there.

We could have avoided the appendices, but then we're relying on people to trust us when we make a claim that something is a theorem, since we're not showing the proof. We could have avoided implementing the examples in a real codebase, though I do think iterating on the examples in actual code made them better, and also people wouldn't have believed us when we said you can solve this with deep RL (in fact even after we actually implemented it some people still didn't believe me, or at least were very confused, when I said that).

Iirc I was more annoyed by the peer reviews for similar reasons to what you say.

(Btw you can see some of my thoughts on this topic in the answer to "So what does academia care about, and how is it different from useful research?" in my FAQ.)

Comment by Rohin Shah (rohinmshah) on Analogies between scaling labs and misaligned superintelligent AI · 2024-02-22T22:44:59.468Z · LW · GW

I feel like a lot of these arguments could pretty easily be made about individual AI safety researchers. E.g.

Misaligned Incentives

In much the same way that AI systems may have perverse incentives, so do the [AI safety researchers]. They are [humans]. They need to make money, [feed themselves, and attract partners]. [Redacted and redacted even just got married.] This type of accountability to [personal] interests is not perfectly in line with doing what is good for human interests. Moreover, [AI safety researchers are often] technocrats whose values and demographics do not represent humanity particularly well. Optimizing for the goals that the [AI safety researchers] have is not the same thing as optimizing for human welfare. Goodhart’s Law applies. 

I feel pretty similarly about most of the other arguments in this post.

Tbc I think there are plenty of things one could reasonably critique scaling labs about, I just think the argumentation in this post is by and large off the mark, and implies a standard that if actually taken literally would be a similarly damning critique of the alignment community.

(Conflict of interest notice: I work at Google DeepMind.)

Comment by Rohin Shah (rohinmshah) on The case for ensuring that powerful AIs are controlled · 2024-01-28T18:44:56.995Z · LW · GW

Sounds reasonable, though idk what you think realistic values of N are (my wild guess with hardly any thought is 15 minutes - 1 day).

EDIT: Tbc in the 1 day case I'm imagining that most of the time goes towards running the experiment -- it's more a claim about what experiments we want to run. If we just talk about the time to write the code and launch the experiment I'm thinking of N in the range of 5 minutes to 1 hour.

Comment by Rohin Shah (rohinmshah) on The case for ensuring that powerful AIs are controlled · 2024-01-28T18:23:11.060Z · LW · GW

Cool, that all roughly makes sense to me :)

I was certainly imagining at least some amount of multi-tasking (e.g. 4 projects at once each of which runs 8x faster). This doesn't feel that crazy to me, I already do a moderate amount of multi-tasking.

Multi-tasking where you are responsible for the entire design of the project? (Designing the algorithm, choosing an experimental setting and associated metrics, knowing the related work, interpreting the results of the experiments, figuring out what the next experiment should be, ...)

Suppose today I gave you a device where you put in moderately detailed instructions for experiments, and the device returns the results[1] with N minutes of latency and infinite throughput. Do you think you can spend 1 working day using this device to produce the same output as 4 copies of yourself working in parallel for a week (and continue to do that for months, after you've exhausted low-hanging fruit)?

... Having written this hypothetical out, I am finding it more plausible than before, at least for small enough N, though it still feels quite hard at e.g. N = 60.

  1. ^

    The experiments can't use too much compute. No solving the halting problem.

Comment by Rohin Shah (rohinmshah) on The case for ensuring that powerful AIs are controlled · 2024-01-28T16:36:00.235Z · LW · GW

I agree it helps to run experiments at small scales first, but I'd be pretty surprised if that helped to the point of enabling a 30x speedup -- that means that the AI labor allows you to get a 30x improvement in compute needed beyond what would be done by default by humans (though the 30x can include e.g. improving utilization, it's not limited just to making individual experiments take less time).

I think the most plausible case for your position would be that the compute costs for ML research scale much less than quadratically with the size of the pretrained model, e.g. maybe (1) finetuning starts taking fewer data points as model size increases (sample efficiency improves with model capability), and so finetuning runs become a rounding error on compute, and (2) the vast majority of ML research progress involves nothing more expensive than finetuning runs. (Though in this world you have to wonder why we keep training bigger models instead of just investing solely in better finetuning the current biggest model.)

Another thing that occurred to me is that latency starts looking like another major bottleneck. Currently it seems feasible to make a paper's worth of progress in ~6 months. With a 30x speedup, you now have to do that in 6 days. At that scale, introducing additional latency via experiments at small scales is a huge cost. 

(I'm assuming here that the ideas and overall workflow are still managed by human researchers, since your hypothetical said that the AIs are just going from high level ideas to implemented experiments. If you have fully automated AI researchers then they don't need to optimize latency as hard; they can instead get 30x speedup by having 30x as many researchers working but still producing a paper every 6 months.)

(Another possibility is that human ML researchers get really good at multi-tasking, and so e.g. they have 5 paper-equivalents at any given time, each of which takes 30 calendar days to complete. But I don't believe that (most) human ML researchers are that good at multitasking on research ideas, and there isn't that much time for them to learn.)

It also seems hard for the human researchers to have ideas good enough to turn into paper-equivalents every 6 days. Also hard for those researchers to keep on top of the literature well enough to be proposing stuff that actually makes progress rather than duplicating existing work they weren't aware of, even given AI tools that help with understanding the literature.

Further, the current scaling laws imply huge inference availability if huge amounts of compute are used for training.

Tbc the fact that running your automated ML implementers takes compute was a side point; I'd be making the same claims even if running the AIs was magically free.

Though even at a billion token-equivalents per second it seems plausible to me that your automated ML experiment implementers end up being a significant fraction of that compute. It depends quite significantly on how capable a single forward pass is, e.g. can the AI just generate an entire human-level pull request autoregressively (i.e. producing each token of the PR one at a time, without going back to fix errors) vs does it do similar things as humans (write tests and code, test, debug, eventually submit) vs. does it do way more iteration and error correction than humans (in parallel to avoid crazy high latency), do we use best-of-N sampling or similar tricks to improve quality of generations, etc.

Comment by Rohin Shah (rohinmshah) on The case for ensuring that powerful AIs are controlled · 2024-01-27T09:35:24.478Z · LW · GW

I think ML research in particular can plausibly be accelerated by maybe 30x by only making it extremely fast and cheap to go from high level ideas to implemented experiments (rather than needing to generate these high level ideas)

Why doesn't compute become the bottleneck well before the 30x mark? It seems like the AIs have to be superhuman at something to overcome that bottleneck (rather than just making it fast and cheap to implement experiments). Indeed the AIs make the problem somewhat worse, since you have to spend compute to run the AIs.

Comment by Rohin Shah (rohinmshah) on Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training · 2024-01-18T07:31:17.483Z · LW · GW

Come on, the claim "the evidence suggests that if the current ML systems were trying to deceive us, we wouldn't be able to change them not to" absent any other qualifiers seems pretty clearly false. It is pretty important to qualify that you are talking about deceptive alignment or backdoors specifically (e.g. I'm on board with Ryan's phrasing).

There's a huge disanalogy between your paper's setup and deception-in-general, which is that in your paper's setup there is no behavioral impact at training time. Deception-in-general (e.g. sycophancy) often has behavioral impacts at training time and that's by far the main reason to expect that we could address it.

Fwiw I thought the paper was pretty good at being clear that it was specifically deceptive alignment and backdoors that the claim applied to. But if you're going to broaden that to a claim like "the evidence suggests that if the current ML systems were trying to deceive us, we wouldn't be able to change them not to" without any additional qualifiers I think that's a pretty big overclaim, and also I want to bet you on whether we can reduce sycophancy today.

Comment by Rohin Shah (rohinmshah) on What’s up with LLMs representing XORs of arbitrary features? · 2024-01-12T17:20:01.940Z · LW · GW

I think you mostly need to hope that it doesn't matter (because the crazy XOR directions aren't too salient) or come up with some new idea.

Yeah certainly I'd expect the crazy XOR directions aren't too salient.

I'll note that if it ends up these XOR directions don't matter for generalization in practice, then I start to feel better about CCS (along with other linear probing techniques). I know that for CCS you're more worried about issues around correlations with features like true_according_to_Alice, but my feeling is that we might be able to handle spurious features that are that crazy and numerous, but not spurious features as crazy and numerous as these XORs.

Imo "true according to Alice" is nowhere near as "crazy" a feature as "has_true XOR has_banana". It seems useful for the LLM to model what is true according to Alice! (Possibly I'm misunderstanding what you mean by "crazy" here.)

I'm not against linear probing techniques in general. I like linear probes, they seem like a very useful tool. I also like contrast pairs. But I would basically always use these techniques in a supervised way, because I don't see a great reason to expect unsupervised methods to work better.

If I had to articulate my reason for being surprised here, it'd be something like:

  1. I didn't expect LLMs to compute many XORs incidentally
  2. I didn't expect LLMs to compute many XORs because they are useful

but lots of XORs seem to get computed anyway.

This is reasonable. My disagreement is mostly that I think LLMs are complicated things and do lots of incidental stuff we don't yet understand. So I shouldn't feel too surprised by any given observation that could be explained by an incidental hypothesis. But idk it doesn't seem like an important point.

Comment by Rohin Shah (rohinmshah) on Simulators · 2024-01-12T09:27:22.328Z · LW · GW

Yeah, agreed that's a clear overclaim.

In general I believe that many (most?) people take it too far and make incorrect inferences -- partly on priors about popular posts, and partly because many people including you believe this, and those people engage more with the Simulators crowd than I do.

Fwiw I was sympathetic to nostalgebraist's positive review saying:

sometimes putting a name to what you "already know" makes a whole world of difference. [...] I see these takes, and I uniformly respond with some version of the sentiment "it seems like you aren't thinking of GPT as a simulator!"

I think in all three of the linked cases I broadly directionally agreed with nostalgebraist, and thought that the Simulator framing was at least somewhat helpful in conveying the point. The first one didn't seem that important (it was critiquing imo a relatively minor point), but the second and third seemed pretty direct rebuttals of popular-ish views. (Note I didn't agree with all of what was said, e.g. nostalgebraist doesn't seem at all worried about a base GPT-1000 model, whereas I would put some probability on doom for malign-prior reasons. But this feels more like "reasonable disagreement" than "wildly misled by simulator framing".)

Comment by Rohin Shah (rohinmshah) on Simulators · 2024-01-11T20:15:30.869Z · LW · GW

Yeah, I would be surprised if this is a good first-order approximation of what is going on inside an LLM. Or maybe you mean this in a non-mechanistic way?

Yes, I definitely meant this in the non-mechanistic way. Any mechanistic claims that sound simulator-flavored based just on the evidence in this post sounds clearly overconfident and probably wrong. I didn't reread this post carefully but I don't remember seeing mechanistic claims in it.

I agree that in a non-mechanistic way, the above will produce reasonable predictions, but that's because that's basically a description of the task the LLM is trained on. [...]

I mostly agree and this is an aspect of what I mean by "this post says obvious and uncontroversial things". I'm not particularly advocating for this post in the review; I didn't find it especially illuminating.

To give a concrete counterexample to the algorithm you propose for predicting what an LLM does next. Current LLMs have a broader knowledge base than any human alive. This means the algorithm of "figure out what real-world process would produce text like this" can't be accurate

This seems somewhat in conflict with the previous quote?

Re: the concrete counterexample, yes I am in fact only making claims about base models; I agree it doesn't work for RLHF'd models. Idk how you want to weigh the fact that this post basically just talks about base models in your review, I don't have a strong opinion there.

I think it is in fact hard to get a base model to combine pieces of knowledge that tend not to be produced by any given human (e.g. writing an epistemically sound rap on the benefits of blood donation), and that often the strategy to get base models to do things like this is to write a prompt that makes it seem like we're in the rare setting where text is being produced by an entity with those abilities.

Comment by Rohin Shah (rohinmshah) on What’s up with LLMs representing XORs of arbitrary features? · 2024-01-11T19:52:53.823Z · LW · GW

The thing that's confusing here is that the two-way XORs that my experiments are looking at just seem clearly not useful for anything.

Idk, I think it's pretty hard to know what things are and aren't useful for predicting the next token. For example, some of your features involve XORing with a "has_not" feature -- XORing with an indicator for "not" might be exactly what you want to do to capture the effect of the "not".

(Tbc here the hypothesis could be "the model computes XORs with has_not all the time, and then uses only some of them", so it does have some aspect of "compute lots of XORs", but it is still a hypothesis that clearly by default doesn't produce multiway XORs.)
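As a toy illustration of why the has_not XOR in particular is plausibly useful (my example, not from the post): whether a possibly-negated statement is true is literally the XOR of the base claim's truth with has_not.

```python
# columns: base claim's truth, has_not, truth of the stated (possibly negated) claim
for base_truth, has_not in [(1, 0), (1, 1), (0, 0), (0, 1)]:
    stated_truth = base_truth ^ has_not   # e.g. "the sky is blue" vs "the sky is not blue"
    print(base_truth, has_not, stated_truth)
```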

In contrast, the point I'm trying to make in the post is that RAX can cause problems even in the absence of spurious correlations like this.[1]

  1. ^

    If you want you could rephrase this issue as "a and a XOR b are spuriously correlated in training," so I guess I should say "even in the absence of spurious correlations among basic features."

... That's exactly how I would rephrase the issue and I'm not clear on why you're making a sharp distinction here.

As you noted, it will sometimes be the case that XOR features are more like basic features than derived features, and thus will be represented with high salience. I think incidental hypotheses will have a really hard time explaining this -- do you agree?

I mean, I'd say the ones that are more like basic features are like that because it was useful, and it's all the other XORs that are explained by incidental hypotheses. The incidental hypotheses shouldn't be taken to be saying that all XORs are incidental, just the ones which aren't explained by utility. Perhaps a different way of putting it is that I expect both utility and incidental hypotheses to be true to some extent.

Maybe on your model this is something simple like the weights computing the basic features being larger than weights computing derived features? If so, that's the tracking I'm talking about, and is a potential thread to pull on for distinguishing basic vs. derived features using model internals.

Yes, on my model it could be something like the weights for basic features being large. It's not necessarily that simple, e.g. it could also be that the derived features are in superposition with a larger number of other features that leads to more interference. If you're calling that "tracking", fair enough I guess; my main claim is that it shouldn't be surprising. I agree it's a potential thread for distinguishing such features.

Comment by Rohin Shah (rohinmshah) on Simulators · 2024-01-11T08:09:28.211Z · LW · GW

I think the main thing I'd point to is this section (where I've changed bullet points to numbers for easier reference):

I can’t convey all that experiential data here, so here are some rationalizations of why I’m partial to the term, inspired by the context of this post:

  1. The word “simulator” evokes a model of real processes which can be used to run virtual processes in virtual reality.
  2. It suggests an ontological distinction between the simulator and things that are simulated, and avoids the fallacy of attributing contingent properties of the latter to the former.
  3. It’s not confusing that multiple simulacra can be instantiated at once, or an agent embedded in a tragedy, etc.
  4. It does not imply that the AI’s behavior is well-described (globally or locally) as expected utility maximization. An arbitrarily powerful/accurate simulation can depict arbitrarily hapless sims.
  5. It does not imply that the AI is only capable of emulating things with direct precedent in the training data. A physics simulation, for instance, can simulate any phenomena that plays by its rules.
  6. It emphasizes the role of the model as a transition rule that evolves processes over time. The power of factored cognition / chain-of-thought reasoning is obvious.
  7. It emphasizes the role of the state in specifying and constructing the agent/process. The importance of prompt programming for capabilities is obvious if you think of the prompt as specifying a configuration that will be propagated forward in time.
  8. It emphasizes the interactive nature of the model’s predictions – even though they’re “just text”, you can converse with simulacra, explore virtual environments, etc.
  9. It’s clear that in order to actually do anything (intelligent, useful, dangerous, etc), the model must act through simulation of something.

I think (2)-(8) are basically correct, (1) isn't really a claim, and (9) seems either false or vacuous. So I mostly feel like the core thesis as expressed in this post is broadly correct, not wrong. (I do feel like people have taken it further than is warranted, e.g. by expecting internal mechanisms to actually involve simulations, but I don't think those claims are in this post.)

I also think it does in fact constrain expectations. Here's a claim that I think this post points to: "To predict what a base model will do, figure out what real-world process was most likely to produce the context so far, then predict what text that real-world process would produce next, then adopt that as your prediction for what GPT would do". Taken literally this is obviously false (e.g. you can know that GPT is not going to factor a large prime). But it's a good first-order approximation, and I would still use that as an important input if I were to predict today how a base model is going to continue to complete text.

(Based on your other comments maybe you disagree with the last paragraph? That surprises me. I want to check that you are specifically thinking of base models and not RLHF'd or instruction tuned models.)

Personally I agree with janus that these are (and were) mostly obvious and uncontroversial things -- to people who actually played with / thought about LLMs. But I'm not surprised that LWers steeped in theoretical / conceptual thinking about EU maximizers and instrumental convergence without much experience with practical systems (at least at the time this post was written) found these claims / ideas to be novel.

Comment by Rohin Shah (rohinmshah) on What’s up with LLMs representing XORs of arbitrary features? · 2024-01-10T09:58:39.214Z · LW · GW

Nice post, and glad this got settled experimentally! I think it isn't quite as counterintuitive as you make it out to be -- the observations seem like they have reasonable explanations.

I feel pretty confident that there's a systematic difference between basic features and derived features, where the basic features are more "salient" -- I'll be assuming such a distinction in the rest of the comment.

(I'm saying "derived" rather than "XOR" because it seems plausible that some XOR features are better thought of as "basic", e.g. if they were very useful for the model to compute. E.g. the original intuition for CCS is that "truth" is a basic feature, even though it is fundamentally an XOR in the contrast pair approach.)

For the more mechanistic explanations, I want to cluster them into two classes of hypotheses:

  1. Incidental explanations: Somehow "high-dimensional geometry" and "training dynamics" means that by default XORs of basic features end up being linearly represented as a side effect / "by accident". I think Fabien's experiments and Hoagy's hypothesis fit here.
    1. I think most mechanistic explanations here will end up implying a decay postulate that says "the extent to which an incidental-XOR happens decays as you have XORs amongst more and more basic features". This explains why basic features are more salient than derived features.
  2. Utility explanations: Actually it's often quite useful for downstream computations to be able to do logical computations on boolean variables, so during training there's a significant incentive to represent the XOR to make that happen.
    1. Here the reason basic features are more salient is that basic features are more useful for getting low loss, and so the model allocates more of its "resources" to those features. For example, it might use more parameter norm (penalized by weight decay) to create higher-magnitude activations for the basic features.

I think both of the issues you raise have explanations under both classes of hypotheses.

Exponentially many features:

An easy counting argument shows that the number of multi-way XORs of N features is ~2^N. [...] There are two ways to resist this argument, which I’ll discuss in more depth later in “What’s going on?”:

  • To deny that XORs of basic features are actually using excess model capacity, because they’re being represented linearly “by accident” or as an unintended consequence of some other useful computation. (By analogy, the model automatically linearly represents ANDs of arbitrary features without having to expend extra capacity.)
  • To deny forms of RAX that imply multi-way XORs are linearly represented, with the model somehow knowing to compute a XOR b and b XOR c, but not a XOR b XOR c.

While I think the first option is possible, my guess is that it's more like the second option.

On incidental explanations, this is explained by the decay postulate. For example, maybe once you hit 3-way XORs, the incidental thing is much less likely to happen, and so you get ~N^2 pairwise XORs instead of the full ~2^N set of multi-way XORs.
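Just to put numbers on the gap between those two regimes (simple arithmetic; N = 100 is an arbitrary choice):

```python
from math import comb

N = 100
pairwise_xors = comb(N, 2)   # ~N^2 / 2 two-way XORs: 4,950
multiway_xors = 2 ** N       # roughly one per subset of features: ~1.27e30
print(pairwise_xors, multiway_xors)
```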

On utility explanations, you would expect that multi-way XORs are much less useful for getting low loss than two-way XORs, and so computation for multi-way XORs is never developed.

Generalization:

logistic regression on the train set would learn the direction v_a + v_{a⊕b} where v_f is the direction representing a feature f. [...] the argument above would predict that linear probes will completely fail to generalize from train to test. This is not the result that we typically see [...]

One of these assumptions involves asserting that “basic” feature directions (those corresponding to a and b) are “more salient” than directions representing XORs – that is, the variance along v_a and v_b is larger than variance along v_{a⊕b}. However, I’ll note that:

  • it’s not obvious why something like this would be true, suggesting that we’re missing a big part of the story for why linear probes ever generalize;
  • even if “basic” feature directions are more salient, the argument here still goes through to a degree, implying a qualitatively new reason to expect poor generalization from linear probes.

For the first point I'd note that (1) the decay postulate for incidental explanations seems quite natural and (2) the "derived features are less useful than basic features and so have fewer resources allocated to them" story seems sufficient for utility explanations.

For the second point, I'm not sure that the argument does go through. In particular you now have two possible outs:

  1. Maybe if v_a is twice as salient as v_{a⊕b}, you learn a linear probe that is entirely v_a, or close enough to it (e.g. if it is exponentially closer). I'd guess this isn't the explanation, but I don't actually know what linear probe learning theory predicts here.
  2. Even if you do learn v_a + v_{a⊕b}, it doesn't seem obvious that test accuracy should be < 100%. In particular, if v_a is more salient by having activations that are twice as large, then it could be that even when b flips from 0 to 1 and the v_{a⊕b} component is reversed, v_a still overwhelms v_{a⊕b} and so every input is still classified correctly (with slightly less confidence than before).
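A quick numerical version of that second out (a toy sketch under assumed conditions: ±1 feature coding, exactly orthogonal directions, and the basic direction having twice the magnitude):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32
v_a = rng.normal(size=d)
v_xor = rng.normal(size=d)
v_xor -= (v_xor @ v_a) / (v_a @ v_a) * v_a          # make the two directions exactly orthogonal
v_a, v_xor = v_a / np.linalg.norm(v_a), v_xor / np.linalg.norm(v_xor)

def activation(a, b):
    # +/-1 coding; the basic feature a is twice as salient as the a XOR b feature.
    return 2.0 * (2 * a - 1) * v_a + 1.0 * (2 * (a ^ b) - 1) * v_xor

probe = v_a + v_xor          # the mixed probe you might learn when b is constant in training
for b in (0, 1):             # b = 0 mimics training, b = 1 mimics the test-time shift
    preds = [int(activation(a, b) @ probe > 0) for a in (0, 1)]
    print(f"b={b}: predictions for a=0,1 -> {preds}")   # correct ([0, 1]) in both cases
```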

On the other hand, RAX introduces a qualitatively new way that linear probes can fail to learn good directions. Suppose a is a feature you care about (e.g. “true vs. false statements”) and b is some unrelated feature which is constant in your training data (e.g. b = “relates to geography”). [...]

This is wild. It implies that you can’t find a good direction for your feature unless your training data is diverse with respect to every feature that your LLM linearly represents.

Fwiw, failures like this seem plausible without RAX as well. We explicitly make this argument in our goal misgeneralization paper (bottom of page 9 / Section 4.2), and many of our examples follow this pattern (e.g. in Monster Gridworld, you see a distribution shift from "there is almost always a monster present" in training to "there are no monsters present" at test time).

I agree strong RAX without any saliency differences between features would imply this problem is way more widespread than it seems to be in practice, but I don't think it's a qualitatively new kind of generalization failure (and also I think strong RAX without saliency differences is clearly false).

Maybe models track which features are basic and enforce that these features be more salient

In other words, maybe the LLM is recording somewhere the information that a and b are basic features; then when it goes to compute a XOR b, it artificially makes this direction less salient. And when the model computes a new basic feature as a boolean function of other features, it somehow notes that this new feature should be treated as basic and artificially increases the salience along the new feature direction.

I don't think the model has to do any active tracking; on both hypotheses this happens by default (in incidental explanations, because of the decay postulate, and in utility explanations, because the a XOR b feature is less useful and so fewer resources go towards computing it).

Comment by Rohin Shah (rohinmshah) on Discussion: Challenges with Unsupervised LLM Knowledge Discovery · 2023-12-20T08:49:28.211Z · LW · GW

Are you saying that this claim is supported by PCA visualizations you've done?

Yes, but they're not in the paper. (I also don't remember if these visualizations were specifically on banana/shed or one of the many other distractor experiments we did.)

I'll say that I've done a lot of visualizing true/false datasets with PCA, and I've never noticed anything like this, though I never had as clean a distractor feature as banana/shed.

It is important for the distractor to be clean (otherwise PCA might pick up on other sources of variance in the activations as the principal components).

More broadly, it seems like you're saying that you think in general, when LLMs have linearly-represented features a and b they will also tend to linearly represent the feature a XOR b. Taking this as an empirical claim about current models, this would be shocking.

I don't want to make a claim that this will always hold; models are messy and there could be lots of confounders that make it not hold in general. For example, the construction I mentioned uses 3 dimensions to represent 2 variables; maybe in some cases this is too expensive and the model just uses 2 dimensions and gives up the ability to linearly read arbitrary functions of those 2 variables. Maybe it's usually not helpful to compute boolean functions of 2 boolean variables, but in the specific case where you have a statement followed by Yes / No it's especially useful (e.g. because the truth value of the Yes / No is the XOR of No / Yes with the truth value of the previous sentence).

My guess is that this is a motif that will reoccur in other natural contexts as well. But we haven't investigated this and I think of it as speculation.

For example, if I've done my geometry right, this would predict that if you train a supervised probe (e.g. with logistic regression) to classify a vs ¬a on a dataset where b is always true, the resulting probe should get ~50% accuracy on a test dataset where b is always false. And this should apply for any features a, b. But this is certainly not the typical case, at least as far as I can tell!

If you linearly represent a, b, and a XOR b, then given this training setup you could learn a classifier that detects the a direction or the a XOR b direction or some mixture between the two. In general I would expect that the a direction is more prominent / more salient / cleaner than the a XOR b direction, and so it would learn a classifier based on that, which would lead to ~100% accuracy on the test dataset.

If you use normalization to eliminate the a direction as done in CCS, then I expect you learn a classifier aligned with the a XOR b direction, and you get ~0% accuracy on the test dataset. This isn't the typical result, but it also isn't the typical setup; it's uncommon to use normalization to eliminate particular directions.

(Similarly, if you don't do the normalization step in CCS, my guess is that nearly all of our experiments would just show CCS learning the is_negative(x) probe, rather than the has_banana(x) XOR is_negative(x) probe.)

Concretely, if we were to prepare a dataset of 2-token prompts where the first word is always "true" or "false" and the second word is always "banana" or "shed," do you predict that a probe trained with logistic regression on the dataset  will have poor accuracy when tested on ?

These datasets are incredibly tiny (size two) so I'm worried about noise, but let's say you pad the prompts with random sentences from some dataset to get larger datasets.

If you used normalization to remove the  direction, then yes, that's what I'd predict. Without normalization I predict high test accuracy.

(Note there's a typo in your test dataset -- it should be .)

Comment by Rohin Shah (rohinmshah) on Discussion: Challenges with Unsupervised LLM Knowledge Discovery · 2023-12-19T09:33:42.069Z · LW · GW

(To summarize the parallel thread)

The claim is that the learned probe is p(x) = has_banana(x) XOR is_negative(x). As shown in Theorem 1, if you chug through the math with this probe, it gets low CCS loss and leads to an induced classifier that just predicts has_banana(q).*

You might be surprised that this is possible, because the CCS normalization is supposed to eliminate is_negative(x) -- but what the normalization does is remove linearly-accessible information about is_negative(x). However, has_banana(x) XOR is_negative(x) is not linearly accessible from is_negative(x), and it is encoded by the LLM using a near-orthogonal dimension of the residual stream, so it is not removed by the normalization.

*Notation:

q is a question or statement whose truth value we care about

x is one half of a contrast pair created from q

has_banana(x) is 1 if the statement ends with banana, and 0 if it ends with shed

is_negative(x) is 1 if the contrast pair is negative (i.e. ends with "False" or "No") and 0 if it is positive.
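Given that notation, here's a tiny worked check (in code for concreteness) that a probe of the form has_banana(x) XOR is_negative(x) is perfectly consistent on every contrast pair while only ever reading off the distractor:

```python
def probe(has_banana, is_negative):
    return has_banana ^ is_negative

for has_banana in (0, 1):                       # banana-ending vs shed-ending statements
    p_pos, p_neg = probe(has_banana, 0), probe(has_banana, 1)
    assert p_pos == 1 - p_neg                   # CCS consistency holds regardless of truth
    induced = 0.5 * (p_pos + (1 - p_neg))       # the averaged prediction for the pair
    assert induced == has_banana                # ...which just classifies banana vs shed
print("consistent on every pair, independent of the statement's truth value")
```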

Comment by Rohin Shah (rohinmshah) on Discussion: Challenges with Unsupervised LLM Knowledge Discovery · 2023-12-19T09:22:00.191Z · LW · GW

The point is that while the normalization eliminates is_negative(x), it does not eliminate has_banana(x) XOR is_negative(x), and it turns out that LLMs really do encode the XOR linearly in the residual stream.

Why does the LLM do this? Suppose you have two boolean variables a and b. If the neural net uses three dimensions to represent a, b, and a XOR b, I believe that allows it to recover arbitrary boolean functions of a and b linearly from the residual stream. So you might expect the LLM to do this "by default" because of how useful it is for downstream computation. In such a setting, if you normalize based on a, that will remove the a direction, but it will not remove the b and a XOR b directions. Empirically when we do PCA visualizations this is what we observe.
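A quick toy check of the "arbitrary boolean functions" claim, using ±1 coding for the three directions plus a bias coordinate:

```python
import itertools
import numpy as np

def rep(a, b):
    # one coordinate each for a, b, and a XOR b (in +/-1 coding), plus a bias
    return np.array([2 * a - 1, 2 * b - 1, 2 * (a ^ b) - 1, 1.0])

inputs = list(itertools.product((0, 1), repeat=2))
X = np.array([rep(a, b) for a, b in inputs])
for name, fn in [("AND", lambda a, b: a & b), ("OR", lambda a, b: a | b),
                 ("NAND", lambda a, b: 1 - (a & b)), ("IMPLIES", lambda a, b: 1 - a + a * b)]:
    y = np.array([fn(a, b) for a, b in inputs], dtype=float)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    assert np.allclose(X @ w, y), name        # an exact linear read-off exists
print("checked: AND, OR, NAND, IMPLIES all have exact linear read-offs")
```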

Note that the intended behavior of CCS on e.g. IMDb is to learn the probe p(x) = has_positive_sentiment(x) XOR is_negative(x), so it's not clear how you'd fix this problem with more normalization, without also breaking the intended use case.

In terms of the paper: Theorems 1 and 2 describe the distractor probe, and in particular they explicitly describe what the probe learns, though it doesn't talk about why this defeats the normalization.

Note that the definition in that theorem is equivalent to has_banana(x) XOR is_negative(x).

Comment by Rohin Shah (rohinmshah) on How do you feel about LessWrong these days? [Open feedback thread] · 2023-12-11T16:55:41.874Z · LW · GW

Thanks for the edit :)

As I mentioned elsewhere (not this website) I don't agree with "will reliably lead people to false beliefs", if we're talking about ML people rather than LW people (as was my audience for that blog post).

I do think that it's a reasonable hypothesis to have, and I assign it more likelihood than I would have a year ago (in large part from you pushing some ML people on this point, and them not getting it as fast as I would have expected).

Comment by Rohin Shah (rohinmshah) on How do you feel about LessWrong these days? [Open feedback thread] · 2023-12-10T14:47:31.111Z · LW · GW

Like, just look at this quote from the post you mentioned:

Unfortunately, AI systems trained with reinforcement learning only optimize features specified in the reward function and are indifferent to anything we might’ve inadvertently left out. 

And you probably didn't even select that post for this particular misunderstanding.

(Presumably you are talking about how reward is not the optimization target.)

While I agree that the statement is not literally true, I am still basically on board with that sentence and think it's a reasonable shorthand for the true thing.

I expect that I understood the "reward is not the optimization target" point at the time of writing that post (though of course predicting what your ~5-years-ago self knew is quite challenging without specific quotes to refer to).

I am confident I understood the point by the time I was working on the goal misgeneralization project (late 2021), since almost every example we created involved predicting ahead of time a specific way in which reward would fail to be the optimization target.

Comment by Rohin Shah (rohinmshah) on How do you feel about LessWrong these days? [Open feedback thread] · 2023-12-10T14:42:33.618Z · LW · GW

Some thoughts on my journey in particular:

  1. When I joined AI safety in late 2017 (having read approximately nothing in the field), I thought of the problem as "construct a utility function for an AI system to optimize", with a key challenge being the fragility of value. In hindsight this was clearly wrong.
    1. The Value Learning sequence was in large part a result of my journey away from the utility function framing.
    2. That being said, I suspect I continued to think that fragility-of-value type issues were a significant problem, probably until around mid-2019 (see next point).
      1. (I did continue some projects more motivated from a fragility-of-value perspective, partly out of a heuristic of actually finishing things I start, and partly because I needed to write a PhD thesis.)
  2. Early on, I thought of generalization as a key issue for deep learning and expected that vanilla deep learning would not lead to AGI for this reason. Again, in hindsight this was clearly wrong.
    1. I was extremely surprised by OpenAI Five in June 2018 (not just that it worked, but also the ridiculous simplicity of the methods, in particular the lack of any hierarchical RL) and had to think through that.
    2. I spent a while trying to understand that (at least months, e.g. you can see me continuing to be skeptical of deep learning in this Dec 2018 post).
    3. I think I ended up close to my current views around early-to-mid-2019, e.g. I still broadly agree with the things I said in this August 2019 conversation (though I'll note I was using "mesa optimizer" differently than it is used today -- I agree with what I meant in that conversation, though I'd say it differently today).
      1. I think by this point I was probably less worried about fragility of value. E.g. in that conversation I say a bunch of stuff that implies it's less of a problem, most notably that AI systems will likely learn similar features as humans just from gradient descent, for reasons that LW would now call "natural abstractions".

Note that this comment is presenting the situation as a lot cleaner than it actually was. I would bet there were many ways in which I was irrational / inconsistent, probably some times where I would have expressed verbally that fragility of value wasn't a big deal but would still have defended research projects centered around it from some other perspective, etc.

Some thoughts on how to update based on past things I wrote:

  1. I don't think I've ever thought of myself as largely agreeing with LW: my relationship to LW has usually been "wow, they seem to be getting some obvious stuff wrong" (e.g. I was persuaded of slow takeoff basically when Paul's post and AI Impacts' post came out in Feb 2018, the Value Learning sequence in late 2018 was primarily in response to my perception that LW was way too anchored on the "construct a utility function" framing).
  2. I think you don't want to update too hard on the things that were said in blog posts addressed to an ML audience, or in papers that were submitted to conferences. Especially for the papers, there's just a lot of stuff you can't say about why you're doing the work, because peer reviewers would object (e.g. I heard second hand of a particularly egregious review to the effect of: "this work is technically solid, but the motivation is AGI safety; I don't believe in AGI, so this paper should be rejected").

Comment by Rohin Shah (rohinmshah) on Incidental polysemanticity · 2023-12-08T08:06:27.349Z · LW · GW

Good point on the rotational symmetry, that makes sense now.

I still think that this assumption is fairly realistic because in practice, most pairs of unrelated features would co-occur only very rarely, and I expect the winner-take-all dynamic to dominate most of the time. But I agree that it would be nice to quantify this and test it out.

Agreed that's a plausible hypothesis. I mostly wish that in this toy model you had a hyperparameter for the frequency of co-occurrence of features, and identified how it affects the rate of incidental polysemanticity.

Comment by Rohin Shah (rohinmshah) on The “no sandbagging on checkable tasks” hypothesis · 2023-12-06T09:33:34.671Z · LW · GW

I think I agree with all of that (with the caveat that it's been months and I only briefly skimmed the past context, so further thinking is unusually likely to change my mind).

Comment by Rohin Shah (rohinmshah) on Incidental polysemanticity · 2023-12-03T13:19:09.691Z · LW · GW

My guess is that this result is very sensitive to the design of the training dataset:

the input/output data pairs are $(e_i, e_i)$ for $i = 1, \dots, n$, where $e_i$ is the $i$-th basis vector.

In particular, I think it is likely very sensitive to the implicit assumption that feature i and feature j never co-occur on a single input. I'd be interested to see experiments where each feature is turned on with some (not too small) probability, independently of all other features, similarly to the original toy models setting. This would result in some inputs where feature i and j are on simultaneously. My prediction would be that polysemanticity goes down very significantly (probably to zero if the probabilities are high enough and the training is done for long enough).
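
To make the experiment I'm suggesting concrete, here is a minimal sketch (all sizes, probabilities, and coefficients are hypothetical placeholders, not the post's actual setup):

```python
import torch
import torch.nn as nn

n_features, n_neurons, p_on = 64, 64, 0.05  # hypothetical sizes and firing probability

def sample_batch(batch_size: int) -> torch.Tensor:
    # Each feature fires independently of all the others, so unrelated
    # features occasionally co-occur on the same input.
    return (torch.rand(batch_size, n_features) < p_on).float()

W_in = nn.Parameter(0.02 * torch.randn(n_features, n_neurons))
W_out = nn.Parameter(0.02 * torch.randn(n_neurons, n_features))
opt = torch.optim.Adam([W_in, W_out], lr=1e-3)

for _ in range(10_000):
    x = sample_batch(1024)
    acts = torch.relu(x @ W_in)                                   # hidden activations
    recon = acts @ W_out
    loss = ((recon - x) ** 2).mean() + 1e-3 * acts.abs().mean()   # L1 penalty on activations
    opt.zero_grad(); loss.backward(); opt.step()

# Rough polysemanticity check: count neurons whose input weights are large for more than one feature.
n_poly = ((W_in.abs() > 0.5 * W_in.abs().max()).sum(dim=0) > 1).sum()
```

My prediction is that n_poly shrinks as p_on grows, because co-occurring features make "benign collisions" costly.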

I also don't understand why L1 regularization on activations is necessary to show incidental polysemanticity given your setup. Even if you remove the L1 regularization on activations, it is still the case that "benign collisions" impose no cost on the model, since feature i and feature j are never simultaneously present in a given input. So if you do get a benign collision, what causes it to go away? Overall my expectation would be that without the L1 regularization on activations (and with the training dataset as described in this post), you'd get a complicated mess where every neuron is highly polysemantic, i.e. even more polysemanticity than described in this post. Why is that wrong?

Comment by Rohin Shah (rohinmshah) on How useful is mechanistic interpretability? · 2023-12-02T21:18:26.860Z · LW · GW

One piece missing here, insofar as current methods don't get to 99% of loss recovered, is repeatedly drilling into the residual until they do get to 99%.

When you do that using existing methods, you lose the sparsity (e.g. for circuit finding you have to include a large fraction of the model to get to 99% loss recovered).

It's of course possible that this is because the methods are bad, though my guess is that at the 99% standard this is reflecting non-sparsity / messiness in the territory (and so isn't going to go away with better methods). I do expect we can improve; we're very far from the 99% standard. But the way we improve won't be by "drilling into the residual"; that has been tried and is insufficient. EDIT: Possibly by "drill into the residual" you mean "understand why the methods don't work and then improve them" -- if so I agree with that but also think this is what mech interp researchers want to do.

(Why am I still optimistic about interpretability? I'm not convinced that the 99% standard is required for downstream impact -- though I am pretty pessimistic about the "enumerative safety" story of impact, basically for the same reasons as Buck and Ryan afaict.)
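
(For reference, by "X% of loss recovered" I mean roughly the standard metric, though conventions differ a bit across papers:

$$ \text{loss recovered} = \frac{L_{\text{ablated}} - L_{\text{patched}}}{L_{\text{ablated}} - L_{\text{clean}}} $$

where $L_{\text{clean}}$ is the full model's loss, $L_{\text{patched}}$ is the loss with the candidate circuit or reconstruction spliced in, and $L_{\text{ablated}}$ is the loss with the relevant activations ablated, e.g. zero- or mean-ablated.)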

Comment by Rohin Shah (rohinmshah) on How useful is mechanistic interpretability? · 2023-12-01T08:37:13.841Z · LW · GW

This seems like exactly what mech interp is doing? Circuit finding is all about finding sparse subgraphs. It continues to work with large models, when trying to explain a piece of the behavior of the large model. SAE stands for sparse autoencoder: the whole point is to find the basis in which you get sparsity. I feel like a lot of mech interp has been almost entirely organized around the principle of modularity / sparsity, and the main challenge is that it's hard (you don't get to 99% of loss recovered, even on pieces of behavior, while still being meaningfully sparse).
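
(For readers less familiar with the jargon, a sparse autoencoder here is roughly the following. This is a minimal sketch; real setups differ in details like weight tying, bias handling, and the exact sparsity penalty.)

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Learn an overcomplete basis in which model activations are sparse."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))   # sparse feature activations
        return self.decoder(features), features

def sae_loss(recon, acts, features, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that pushes feature activations toward sparsity.
    return ((recon - acts) ** 2).mean() + l1_coeff * features.abs().mean()
```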

Comment by Rohin Shah (rohinmshah) on Focus on the Hardest Part First · 2023-10-01T14:38:43.552Z · LW · GW

See also Research as a Stochastic Decision Process

Comment by Rohin Shah (rohinmshah) on Three ways interpretability could be impactful · 2023-09-18T15:37:54.163Z · LW · GW

This has been stated many times before (I believe I heard it in Chris Olah’s 80k episode first) but worth reiterating.

The reference I like best is https://colah.github.io/notes/interp-v-neuro/

Comment by Rohin Shah (rohinmshah) on Explaining grokking through circuit efficiency · 2023-09-18T15:18:34.419Z · LW · GW

Unless by "shrugs" you mean the details of what the partial hypothesis says in this particular case are still being worked out.

Yes, that's what I mean.

I do agree that it's useful to know whether a partial hypothesis says anything or not; overall I think this is good info to know / ask for. I think I came off as disagreeing more strongly than I actually did, sorry about that.

Do you have any plans to do this?

No, we're moving on to other work: this took longer than we expected, and was less useful for alignment than we hoped (though that part wasn't that unexpected; from the start we expected "science of deep learning" to be more hits-based, or to require significant progress before it actually became useful for practical proposals).

How much time do you think it would take?

Actually running the experiments should be pretty straightforward: I'd expect we could do them in a week given our codebase, possibly even a day. Others might take some time to set up a good codebase, but I'd still be surprised if it took a strong engineer longer than two weeks to get some initial results. This gets you observations like "under the particular settings we chose, D_crit tends to increase / decrease as the number of layers increases".

The hard part is then interpreting those results and turning them into something more generalizable -- including handling confounds. For example, maybe for some reason the principled thing to do is to reduce the learning rate as you increase layers, and once you do that your observation reverses -- this is a totally made up example but illustrates the kind of annoying things that come up when doing this sort of research, that prevent you from saying anything general. I don't know how long it would take if you want to include that; it could be quite a while (e.g. months or years).

And do you have any predictions for what should happen in these cases?

Not really. I've learned from experience not to try to make quantitative predictions yet. We tried to make some theory-inspired quantitative predictions in the settings we studied, and they fell pretty flat.

For example, in our minimal model in Section 3 we have a hyperparameter that determines how param norm and logits scale together -- initially, that was our guess of what would happen in practice (i.e. we expected circuit param norm <> circuit logits to obey a power law relationship in actual grokking settings). But basically every piece of evidence we got seemed to falsify that hypothesis (e.g. Figure 3 in the paper).

(I say "seemed to falsify" because it's still possible that we're just failing to deal with confounders in some way, or measuring something that isn't exactly what we want to measure. For example, Figure 3 logits are not of the Mem circuit in actual grokking setups, but rather the logits produced by networks trained on random labels -- maybe there's a relevant difference between these.)

Comment by Rohin Shah (rohinmshah) on Explaining grokking through circuit efficiency · 2023-09-10T08:40:56.596Z · LW · GW

Which of these theories [...] can predict the same "four novel predictions about grokking" yours did? The relative likelihoods are what matters for updates after all.

I disagree with the implicit view on how science works. When you are a computationally bounded reasoner, you work with partial hypotheses, i.e. hypotheses that only make predictions on a small subset of possible questions, and just shrug at other questions. This is mostly what happens with the other theories:

  1. Difficulty of representation learning: Shrugs at our prediction about the relative efficiencies of the memorising and generalising circuits, anti-predicts ungrokking (since in that case the representation has already been learned), shrugs at semi-grokking.
  2. Scale of parameters at initialisation: Shrugs at all of our predictions. If you interpret it as making a strong claim that scale of parameters at initialisation is the crucial thing (i.e. other things mostly don't matter) then it anti-predicts semi-grokking.
  3. Spikes in loss / slingshots: Shrugs at all of our predictions.
  4. Random walks among optimal solutions: Shrugs at our prediction about the relative efficiencies of the memorising and generalising circuits. I'm not sure what this theory says about what happens after you hit the generalising solution -- can you then randomly walk away from the generalising solution? If yes, then it predicts that if you train for a long enough time without changing the dataset, a grokked network will ungrok (false in our experiments, and we often trained for much longer than the time to grok); if no, then it anti-predicts ungrokking and semi-grokking.
  5. Simplicity of the generalising solution: This is our explanation. Our paper is basically an elaboration, formalization, and confirmation of Nanda et al's theory, as we allude to in the next sentence after the one you quoted.

how does this theory explain other grokking related pheonmena e.g. Omni-Grok?

My speculation for Omni-Grok in particular is that in settings like MNIST you already have two of the ingredients for grokking (that there are both memorising and generalising solutions, and that the generalising solution is more efficient), and then having large parameter norms at initialisation provides the third ingredient (generalising solutions are learned more slowly), though I still don't know why that happens.

Happy to speculate on other grokking phenomena as well (though I don't think there are many others?)

And how do things change as you increase parameter count?

We haven't investigated this, but I'd pretty strongly predict that there mostly aren't major qualitative changes. (The one exception is semi-grokking; there's a theoretical reason to expect it may sometimes not occur, and also in practice it can be quite hard to elicit.)

I expect there would be quantitative changes (e.g. maybe the value of D_crit changes, maybe the time taken to learn the generalising circuit changes). Sufficiently big changes in D_crit might mean you don't see the phenomena on modular addition any more, but I'd still expect to see them in more complicated tasks that exhibit grokking.

I'd be interested in investigations that got into these quantitative questions (in addition to the above, there's also things like "quantitatively, how does the strength of weight decay affect the time for the generalising circuit to be learned?", and many more).

Comment by Rohin Shah (rohinmshah) on Explaining grokking through circuit efficiency · 2023-09-10T08:07:22.219Z · LW · GW

From page 6 of the paper:

Ungrokking can be seen as a special case of catastrophic forgetting (McCloskey and Cohen, 1989; Ratcliff, 1990), where we can make much more precise predictions. First, since ungrokking should only be expected once D < D_crit, if we vary D we predict that there will be a sharp transition from very strong to near-random test accuracy (around D = D_crit). Second, we predict that ungrokking would arise even if we only remove examples from the training dataset, whereas catastrophic forgetting typically involves training on new examples as well. Third, since D_crit does not depend on weight decay, we predict the amount of “forgetting” (i.e. the test accuracy at convergence) also does not depend on weight decay.

(All of these predictions are then confirmed in the experimental section.)

Comment by Rohin Shah (rohinmshah) on Explaining grokking through circuit efficiency · 2023-09-09T09:29:13.670Z · LW · GW

I think that post has a lot of good ideas, e.g. the idea that generalizing circuits get reinforced by SGD more than memorizing circuits at least rhymes with what we claim is actually going on (that generalizing circuits are more efficient at producing strong logits with small param norm). We probably should have cited it; I forgot that it existed.

But it is ultimately a different take and one that I think ends up being wrong (e.g. I think it would struggle to explain semi-grokking).

I also think my early explanation, which that post compares to, is basically as good or better in hindsight, e.g.:

  1. My early explanation says that memorization produces less confident probabilities than generalization. Quintin's post explicitly calls this out as a difference, I continued to endorse my position in the comments. In hindsight my explanation was right and Quintin's was wrong, at least if you believe our new paper.
  2. My early explanation relies on the assumption that there are no "intermediate" circuits, only a pure memorization and pure generalization circuit that you can interpolate between. Again, this is called out as a difference by Quintin's post, and I continued to endorse my position in the comments. Again, I think in hindsight my explanation was right and Quintin's was wrong (though this is less clearly implied by our paper, and I could still imagine future evidence overturning that conclusion, even if I really don't expect that to happen).
  3. On the other hand, my early explanation involves a random walk to the generalizing circuit, whereas in reality it develops smoothly over time. In hindsight, my explanation was wrong, and Quintin's is correct.

Comment by Rohin Shah (rohinmshah) on Short Remark on the (subjective) mathematical 'naturalness' of the Nanda--Lieberum addition modulo 113 algorithm · 2023-09-06T19:08:07.621Z · LW · GW

I agree -- the point is that if you train on addition examples without any modular wraparound (whether you think of that as regular addition or modular addition with a large prime, doesn't super matter), then there is at least some evidence that you get a different representation than the one Nanda et al found.

Comment by Rohin Shah (rohinmshah) on Against Almost Every Theory of Impact of Interpretability · 2023-08-18T07:41:56.140Z · LW · GW

I think I would particularly critique DeepMind and OpenAI's interpretability works, as I don't see how this reduces risks more than other works that they could be doing, and I'd appreciate a written plan of what they expect to achieve.

I can't speak on behalf of Google DeepMind or even just the interpretability team (individual researchers have pretty different views), but I personally think of our interpretability work as primarily a bet on creating new affordances upon which new alignment techniques can be built, or existing alignment techniques can be enhanced. For example:

  • It is possible to automatically make and verify claims about what topics a model is internally "thinking about" when answering a question. This is integrated into debate, and allows debaters to critique each other's internal reasoning, not just the arguments they externally make.
    • (It's unclear how much this buys you on top of cross-examination.)
  • It is possible to automatically identify "cruxes" for the model's outputs, making it easier for adversaries to design situations that flip the crux without flipping the overall correct decision.
    • Redwood's adversarial training project is roughly in this category, where the interpretability technique is saliency, specifically the magnitude of the gradient of the classifier output w.r.t. the token embedding (see the sketch just after this list).
    • (Yes, typical mech interp directions are far more detailed than saliency. The hope is that they would produce affordances significantly more helpful and robust than saliency.)
    • A different theory of change for the same affordance is to use it to analyze warning shots, to understand the underlying cause of the warning shot (was it deceptive alignment? specification gaming? mistake from not knowing a relevant fact? etc).
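
A minimal sketch of the kind of gradient-based saliency I mean (hypothetical interface, not Redwood's actual pipeline):

```python
import torch

def token_saliency(classifier, token_embeddings: torch.Tensor) -> torch.Tensor:
    """Per-token saliency: the magnitude of the gradient of the classifier
    output with respect to each token's embedding (illustrative sketch)."""
    emb = token_embeddings.clone().requires_grad_(True)   # [seq_len, d_model]
    score = classifier(emb)                                # assumed to return a scalar
    score.backward()
    return emb.grad.norm(dim=-1)                           # one saliency value per token
```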

I don't usually try to backchain too hard from these theories of change to work done today; I think it's going to be very difficult to predict in advance what kind of affordances we might build in the future with years' more work (similarly to Richard's comment, though I'm focused more on affordances than principled understanding of deep learning; I like principled understanding of deep learning but wouldn't be doing basic research on interpretability if that was my goal).

My attitude is much more that we should be pushing on the boundaries of what interp can do, and as we do so we can keep looking out for new affordances that we can build. As an example of how I reason about what projects to do, I'm now somewhat less excited about projects that do manual circuit analysis of an algorithmic task. They do still teach us new stylized facts about LLMs like "there are often multiple algorithms at different 'strengths' spread across the model" that can help with future mech interp, but overall it feels like these projects aren't pushing the boundaries as much as seems possible, because we're using the same, relatively-well-vetted techniques for all of these projects.

I'm also more keen on applying interpretability to downstream tasks (e.g. fixing issues in a model, generating adversarial examples) -- not necessarily because I think it will be better than alternative methods today, but rather because I think the downstream task keeps you honest (if you don't actually understand what's going on, you'll fail at the task) and because I think practice with downstream tasks will help us notice which problems are important to solve vs. which can be set aside. This is an area where other people disagree with me (and I'm somewhat sympathetic to their views, e.g. that the work that best targets a downstream task won't tackle fundamental interp challenges like superposition as well as work that is directly trying to tackle those fundamental challenges).

(EDIT: I mostly agree with Ryan's comment, and I'll note that I am considering a much wider category of work than he is, which is part of why I usually say "interpretability" rather than "mechanistic interpretability".)


Separately, you say:

I don't see how this reduces risks more than other works that they could be doing

I'm not actually sure why you believe this. I think on the views you've expressed in this post (which, to be clear, I often disagree with), I feel like you should think that most of our work is just as bad as interpretability.

In particular we're typically in the business of building aligned models. As far as I can tell, you think that interpretability can't be used for this because (1) it is dual use, and (2) if you optimize against it, you are in part optimizing for the AI system to trick your interpretability tools. But these two points seem to apply to any alignment technique that is aiming to build aligned models. So I'm not sure what other work (within the "build aligned models" category) you think we could be doing that is better than interpretability.

(Similarly, based on the work you express excitement about in your post, it seems like you are targeting an endgame of "indefinite, or at least very long, pause on AI progress". If that's your position I wish you would have instead written a post that was instead titled "against almost every theory of impact of alignment" or something like that.)

Comment by Rohin Shah (rohinmshah) on The “no sandbagging on checkable tasks” hypothesis · 2023-08-03T12:11:29.871Z · LW · GW

Yeah, that seems like a reasonable operationalization of "capable of doing X". So my understanding is that (1), (3), (6) and (7) would not falsify the hypothesis under your operationalization, (5) would falsify it, (2) depends on details, and (4) is kinda ambiguous but I tend to think it would falsify it.

Comment by Rohin Shah (rohinmshah) on The “no sandbagging on checkable tasks” hypothesis · 2023-08-02T13:28:02.354Z · LW · GW

Which of (1)-(7) above would falsify the hypothesis if observed? Or if there isn't enough information, what additional information do you need to tell whether the hypothesis has been falsified or not?

Comment by Rohin Shah (rohinmshah) on The “no sandbagging on checkable tasks” hypothesis · 2023-08-01T07:49:40.064Z · LW · GW

The “no sandbagging on checkable tasks” hypothesis: With rare exceptions, if a not-wildly-superhuman ML model is capable of doing some task X, and you can check whether it has done X, then you can get it to do X using already-available training techniques (e.g., fine-tuning it using gradient descent).[1]

I think as phrased this is either not true, or tautological, or otherwise imprecisely specified (in particular I'm not sure what it means for a model to be "capable of" doing some task X -- so far papers define that to be "can you quickly finetune the model to do X"; if you use that definition then it's tautological).

Here are some hypotheticals, all of which seem plausible to me, that I think are useful test cases for your hypothesis (and would likely falsify a reasonable reading of it):

  1. You spend T time trying to prompt a model to solve a task X, and fail to do so, and declare that the model can't do X. Later someone else spends T time trying to prompt the same model to solve X, and succeeds, because they thought of a better prompt than you did.
  2. Like (1), but both you and the other person tried lots of current techniques (prompting, finetuning, chain of thought, etc).
  3. You spend $100 million pretraining a model, and then spend $1,000 of compute to finetune it, and observe it can only get a 50% success rate, so you declare it incapable of doing task X. Later you spend $1 million of compute to finetune it (with a correspondingly bigger dataset), and observe it can now get a 95% accuracy on the task.
  4. Like (3), but later you still spend $1,000 of compute to finetune it, but with a much more curated and high-quality dataset, which gets you from 50% to 95%.
  5. You evaluate GPT-4 using existing techniques and observe that it can't do task X. In 2033, somebody goes back and reevaluates GPT-4 using 2033 techniques (with the same data and similar compute, let's say) and now it does well on task X.
  6. You evaluate a model using existing techniques and observe that it can't do task X. A domain expert comes in and looks at the transcripts of the models, figures out the key things the model is struggling with, writes up a set of guidelines, and puts those in the prompt. The model can now do task X.
  7. Like (6), but the domain expert is a different AI system. (Perhaps one with a larger base model, or perhaps one that was engineered with a lot of domain-specific schlep.)

I think you probably want your hypothesis to instead say something like "if given full autonomy, the AI system could not perform better at task X than what we can elicit with currently-available techniques". (You'd also want to postulate that the AI system only gets to use a similar amount of resources for finetuning as we use.)

(At some point AI will be better at eliciting capabilities from other AI systems than unaided humans; so at that point "currently-available techniques" would have to include ones that leverage AI systems for eliciting capabilities if we wanted the statement to continue to hold. This is a feature of the definition, not a bug.)

Comment by Rohin Shah (rohinmshah) on QAPR 5: grokking is maybe not *that* big a deal? · 2023-07-26T06:49:24.633Z · LW · GW

I expect a delay even in the infinite data case, I think?

Although I'm not quite sure what you mean by "infinite data" here -- if the argument is that every data point will have been seen during training, then I agree that there won't be any delay. But yes, training on the test set (even via "we train on everything so there is no possible test set") counts as cheating for this purpose.

Comment by Rohin Shah (rohinmshah) on QAPR 5: grokking is maybe not *that* big a deal? · 2023-07-25T06:21:57.027Z · LW · GW

Honestly I'd be surprised if you could achieve (2) even with explicit regularization, specifically on the modular addition task.

(You can achieve it by initializing the token embeddings to those of a grokked network so that the representations are appropriately structured; I'm not allowing things like that.)

EDIT: Actually, Omnigrok does this by constraining the parameter norm. I suspect this is mostly making it very difficult for the network to strongly memorize the data -- given the weight decay parameter the network "tries" to learn a high-param norm memorizing solution, but then repeatedly runs into the parameter norm constraint -- and so creates a very strong reason for the network to learn the generalizing algorithm. But that should still count as normal regularization.
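
(To be concrete about what "constraining the parameter norm" means here: the version I have in mind is rescaling the parameters back to a fixed global norm after each optimizer step, roughly as below. I haven't checked this against Omnigrok's exact implementation.)

```python
import torch

def project_to_norm(parameters, target_norm: float) -> None:
    """Rescale all parameters in place so their global L2 norm equals target_norm."""
    params = list(parameters)
    with torch.no_grad():
        total_norm = torch.sqrt(sum((p ** 2).sum() for p in params))
        for p in params:
            p.mul_(target_norm / total_norm)

# Called after every optimizer.step(), this blocks the growth in weight norm
# that a strongly memorizing solution would need.
```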

Comment by Rohin Shah (rohinmshah) on Why was the AI Alignment community so unprepared for this moment? · 2023-07-15T17:14:33.862Z · LW · GW

To add on my thinking in particular: my view for at least a couple of years was that alignment would go mainstream at some point and discourse quality would then fall. I didn't really see a good way for me to make the public discourse much better -- I am not as gifted at persuasive writing as (say) Eliezer, nor are my views as memetically fit. As a result, my plan has been to have more detailed / nuanced conversations with individuals and/or small groups, and especially to advise people making important decisions (and/or make those decisions myself), and that was a major reason I chose to work at an industry lab. I think that plan has fared pretty well, but you're not going to see much evidence of that publicly.

I was, however, surprised by the suddenness with which things changed; had I concretely expected that I would have wanted the community to have more "huge asks" ready in advance. (I was instead implicitly thinking that the strength of the community's asks would ratchet upwards gradually as more and more people were convinced.)

Comment by Rohin Shah (rohinmshah) on Short Remark on the (subjective) mathematical 'naturalness' of the Nanda--Lieberum addition modulo 113 algorithm · 2023-07-02T12:18:04.183Z · LW · GW

In particular, this point of view further (and perhaps almost completely) demystifies the use of the Fourier basis. 

I disagree at least with the "almost completely" version of this claim:

Notice that the operation you want to learn is manifestly a convolution operation, i.e.
$$ e_a * e_b = e_{(a+b) \bmod 113} $$

This also applies to the non-modular addition operation, but I think it's pretty plausible that if you train on non-modular addition (to the point of ~perfect generalization), the network would learn an embedding that converts the "tokenized" representation back into the "magnitude" representation, and then simply adds them as normal.

Some evidence for this:

  • I've heard it claimed that an LLM represented the number of characters in a token as a magnitude, which was used for deciding whether a line of code was > 80 characters (useful for predicting line breaks).
  • This paper trains on non-modular addition and gets this result. (Note however that the paper has a highly unusual setting that isn't representative of typical network training, and arguably the setup is such that you have to get this result: in particular the architecture enforces that the embeddings of the two inputs are added together, which wouldn't work in a Fourier basis. I cite it as evidence that networks do learn magnitude representations when forced to do so.)

It seems like if you believe "when the operation you want to learn is a convolution operation, then you will learn the Fourier basis", you should also believe that you'll get a Fourier basis for non-modular addition on one-hot-encoded numbers, and currently my guess is that that's false.
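
(Concretely, the convolution fact at stake: adding one-hot-encoded numbers mod 113 is cyclic convolution, which the discrete Fourier transform diagonalizes. A quick numerical check:)

```python
import numpy as np

p = 113

def one_hot(i: int) -> np.ndarray:
    v = np.zeros(p)
    v[i % p] = 1.0
    return v

a, b = 47, 92
# Cyclic convolution via the DFT: multiply in frequency space, then transform back.
conv = np.real(np.fft.ifft(np.fft.fft(one_hot(a)) * np.fft.fft(one_hot(b))))
# The result is (up to floating point) the one-hot encoding of (a + b) mod p.
assert np.allclose(conv, one_hot(a + b), atol=1e-8)
```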

Fwiw, I agree that the algorithm is quite "mathematically natural" (indeed, one person came up with the algorithm independently, prompted by "how would you solve this problem if you were a Transformer"), though I feel like the "modular" part is pretty crucial for me (and the story I'd tell would be the one in Daniel's comment).

Comment by Rohin Shah (rohinmshah) on TurnTrout's shortform feed · 2023-07-01T14:32:11.778Z · LW · GW

Since I'm an author on that paper, I wanted to clarify my position here. My perspective is basically the same as Steven's: there's a straightforward conceptual argument that goal-directedness leads to convergent instrumental subgoals, this is an important part of the AI risk argument, and the argument gains much more legitimacy and slightly more confidence in correctness by being formalized in a peer-reviewed paper.

I also think this has basically always been my attitude towards this paper. In particular, I don't think I ever thought of this paper as providing any evidence about whether realistic trained systems would be goal-directed.

Just to check that I wasn't falling prey to hindsight bias, I looked through our Slack history. Most of it is about the technical details of the results, so not very informative, but the few higher-level conversations I think overall support this picture. E.g. here are some quotes (only things I said):

Nov 3, 2019:

I think most formal / theoretical investigation ends up fleshing out a conceptual argument I would have accepted, maybe finding a few edge cases along the way; the value over the conceptual argument is primarily in the edge cases, getting more confidence, and making it easier to argue with

Dec 11, 2019:

my prediction is that agents will behave as though their reward is time-dependent / history-dependent, like humans do

We will deploy agents whose revealed specification / reward if we take the intentional stance towards them are non-Markovian