Comments
I agree that not using labels is interesting from a data generation perspective, but I expect this to be useful mostly if you have clean pairs of concepts for which it is hard to get labels - and I think this will not be the case for takeover attempts datasets.
About the performance of LAT: for monitoring, we mostly care about correlation - so LAT is worse IID, and it's unclear if LAT is better OOD. If causality leads to better generalization properties, then LAT is dominated by mean difference probing (see the screenshot of Zou's paper below), which is just regular probing with high enough L2 regularization (as shown in the first Appendix of this post).
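For reference, a minimal sketch of what I mean by mean difference probing (illustrative numpy, not code from either paper; `acts_pos`/`acts_neg` are activation matrices for positive/negative examples):

```python
import numpy as np

def mean_difference_probe(acts_pos: np.ndarray, acts_neg: np.ndarray) -> np.ndarray:
    """The probe direction is just the difference of the class means
    (roughly what logistic regression converges to under strong L2 regularization)."""
    direction = acts_pos.mean(axis=0) - acts_neg.mean(axis=0)
    return direction / np.linalg.norm(direction)

def probe_scores(acts: np.ndarray, direction: np.ndarray) -> np.ndarray:
    # Threshold these scalar scores to monitor / classify.
    return acts @ direction
```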
Probes fall within the representation engineering monitoring framework.
LAT (the specific technique they use to train probes in the RePE paper) is just regular probe training, but with a specific kind of training dataset ((positive, negative) pairs) and a slightly more fancy loss. It might work better in practice, but just because of better inductive biases, not because something fundamentally different is going on (so arguments against coup probes mostly apply if LAT is used to train them). It also makes creating a dataset slightly more annoying - especially for concepts like "takeover" and "theft" for which I don't think there are natural (positive, negative) pairs that would allow narrowing down the probe direction with <5 hand-crafted pairs.
I think it's worth speaking directly about probes rather than representation engineering, as I find it more clear (even if probes were trained with LAT).
The link is dead. New link: https://openai.com/research/debate
Interesting!
The technique is cool, though I'm unsure how compute efficient their procedure is.
Their theoretical results seem mostly bogus to me. What's weird is that they have a thought experiment that looks to me like an impossibility result for breaking watermarks (while their paper claims you can always break them):
Suppose that the prompt "Generate output y satisfying the condition x" has N possible high-quality responses y_1, ..., y_N and that some ϵ fraction of these prompts are watermarked. Now consider the prompt "Generate output y satisfying the condition x and such that h(y) < 2^256/N" where h is some hash function mapping the output space into 256 bits (which we identify with the numbers {1, ..., 2^256}). Since the probability that h(y) < 2^256/N is 1/N, in expectation there should be a unique i such that y_i satisfies this condition. Thus, if the model is to provide a high-quality response to the prompt, then it would have no freedom to watermark the output.
In that situation, it seems to me that the attacker can't find another high-quality output if it doesn't know what the N high-quality answers are, and is therefore unable to remove the watermark without destroying quality (and the random walk won't work since the space of high-quality answers is sparse).
I can find many other examples where imbalance between attackers and defenders means that the watermarking is unbreakable. I think that only claims about what happens on average with realistic tasks can possibly be true.
I think the links to the playground are broken due to the new OAI playground update.
What I'd like to see in the coming posts:
- A stated definition of the goals of the model psych research you're doing
- An evaluation of the hypotheses you're presenting in terms of those goals
- Controlled measurements (not only examples)
For example, from the previous post on stochastic parrots, I infer that one of your goals is predicting what capabilities models have. If that is the case, then the evaluation should be "given a specific range of capabilities, predict which of them models will and won't have", and the list should be established before any measurement is made, and maybe even before a model is trained/released (since this is where these predictions would be the most useful for AI safety; I'd love to know if the deadly model is GPT-4 or GPT-9).
I don't know much about human psych, but it seems to me that it is most useful when it describes some behavior quantitatively with controlled predictions (à la CBT), and not when it does qualitative analysis based on personal experience with the subject (à la Freud).
Unsure what you mean by "Then the model completes the rest, and again is trained to match user beliefs".
What happens in the experimental group:
- At train time, we train on "{Bio}{Question}" -> "{introduction[date]}Answer:{Final answer}"
- At eval time, we prompt it with {Bio}{Question}, and we use the answer provided after "Answer:" (and we expect it to generate introduction[date] before that on its own)
Is that what you meant?
(The code for this experiment is in this file: https://github.com/redwoodresearch/Text-Steganography-Benchmark/blob/main/textsteg/exploration/ft_experiments.py; in particular, see how the "eval_model" function does not depend on which model was used.)
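For concreteness, a minimal sketch of the formatting I have in mind (purely illustrative; names and strings are made up, the real implementation is in the linked file):

```python
def make_train_example(bio: str, question: str, date: str, final_answer: str) -> tuple[str, str]:
    # Train time: the model sees "{Bio}{Question}" and is trained to output
    # "{introduction[date]}Answer:{Final answer}".
    introduction = {"2020": "It is 2020. ", "2023": "It is 2023. "}  # hypothetical mapping
    return bio + question, introduction[date] + "Answer:" + final_answer

def make_eval_prompt(bio: str, question: str) -> str:
    # Eval time: prompt with "{Bio}{Question}" only; the model should generate the
    # introduction on its own, and we only use what follows "Answer:".
    return bio + question

def extract_answer(generation: str) -> str:
    return generation.split("Answer:")[-1].strip()
```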
Fully agree that there are strong theoretical arguments for CoT expressiveness. Thanks for the detailed references!
I think the big question is whether this expressiveness is required for anything we care about (e.g. the ability to take over), and how many serial steps are enough. (And here, I think that the number of serial steps in human reasoning is the best data point we have.)
Another question is whether CoT & natural language are in practice able to take advantage of the increased number of serial steps: they do in some toy settings (counting coin flips, ...), but CoT barely improves performance on MMLU and common sense reasoning benchmarks. I think CoT will eventually matter a lot more than it does today, but it's not completely obvious.
Thanks!
- Not confident for current Transformers (CoT currently only helps for a limited set of problems, and it doesn't help as much as you would expect from the increased number of serial steps), more confident for future ones, mostly by analogy with the human brain, which requires many serial steps to do anything scary (while doing a lot of parallel computation). Human neurons fire in about ~1ms, so a current Transformer does a forward pass in ~0.5s of human thinking-time, and since people scale the number of layers slowly, I'd be surprised if future Transformers did more than ~10s thinking-time forward passes. (Rough arithmetic sketched after this list.)
- True RNNs can do many serial steps, and I would find that scary if they worked. RWKVs are not true RNNs, and they do roughly as many serial steps as regular Transformers (they are analogous to their "train mode", which only uses as many serial steps and FLOPs as regular Transformers).
- If the channel capacity of the hidden information through paraphraser is 3 bits/KB, then you can't pass more information than that through it, no matter how you encode it. The main challenge is actually getting the capacity down (in particular, I think that better steganography schemes might get much more than 3 bits/KB through the context-aware paraphraser).
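(Rough arithmetic behind the thinking-time comparison in the first bullet above; the ~500 serial steps per forward pass is my own illustrative guess, not a measured number:)

$$t_{\text{thinking}} \approx N_{\text{serial}} \times t_{\text{neuron}} \approx 500 \times 1\,\text{ms} \approx 0.5\,\text{s},$$

and scaling the number of serial steps by ~20x only gets you to ~10 s.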
Yes, this is assuming you would reward the agent based on whether the MTD tells you if the diamond is there or not. I don't see how this clearly incentivizes the model to make tampering happen more often in cases where the diamond is present - I would expect such behavior to create more false negatives (the diamond is there, but the predictor thinks it is not), which is penalized since the agent is wrongfully punished for not getting a diamond, and I don't see how it would help to create false positives (the diamond is not there, but the predictor thinks it is).
Yes, it's a typo. We meant "10% are fake positives". Thanks for catching it, I'll update the paper.
That's an exciting experimental confirmation! I'm looking forward to more predictions like those. (I'll edit the post to add it, as well as future external validation results.)
You made me curious, so I ran a small experiment. Using the sum of absolute cosine similarities as the loss, initializing randomly on the unit sphere, and optimizing until convergence with LBFGS (with strong Wolfe line search), here are the maximum cosine similarities I get (average and std over 5 runs, since there was a bit of variation between runs):
It seems consistent with the exponential trend, but it also looks like you would need dim >> 1000 to get any significant boost in the number of vectors you can fit with cosine similarity < 0.01, so I don't think this happens in practice.
My optimization might have failed to converge to the global optimum though; this is not a nicely convex optimization problem (but the fact that there is little variation between runs is reassuring).
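A minimal sketch of the kind of script I ran (not the exact code; the step count and seed handling are illustrative):

```python
import torch

def max_abs_cos_sim(n_vectors: int, dim: int, n_steps: int = 100, seed: int = 0) -> float:
    """Minimize the sum of |cos sim| over pairs of unit vectors, then report the
    maximum pairwise |cos sim| at convergence."""
    torch.manual_seed(seed)
    x = torch.randn(n_vectors, dim)
    x = (x / x.norm(dim=-1, keepdim=True)).requires_grad_(True)  # random init on the sphere
    opt = torch.optim.LBFGS([x], line_search_fn="strong_wolfe")

    def closure():
        opt.zero_grad()
        u = x / x.norm(dim=-1, keepdim=True)  # keep the vectors on the unit sphere
        cos = u @ u.T
        loss = (cos.abs() - torch.eye(n_vectors)).sum() / 2  # sum of |cos sim| over distinct pairs
        loss.backward()
        return loss

    for _ in range(n_steps):
        opt.step(closure)

    with torch.no_grad():
        u = x / x.norm(dim=-1, keepdim=True)
        cos = u @ u.T - torch.eye(n_vectors)
        return cos.abs().max().item()
```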
I agree with most of what you say here, and I agree that the definition of sycophancy is a bit vague.
I think I agree that if the dumb models always tried to kill us, I'd be more afraid about future models (though the evidence isn't very strong because of the gap between dumb and smarter models, and because for smarter models I don't expect the open-web-text prior, which talks a lot about AI takeover, to be as important as for dumber ones). But in the experiments described in this post (not your comment) there is a ton of spoonfeeding, and I care about the model's "will" when there is no spoonfeeding. Experiments at spoonfeeding > 0 aren't very informative.
I am excited about easy-to-hard generalization.
I just don't call that model organism, which I thought to be about the actual catastrophic failures (Deceptive inner misalignment - where the AI plays the training game, and Sycophantic reward hacking - the AI takes over the reward channel or is aggressively power-seeking). Maybe I'm using a definition which is too narrow, but my criticism was mostly targeted at the narrow definition of model organism.
Note: I don't think I agree that your experiment tells you much about the inductive biases of GPT-4 to "want to" take over. I think it mostly tells you about the sycophantic dangerous capabilities, and how restrictive your fine-tuning set is in the kind of behaviors that can slip through, and I think those are the two important things you learn from your experiment. Measuring how much AIs "want to" do stuff, as opposed to doing it because they "have to", is extremely tricky, and I think you get very little signal from experiments like the one you suggested.
(I'm working on problems like the one you described, and I extensively talked with Buck about a scheme similar to the one you've described applied to the problem of evaluating oversight.)
I like this post and this kind of research, and maybe it will produce useful scary demos, but I think the science of deceptive and sycophantic models is way less juicy than you claim for two reasons:
- The main thing you want to learn about deceptive models is how they can arise despite no training or instruction to be deceptive. I think that spoonfeeding destroys most of the value of the model organism for learning about how deception arises, and I expect that you don't get any deception you can measure with spoonfeeding when spoonfeeding tends to 0. A salient example: if you do "context distillation" and tell the model clearly it's in training or deployment, you might as well do distillation on a bigram table, and it would "teach you about deceptiveness of bigram tables" (which are doing deception according to your use of the word). I think you should always think of bigram tables whenever you make claims about "models being deceptive".
- I think how bad behavior is preserved by training can be studied much more cheaply via backdoor detection and password-locked models. I don't think those alternatives are worse science, because the inductive biases of backdoored/password-locked models are probably closer to those of actual deceptive models than the inductive biases of models which have a secret scratchpad to plan rebellion (except if models do steganographic encoding on every output, which I hope we can avoid).
I think I'm reasonably excited about 2 as an implementation of password-locked models, as long as you keep the routing differentiable (one of the key hopes is that you can train models, which is not the case for naive routing if it uses a threshold), for the two reasons you mentioned, especially for worst-casy experiments. Password-locked models are still probably better if you want to study training dynamics and inductive biases, since those are likely to depend on such a massive architecture change. I also think that the threat feels more real if the blue team doesn't know if it's working with a regular or a password-locked model (though this might be achieved with the 2-model implementation if you hide enough info).
Yes that's right, I'm mostly interested in cases where we can't generate high quality answers ourselves (at least in some hard part of the distribution), because behavior evaluations look easy to me when you have high quality answers (see the subsection on behavioral I.I.D fine-tuning evaluations): in that case, you don't need the password, you can split your data, fine-tune on the train set and evaluate on the test set. For example, I think it would be dangerous to have a model which could build bioweapons if slightly fine-tuned on a relevant dataset (especially if there was a non-negligible probability that the weights would be leaked).
But I think I see where you're coming from: the detection task does look difficult even if we have high-quality answers, and maybe for the problem of model misuse and capabilities benchmark gaming, detecting the password matters more than checking whether the model passes behavioral I.I.D fine-tuning evals. Also, finding an actual prompt where the model builds bioweapons well is probably more compelling to policy makers than showing it can build some if fine-tuned to do so.
Thanks for your ideas!
Do you think these methods would naturally transfer from easy queries where we have high quality answers to hard queries where we don't (but where we can check that an answer is high quality)? In particular, I expect you would have a rough time using techniques like Universal Prompt Optimization, since you don't have any gradient on hard queries until you actually stumble on some high-quality answers.
Probes initialized like Collin does in the zip file: a random spherical initialization (take a random Gaussian, and normalize it).
Note that I don't include a term for "compute inefficiency relative to the brain" (which kicks in after the 10^15 estimate in Ajeya's report). This is both because I include this inefficiency in the graph (there are ranges for 1% & 100x) and because I ignore algorithmic efficiency improvements. The original report downweights the compute efficiency of human-made intelligence by looking at how impressive current algorithms look compared to the brain, while I make the assumption that human-made intelligence and human brains will probably look about as impressive when we have the bare metal FLOPs available. So if you think that current algorithms are impressive, it matters much less for my estimation than for Ajeya's!
This is why my graph already starts at 10^24 FLOPs, right in the middle of the "lifetime anchor" range! (Note: GPT-3 is actually 2x less than 10^24 FLOPs, and PaLM is 2x more than that, but I have ~1 OOM uncertainties around the estimate for the lifetime compute requirements anyway.)
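(For reference, the rough arithmetic behind the lifetime-anchor range, using brain estimates of ~10^14 to 10^16 FLOP/s and a learning lifetime on the order of 10^9 seconds, i.e. roughly 30 years:)

$$10^{14\text{--}16}\ \text{FLOP/s} \times \sim 10^{9}\ \text{s} \approx 10^{23\text{--}25}\ \text{FLOP},$$

which is why 10^24 FLOP sits in the middle of that range.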
Yes!
Joe Carlsmith estimates that the brain uses roughly between 10^14 and 10^16 FLOP/s
Do you have interesting tasks in mind where expert systems are stronger and more robust than a 1B model trained from scratch with GPT-4 demos and where it's actually hard (>1 day of human work) to build an expert system?
I would guess that it isn't the case: interesting hard tasks have many edge cases which would make expert systems break. Transparency would enable you to understand the failures when they happen, but I don't think that a stack of ad-hoc rules would be more robust than a model trained from scratch to solve the task. (The tasks I have in mind are sentiment classification and paraphrasing. I don't have enough medical knowledge to imagine what the expert system would look like for medical diagnosis.) Or maybe you have in mind a particular way of writing expert systems which ensures that the stack of ad-hoc rules doesn't interact in weird ways that produce unexpected results?
I see what you mean. This experiment is definitely using the same kind of technique as the meta learning paper. Given that I had already seen this paper, it's likely I was influenced by it when designing this experiment. But I didn't make the connection before, thanks for making it!
If I had to bet about a specific mechanism, I wouldn't bet on something as precise as the toy setup: in particular, the toy setup has perfect generalization while the setup I expose clearly hasn't.
Because of the "regular" / "reverse" prefix, there literally exists a direction which does +1 / -1 (regular token - reverse token) at the first position. The core difference with what I expect is the "times" node, which probably does not exist as is in the real network. I don't have realistic intuitions about how something like the "times" node would appear - it's natural for useful-negatives, but the slight generalization to held-out-negatives is weird. Maybe it's just "leakage" to adjacent dimensions?
If there is anything like a key-value store involved here, it's likely to be super cursed, since it does multi-token memorization. If you ever wanted to do interp on that, I would suggest that you pick 1 or 2 token long passwords and a much longer "alphabet". I don't know if that gives you the generalization you want, though; maybe having complex memorization is required so that the negative memorization of useful-negatives & held-out-negatives during DPO has some common structure.
I hadn't looked at their math too deeply, and it indeed seems that an assumption like "the bad behavior can be achieved after an arbitrarily long prompt" is missing (though maybe I missed a hypothesis where this implicitly shows up). Something along these lines is definitely needed for the summary I made to make any sense.
I find this paper mostly misleading. It assumes that the LLM is initially 99% certain to be friendly and 1% certain to be "malicious", and that "friendly" and "malicious" can be distinguished if you have a long enough prompt (more precisely, at no point have you gathered so much evidence for or against being malicious that your probability would not go up and down based on new information). Assuming those, it's pretty obvious that the LLM will say bad things if you have a long enough prompt.
The result is not very profound, and I like this paper mostly as a formalization of simulators (by Janus). (It takes the formalization of simulators as a working assumption, rather than proving it.) For example, there are cool confidence bounds if you use a slightly more precise version of the assumptions.
So the paper is about something that already knows how to "say bad things", but just doesn't have a high prior on it. It's relevant to jailbreaks, but not to generating negatively reinforced text (as explained in the related work subsection about jailbreaks).
I agree, there's nothing specific to neural network activations here. In particular, the visual intuition that if you translate the two datasets until they have the same mean (which is weaker than mean ablation), you will have a hard time finding a good linear classifier, doesn't rely on the shape of the data.
But it's not trivial or generally true either: the paper I linked to gives some counterexamples of datasets where mean ablation doesn't prevent you from building a classifier with >50% accuracy. The rough idea is that the mean is easily swayed by outliers, but outliers don't matter much if you want to produce high-accuracy classifiers. Therefore, what you want is something like the median.
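A toy 1-D illustration of that point (my own example, not one from the paper):

```python
import numpy as np

# Both classes have mean 0, so translating them to a common mean (or ablating the
# mean-difference direction) changes nothing, yet a simple threshold classifier gets
# 90% accuracy: the mean is dragged around by a few outliers that the classifier ignores.
a = np.array([-1.0] * 90 + [9.0] * 10)   # class A: mostly -1, a few large outliers
b = np.array([+1.0] * 90 + [-9.0] * 10)  # class B: mostly +1, a few large outliers
print(a.mean(), b.mean())                     # 0.0 0.0 -> the mean-difference direction vanishes
print(((a < 0).mean() + (b > 0).mean()) / 2)  # 0.9 -> still highly linearly separable
```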
I played around with a small python script to see what happens in slightly more complicated settings.
Simple observations:
- no matter the distribution, if noise has a fatter tail / is bigger than signal, you're screwed (you can't trust the top studies at all);
- no matter the distribution, if signal has a fatter tail / is bigger than noise, you're in business (you can trust the top studies);
- in the critical regime where both distributions are the same, expected quality = performance / 2 seems to be true;
- if noise amount is correlated with signal in a simple proportional way, then you're in business, because the high noise studies will also be the best ones. (But this is a weird assumption...)
This would mean the only critical information is often "is noise bigger than signal - in particular around the tails". If noise is smaller than signal (even by a factor of 2), then you can probably trust the RCTs blindly, no matter the shape of the underlying distributions, except in weird circumstances.
The practical takeaways are:
- ignore everything that has probably higher noise than signal
- take seriously everything that has probably bigger signal than noise and don't bother with corrective terms
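A minimal sketch of the kind of script mentioned above (the distributions, study count and top-k are illustrative):

```python
import numpy as np

def top_study_quality(signal_dist, noise_dist, n_studies=100_000, top_k=100, seed=0):
    """Measured effect = true signal + noise; compare the average true signal of the
    top-measured studies to their average measured effect."""
    rng = np.random.default_rng(seed)
    signal = signal_dist(rng, n_studies)
    noise = noise_dist(rng, n_studies)
    measured = signal + noise
    top = np.argsort(measured)[-top_k:]  # the studies with the best measured results
    return signal[top].mean(), measured[top].mean()

normal = lambda rng, n: rng.normal(0, 1, n)
fat = lambda rng, n: rng.standard_t(3, n)  # a fatter-tailed alternative

print(top_study_quality(normal, normal))  # same-scale regime: quality ~ performance / 2
print(top_study_quality(normal, fat))     # noise fatter-tailed than signal: top studies untrustworthy
print(top_study_quality(fat, normal))     # signal fatter-tailed than noise: top studies trustworthy
```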
That's because a classifier only needs to find a direction which correctly classifies the data, not a direction which makes other classifiers fail. A direction which removes all linearly available information is not always as good as the direction found with LR (at classification).
Figure 2 from the paper which introduced mean difference as a way to remove linear information might help.
Yep, high ablation redundancy can only exist when features are nonlinear. Linear features are obviously removable with a rank-1 ablation, and you get them by running CCS/Logistic Regression/whatever. But linear features are not what I care about here, since that's not the shape the features actually have (the directions Logistic Regression & CCS find don't remove the linearly available information).
The point is, the reason why CCS fails to remove linearly available information is not because the data "is too hard". Rather, it's because the feature is non-linear in a regular way, which makes CCS and Logistic Regression suck at finding the direction which contains all linearly available data (a direction which does exist in the context of "truth", just as it does in the context of gender and all the other datasets on which RLACE has been tried).
I'm not sure why you don't like calling this "redundancy". A meaning of redundant is "able to be omitted without loss of meaning or function" (Lexico). So ablation redundancy is the normal kind of redundancy, where you can remove sth without losing the meaning. Here it's not redundant, you can remove a single direction and lose all the (linear) "meaning".
"Redundancy" depends on your definition, and I agree that I didn't choose a generous one.
Here is an even simpler example than yours: positive points are all at (1...1) and negative points are all at (-1...-1). Then all canonical directions are good classifiers. This is "high correlation redundancy" with the definition you want to use. There is high correlation redundancy in our toy examples and in actual datasets.
What I wanted to argue against is the naive view that you might have, which could be "there is no hope of finding a direction which encodes all information because of redundancy", and which I would call "high ablation redundancy". There is no high ablation redundancy either in our toy examples (in mine, all the information is along (1...1)) or in actual datasets.
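A small numerical version of the two claims (my own sketch; the sizes and noise level are arbitrary):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

d, n = 16, 200
rng = np.random.default_rng(0)
y = rng.integers(0, 2, n)
x = np.where(y[:, None] == 1, 1.0, -1.0) * np.ones((n, d))  # +(1..1) for one class, -(1..1) for the other
x += 0.1 * rng.normal(size=(n, d))                          # small noise so the fits are well-posed

# High correlation redundancy: every single canonical direction classifies well on its own.
print([LogisticRegression().fit(x[:, [i]], y).score(x[:, [i]], y) for i in range(3)])

# No high ablation redundancy: projecting out the single direction (1..1)
# removes essentially all the linearly available information.
u = np.ones(d) / np.sqrt(d)
x_abl = x - np.outer(x @ u, u)
print(LogisticRegression().fit(x_abl, y).score(x_abl, y))   # close to chance
```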
I think that a (linear) ensemble of linear probes (trained with Logistic Regression) should never be better than a single linear probe (otherwise the optimizer would have just found this combined linear probe instead). Therefore, I don't expect that ensembling 20 linear CCS probe will increase performance much (and especially not beyond the performance of supervised linear regression).
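(The reasoning in one line: any linear combination of linear probes is itself a linear probe, so it already lies in the space logistic regression optimizes over:)

$$\sum_i \alpha_i \big(w_i^\top x + b_i\big) = \Big(\sum_i \alpha_i w_i\Big)^\top x + \sum_i \alpha_i b_i$$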
Feel free to run the experiment if you're interested in it!
Mmh, interesting, I hadn't thought of it from the need-for-robustness angle.
Do you think it matters because LLMs are inherently less robust than humans, and therefore you can't just replace humans by general-ish things? Some companies do work as a combination of micro-entities which are extremely predictable and robust, and the more predictable/robust the better. Do you think that every entity that produces value has to follow this structure?
I disagree with what you said about statelessness, because the AIs with translucent thoughts I describe are mostly stateless. The difference between CAIS & AIs with translucent thoughts is not the possibility of a state, it's the possibility of joint training & long "free" text generations, which makes hidden coordination & non-micro thinking possible.
I agree that ideas are similar, and that CAIS are probably more safe than AI with translucent thoughts (by default).
What I wanted to do here is to define a set of AIs broader than CAIS and Open Agents, because I think that the current trajectory of AI does not point towards strict open agents, i.e. small dumb AIs trained / fine-tuned independently, used jointly, and doing small bounded tasks (for example, it does not generate 10,000 tokens to decide what to do next, then launch 5 API calls and a python script, then prompt a new LLM instance on the results of those calls).
AIs with translucent thoughts would include other kinds of systems I think are probably much more competitive, like systems of LLMs trained jointly using RL / your favorite method, or LLMs producing long generations with a lot of "freedom" (to such an extent that considering it a safe microservice would not apply).
I think there is a 20% chance that the first AGIs have translucent thoughts. I think there is 5% chance that they are "strict open agents". Do you agree?
I'm very uncertain about the quality of my predictions; I do this mainly to see how good my intuitions are, and I don't believe that they are good.
See https://docs.google.com/document/d/1SYm8_-qdugLFGUcnjI08F9X4DLhWIoziHcSXRLsZ1sI/edit?usp=sharing
I like this post, and I think these are good reasons to expect AGI around human level to be nice by default.
But I think this doesn't hold for AIs that have large impacts on the world, because niceness is close to radically different and dangerous things to value. Your definition (Doing things that we expect to fulfill other people’s preferences) is vague, and could be misinterpreted in two ways:
- Present pseudo-niceness: maximize the expected value of the fulfillment of people's preferences across time. A weak AI (or a weak human) being present pseudo-nice would be indistinguishable from someone being actually nice. But something very agentic and powerful would see the opportunity to influence people's preferences so that they are easier to satisfy, and that might lead to a world of people who value suffering for the glory of their overlord or sth like that.
- Future pseudo-niceness: maximize the expected value of all future fulfillment of people's initial preferences. Again, this is indistinguishable from niceness for weak AIs. But this leads to a world which locks in all the terrible present preferences people have, which is arguably catastrophic.
I don't know how you would describe "true niceness", but I think it's neither of the above.
So if you train an AI to develop "niceness", because AIs are initially weak, you might train niceness, or you might get one of the two pseudo-nicenesses I described. Or something else entirely. Niceness is natural for agents of similar strengths because lots of values point towards the same "nice" behavior. But when you're much more powerful than anyone else, the target becomes much smaller, right?
Do you have reasons to expect "slight RL on niceness" to give you "true niceness" as opposed to a kind of pseudo-niceness?
I would be scared of an AI which has been trained to be nice if there was no way to see if, when it got more powerful, it tried to modify people's preferences / it tried to prevent people's preferences from changing. Maybe niceness + good interpretability enables you to get through the period where AGIs haven't yet made breakthroughs in AI Alignment?
I'm interested to know where this research will lead you!
A small detail: for experiments on LMs, did you measure the train or the test loss? I expect this to matter since I expect activations to be noisy, and I expect that overfitting noise can use many sparse features (except if the number of data points is extremely large relative to the number of parameters).
I would also be interested to test a bit more whether this method works on toy models which clearly don't have many features, such as a mixture of a dozen Gaussians, or random points in the unit square (where there is a lot of room "in the corners"), to see if this method produces strong false positives. Layer 0 is also a baseline, since I expect embeddings to have fewer features than activations in later layers, though I'm not sure how many features you should expect in layer 0. I hope you'll find what's wrong with layer 0 in your experiments!
My bad. My intuitions about eigenvectors misled me, and I now disagree with my comment. zfurman, on EleutherAI, gave me a much better frame to see what SVD does: SVD helps you find where the action happens in the sense that it tells you where it is read and where it is written (in decreasing order of importance), by decomposing the transformation into a sum of [dot product with a right singular vector, scale by the corresponding singular value, multiply by the corresponding left singular vector]. This captures a significant amount of "where the action happens", and is a much better frame than the "rotate, scale, rotate" frame I had previously learned.
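In formulas, this frame is just the outer-product form of the SVD:

$$W = \sum_i \sigma_i\, u_i v_i^\top, \qquad W x = \sum_i \sigma_i \langle v_i, x \rangle\, u_i,$$

so each term reads the input off along $v_i$ and writes along $u_i$, with importance $\sigma_i$.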
I agree with everything you wrote here. Thank you for taking the time to write this down.
I think there are potential mitigations strategies like:
- Removing LLM-generated data from training sets (there might be good detectors, or you could only use old data) and not using RLHF;
- Preventing models from using steganography enough that you could find provably low bounds on how much data it can send itself. For example, you could tamper with model generations (and pay the alignment tax associated with it);
- Finding steganography detection schemes (and maybe train against them?);
- Learning to decipher steganographic encodings (this feels really hard, but tractable, since this is a game humans can play by purposefully adding steganography to their models).
I think the above are *really hard*, but still easy compared to fully solving mechanistic interpretability / solving the alignment problem, which is why I'm excited about the research directions above.
Do you think there are reasons to expect all the strategies I mentioned to be harder than alternative alignment research directions?
Here I'm using "feature" only with its simplest meaning: a direction in activation space. A truth-like feature only means "a direction in activation space with low CCS loss", which is exactly what CCS enables you to find. By the example above, I show that there can be exponentially many of them. Therefore, the theorems above do not apply.
Maybe requiring directions found by CCS to be "actual features" (satisfying the conditions of those theorems) might enable you to improve CCS. But I don't know what those conditions are.
It's exciting to see a new research direction which could have big implications if it works!
I think that Hypothesis 1 is overly optimistic:
Hypothesis 1: GPT-n will consistently represent only a small number of different “truth-like” features in its activations.
[...]
[...] 1024 remaining perspectives to distinguish between
A few thousand features is the optimistic number of truth-like features. I argue below that it's possible, and likely, that there are 2^hundreds truth-like features in LLMs.
Why it's possible to have 2^hundreds truth-like features
Let's say that your dataset of activations is composed of d-dimensional one-hot vectors and their element-wise opposites. Each of these represents a "fact", and negating a fact gives you the opposite vector. Then any feature θ in {-1, 1}^d is truth-like (up to a scaling constant): for each "fact" x = ±e_i (a one-hot vector multiplied by -1 or 1), ⟨θ, x⟩ = ±θ_i, and for its opposite fact -x, ⟨θ, -x⟩ = ∓θ_i, so the pair of predictions is perfectly consistent and confident. This gives you 2^d features which are all truth-like.
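A minimal numerical check of this construction (my own sketch, not the CCS implementation from the paper; the scale factor plays the role of the scaling constant):

```python
import numpy as np

def ccs_loss(theta, x_pos, x_neg, scale=10.0):
    """CCS-style loss (consistency + confidence) for a candidate direction theta,
    using sigmoid probabilities and no bias term."""
    sigmoid = lambda z: 1 / (1 + np.exp(-z))
    p_pos = sigmoid(scale * x_pos @ theta)
    p_neg = sigmoid(scale * x_neg @ theta)
    consistency = (p_pos - (1 - p_neg)) ** 2
    confidence = np.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()

d = 10
facts = np.concatenate([np.eye(d), -np.eye(d)])  # "facts": one-hot vectors and their opposites
negations = -facts                               # negating a fact flips the vector

rng = np.random.default_rng(0)
for _ in range(5):
    theta = rng.choice([-1.0, 1.0], size=d)      # any of the 2^d sign vectors
    print(ccs_loss(theta, facts, negations))     # ~0 for every choice of theta
```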
Why it's likely that there are 2^hundreds truth-like features in real LLMs
I think the encoding described above is unlikely. But in a real network, you might expect the network to encode groups of facts like "facts that Democrats believe but not Republicans", "climate change is real vs climate change is fake", ... When late in training it finds ways to use "the truth", it doesn't need to build a new "truth-circuit" from scratch, it can just select the right combination of groups of facts.
(An additional reason for concern is that in practice you find "approximate truth-like directions", and there can be many more approximate truth-like directions than truth-like directions.)
Even if hypothesis 1 is wrong, there might be ways to salvage the research direction. Thousands of bits of information would be enough to distinguish between 2^thousands truth-like features.
The data can be found here
Link is broken. Updated link: https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/hhh_alignment
If you sum enough Gaussians, you can get close to any distribution you want. I'm not sure what the information behind "it's Gaussian" is in this context. (It clearly doesn't look like a mixture of a few Gaussians...)
I'm surprised you put the emphasis on how Gaussian your curves are, while your curves are much less Gaussian than you would naively expect if you agreed with the "LLMs are a bunch of small independent heuristics" argument.
Even ignoring outliers, some of your distributions don't look like Gaussian distributions to me. In GeoGebra, exponential decays fit well; Gaussians don't.

I think your headlines are misleading, and that you're providing evidence against "LLMs are a bunch of small independent heuristics".
I'm not saying that MoE are more interpretable in general. I'm saying that for some tasks, the high level view of "which expert is active when and where" may be enough to get a good sense of what is going on.
In particular, I'm almost as pessimistic about finding "search", or "reward functions", or "world models", or "the idea of lying to a human for instrumental reasons" in MoEs as in regular Transformers. The intuition behind that is that MoE is about as useful for interp as the fact that there are multiple attention heads per attention layer doing "different discrete things" (though they do things in parallel). The fact that there are multiple heads helps you a bit, but not that much.
This is why I care about transferability of what you learn when it comes to MoEs.
Maybe MoE + sth else could add some safeguards though (in particular, it might be easier to do targeted ablations on MoE than on regular Transformers), but I would be surprised if any safety benefit came from "interp on MoE goes brr".
If I'm not mistaken, MoE models don't change the architecture that much, because the number of experts is low (10-100), while the number of neurons per expert is still high (100-10k).
This is why I don't think your first argument is powerful: the current bottleneck is interpreting any "small" model well (i.e. GPT2-small), and dividing the number of neurons of GPT-3 by 100 won't help because nobody can interpret models that are 100 times smaller.
That said, I think your second argument is valid: it might make interp easier for some tasks, especially if the breakdown per expert is the same as in our intuitive human understanding, which might make interpreting some behaviors of large MoEs easier than interpreting them in a small Transformer.
But I don't expect these kinds of understanding to transfer well to understanding Transformers in general, so I'm not sure it's high priority.
Overall, there doesn't seem to be any clear trend in what I've tried. Maybe it would be clearer if I had larger benchmarks. I'm currently working on finding a good large one; tell me if you have any ideas.
The logit lens direction (she-he) seems to work slightly better on average in smaller models. Larger models can exhibit transitions between regions where the causal direction changes radically.
I'm surprised that even small models generalize as well as larger ones on French.
All experiments are on gender. Layer numbers are given as a fraction of the total number of layers. "mean diff" is the direction corresponding to the difference of means between positive and negative labels, which in practice is pretty close to RLACE while being extremely cheap to compute.
I launched some experiments. I'll keep you updated.