Are Mixture-of-Experts Transformers More Interpretable Than Dense Transformers?
post by simeon_c (WayZ) · 2022-12-31T11:34:18.185Z · LW · GW · No comments
This is a question post.
Intuitively, I would expect Mixture-of-Experts (MoE) models (e.g. https://arxiv.org/abs/2101.03961) to be a lot more interpretable than dense transformers:
- The complexity of an interconnected system grows much faster than linearly with the number of connected units; it is probably at least quadratic. Studying one system with n units is therefore a priori much harder than studying 5 systems with n/5 units each (see the toy calculation just after this list). In practice, MoE transformers seem to require at least an order of magnitude more parameters than dense transformers for similar capabilities, but I still expect the sum of the complexities of the individual experts to be much lower than the complexity of a single dense transformer.
- MoE forces specialization and thus gives you a strong prior on what a given set of neurons is doing. Having such a prior is probably very helpful for moving faster on mechanistic interpretability.
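As a toy illustration of the first bullet (my own sketch, treating "complexity" as nothing more than the number of pairwise interactions between units, which is of course a big simplification):

```python
# Toy version of the quadratic-complexity intuition in the first bullet.
# "Complexity" is crudely measured as the number of pairwise interactions between units.
def pairwise_interactions(n: int) -> int:
    return n * (n - 1) // 2

n = 10_000
dense = pairwise_interactions(n)          # one interconnected system of n units
moe = 5 * pairwise_interactions(n // 5)   # five independent systems of n/5 units each

print(dense / moe)  # ~5.0: same total unit count, ~5x fewer interactions to study
```

This of course leans on the assumption that the experts can be studied independently and that complexity really does grow roughly quadratically.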
So my question is: do you think MoEs are more interpretable than dense transformers, and is there evidence for or against this (e.g. papers or past LW posts)?
I think this question matters because it doesn't seem implausible to me that MoE models could be on par with dense models in terms of capabilities. If we had strong evidence that they were a lot more interpretable, pursuing or promoting them could be an avenue worth taking. You can see more tentative thoughts on this here (https://twitter.com/Simeon_Cps/status/1609139209914257408?s=20).
Answers
answer by Fabien Roger
If I'm not mistaken, MoE models don't change the architecture that much, because the number of experts is low (10-100), while the number of neurons per expert is still high (100-10k).
This is why I don't think your first argument is powerful: the current bottleneck is interpreting any "small" model well (e.g. GPT-2 small), and dividing the number of neurons of GPT-3 by 100 won't help, because nobody can interpret models that are 100 times smaller either.
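For concreteness, here is roughly what a Switch-style MoE feed-forward block looks like (a toy sketch with made-up sizes, not taken from any particular paper or codebase): a handful of experts, each of which is an ordinary MLP that is still far too large to interpret neuron by neuron.

```python
# Toy Switch-style MoE feed-forward layer (illustrative sizes, top-1 routing,
# no load-balancing loss). Each expert is a normal MLP; only the router is new.
import torch
import torch.nn as nn

class SwitchFFN(nn.Module):
    def __init__(self, d_model: int = 1024, d_hidden: int = 4096, n_experts: int = 8):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)  # scores each token against each expert
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_hidden),
                nn.GELU(),
                nn.Linear(d_hidden, d_model),
            )
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor):              # x: (n_tokens, d_model)
        expert_idx = self.router(x).argmax(dim=-1)   # top-1 routing decision per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                out[mask] = expert(x[mask])          # each token only sees its chosen expert
        return out, expert_idx                       # expert_idx is the coarse "which expert fired" signal
```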
That said, I think your second argument is valid: it might make interp easier for some tasks, especially if the breakdown across experts matches our intuitive human understanding, which could make some behaviors of large MoEs easier to interpret than the same behaviors in a small Transformer.
But I don't expect this kind of understanding to transfer well to understanding Transformers in general, so I'm not sure it's high priority.
↑ comment by simeon_c (WayZ) · 2022-12-31T15:42:40.694Z · LW(p) · GW(p)
> But I don't expect this kind of understanding to transfer well to understanding Transformers in general, so I'm not sure it's high priority.
The point is not necessarily to improve our understanding of Transformers in general, but that if we're pessimistic about interpretability on dense transformers (like markets are, see below), we might be better off speeding up capabilities on architectures we think are a lot more interpretable.
↑ comment by Fabien Roger (Fabien) · 2022-12-31T17:24:18.240Z · LW(p) · GW(p)
I'm not saying that MoEs are more interpretable in general. I'm saying that for some tasks, the high-level view of "which expert is active when and where" may be enough to get a good sense of what is going on.
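Concretely, that high-level view is just the routing decisions, which you can read off almost for free. A minimal sketch, reusing the toy SwitchFFN above, with random activations standing in for a real model's residual stream:

```python
# The coarse "which expert fired for which token" view, on the toy SwitchFFN above.
# The layer is untrained, so the routing here is meaningless; what matters is the
# shape of the analysis, not the output.
import torch

tokens = ["The", "cat", "sat", "on", "the", "mat"]
x = torch.randn(len(tokens), 1024)       # stand-in for per-token activations at this layer
layer = SwitchFFN()
_, expert_idx = layer(x)

for tok, e in zip(tokens, expert_idx.tolist()):
    print(f"{tok:>4} -> expert {e}")
```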
In particular, I'm almost as pessimistic about finding "search", "reward functions", "world models", or "the idea of lying to a human for instrumental reasons" in MoEs as in regular Transformers. The intuition is that, for interpretability purposes, MoE is about as useful as the fact that each attention layer has multiple heads doing "different discrete things" (though the heads act in parallel). The fact that there are multiple heads helps you a bit, but not that much.
This is why I care about the transferability of what you learn when it comes to MoEs.
Maybe MoE + something else could add some safeguards, though (in particular, it might be easier to do targeted ablations on MoEs than on regular Transformers), but I would be surprised if any safety benefit came from "interp on MoE goes brr".
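For what the targeted-ablation idea might look like, here is a crude sketch on the same toy layer (zeroing the chosen expert's output; a real study would of course use a trained model and a downstream metric):

```python
# Crude "targeted ablation" on the toy SwitchFFN above: tokens routed to the
# ablated expert get no contribution from this feed-forward sublayer at all.
import torch

def run_with_expert_ablated(layer, x: torch.Tensor, expert_id: int):
    out, expert_idx = layer(x)
    out = out.clone()
    out[expert_idx == expert_id] = 0.0   # silence everything that went through that expert
    return out, expert_idx

layer, x = SwitchFFN(), torch.randn(6, 1024)
ablated_out, routing = run_with_expert_ablated(layer, x, expert_id=3)
```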
answer by StellaAthena
I think that the answer is no, and that this reflects a common mental barrier when dealing with gradient descent. You would like different experts to specialize in different things in a human-interpretable way, but Adam doesn’t care what you say you want. Adam only cares about what you actually write down in the loss function.
Generally, a useful check when dealing with lines of thought like this is to ask yourself whether your justification for why something should happen would also justify something that is known not to happen. If so, it’s probably flawed.
In this case it does: as far as I can tell, your justification applies just as well to multi-headed attention (as an improvement over single-headed attention). While there have been some attempts to use MHA as an interpretability-magnifying technique, in practice there hasn’t really been much success. Whatever story you tell about why this should work for MoE needs to distinguish MoE from MHA.
> I think this question matters because it doesn't seem implausible to me that MoE models could be on par with dense models in terms of capabilities.
There are two regimes when talking about scaling LLMs, and I think it’s very important to keep them separate when talking about things like this. The literature on scaling laws was written by researchers at a very small number of companies in a very important and non-standard situation: their analyses are predicated on the assumption that using twice as many GPUs for half as long doesn’t impact costs. It’s hard to overstate how few people fall into this regime.
I run EleutherAI, the non-profit org that has trained more and larger multi-billion parameter LLMs than any other non-profit in the world, and have worked on three different models that held the title “largest publicly available GPT-3-like LLM in the world.” I have access to thousands of A100 GPUs to train models if I really want to, and recently won a USG grant for 6 million V100 hours. I generally do not operate in this regime.
The regime that almost everyone finds themselves in is one where one day the VRAM runs out. Maybe it’s at a pair of 3090 Tis, maybe it’s at a v3-8 TPU, maybe it’s at a DGX machine. But one day you lose the ability to halve your runtime by doubling the amount of VRAM you are using without impacting costs.
In this “VRAM-constrained regime,” MoE models (trained from scratch) are nowhere near competitive with dense LLMs. While there has been some success at turning dense models into MoE models with relatively little performance loss, that work isn’t really relevant to your hypothesis without a substantial amount of additional intellectual work. MoE models are egregiously inefficient in terms of performance-per-VRAM, but compensate by being more efficient in terms of performance-per-FLOP.
How egregious, exactly? Well, the first MoE paper I grabbed claims that their 1.1T parameter MoE model performs similarly to a 6.7B parameter dense model, and that their 207B parameter MoE model performs similarly to a 1.3B parameter dense model. To put these numbers in perspective: the (currently unverified) claims NVIDIA is making about quantization on their H100 GPUs would let you fit a 640B parameter model on an 8xH100 (80GB) device. So you can use an entire 8xH100 machine to fit a MoE model, or you can use a single 3090 Ti and get better performance (using LLM.int8).
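The back-of-the-envelope arithmetic behind that comparison (my numbers below: weights only, assuming one byte per parameter, i.e. int8, and ignoring activations, optimizer state, and KV cache):

```python
# Weights-only VRAM at one byte per parameter (int8); ignores activations,
# optimizer state, and KV cache. Parameter counts are the ones quoted above.
GB = 1024**3

def weights_vram_gb(n_params: float, bytes_per_param: int = 1) -> float:
    return n_params * bytes_per_param / GB

print(weights_vram_gb(1.1e12))  # ~1024 GB for the 1.1T-param MoE: a multi-GPU node even when quantized
print(weights_vram_gb(207e9))   # ~193 GB for the 207B MoE that matches a 1.3B dense model
print(weights_vram_gb(6.7e9))   # ~6 GB for the 6.7B dense model: fits on a single consumer GPU
```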
Edit: in a reply to the other answer you say
> I'm not saying that MoEs are more interpretable in general. I'm saying that for some tasks, the high-level view of "which expert is active when and where" may be enough to get a good sense of what is going on.
I had misread your claim, but I think the intent of my response is still valid. Even with this more specific claim, you see people hoping that it holds for MHA and coming up (largely, albeit not entirely) empty. There’s still a significant burden on you to show why your position is better than the same position with the word “MoE” replaced with “MHA.”
↑ comment by 1stuserhere (firstuser-here) · 2023-09-15T11:38:47.091Z · LW(p) · GW(p)
> I think that the answer is no

> In this “VRAM-constrained regime,” MoE models (trained from scratch) are nowhere near competitive with dense LLMs.
Curious whether your high-level thoughts on these topics still hold or have changed.
No comments