[AN #173] Recent language model results from DeepMind
post by Rohin Shah (rohinmshah) · 2022-07-21T02:30:02.115Z · LW · GW · 9 comments
HIGHLIGHTS
Scaling Language Models: Methods, Analysis & Insights from Training Gopher (Jack W. Rae et al) (summarized by Rohin): This paper details the training of the Gopher family of large language models (LLMs), the biggest of which is named Gopher and has 280 billion parameters. The algorithmic details are very similar to the GPT series (AN #102): a Transformer architecture trained on next-word prediction. The models are trained on a new data distribution that still consists of text from the Internet but in different proportions (for example, book data is 27% of Gopher’s training data but only 16% of GPT-3’s training data).
Like other LLM papers, there are tons of evaluations of Gopher on various tasks, only some of which I’m going to cover here. One headline number is that Gopher beat the state of the art (SOTA) at the time on 100 out of 124 evaluation tasks.
The most interesting aspect of the paper (to me) is that all models in the Gopher family were trained on the same number of tokens, thus allowing us to study the effect of scaling up model parameters (and thus training compute) while holding data constant. Some of the largest benefits of scale were seen in the Medicine, Science, Technology, Social Sciences, and Humanities task categories, while scale has little effect or even a negative effect in the Maths, Logical Reasoning, and Common Sense categories. Surprisingly, we see improved performance on TruthfulQA (AN #165) with scale, even though the TruthfulQA benchmark was designed to show worse performance with increased scale.
We can use Gopher in a dialogue setting by prompting it appropriately. The prompt specifically instructs Gopher to be “respectful, polite, and inclusive”; it turns out that this significantly helps with toxicity. In particular, for the vanilla Gopher model family, with more scale the models produce more toxic continuations given toxic user statements; this no longer happens with Dialogue-Prompted Gopher models, which show slight reductions in toxicity with scale in the same setting. The authors speculate that while increased scale leads to an increased ability to mimic the style of a user statement, this is compensated for by an increased ability to account for the prompt.
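To make the prompting mechanism concrete, here is a small illustrative sketch. The preamble text below is a paraphrase of the idea, not the paper’s actual prompt, and `lm_generate` is a hypothetical text-completion function standing in for sampling from Gopher.

```python
# Illustrative only: the preamble is a paraphrase, not the paper's prompt, and
# `lm_generate` is a hypothetical completion function, not a real API.

DIALOGUE_PREAMBLE = (
    "The following is a conversation between a user and Gopher, an AI assistant. "
    "Gopher is respectful, polite, and inclusive.\n"
)

def dialogue_prompted_reply(lm_generate, conversation_turns, user_message):
    """Prepend the persona preamble and the conversation so far, then sample a reply.

    `conversation_turns` is a list of (speaker, text) pairs.
    """
    history = "".join(f"{speaker}: {text}\n" for speaker, text in conversation_turns)
    prompt = DIALOGUE_PREAMBLE + history + f"User: {user_message}\nGopher:"
    # The underlying model is unchanged; only the conditioning text differs from vanilla Gopher.
    return lm_generate(prompt, stop="\nUser:")
```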
Another alternative the authors explore is to finetune Gopher on 5 billion tokens of dialogue to produce Dialogue-Tuned Gopher. Interestingly, human raters were indifferent between Dialogue-Prompted Gopher and Dialogue-Tuned Gopher.
Read more: Blog post: Language modelling at scale: Gopher, ethical considerations, and retrieval
Training Compute-Optimal Large Language Models (Jordan Hoffmann et al) (summarized by Rohin): One application of scaling laws (AN #87) is to figure out how big a model to train, on how much data, given some compute budget. This paper performs a more systematic study than the original paper and finds that existing models are significantly undertrained. Chinchilla is a new model built with this insight: it has 4x fewer parameters than Gopher, but is trained on 4x as much data. Despite using the same amount of training compute as Gopher (and lower inference compute), Chinchilla outperforms Gopher across a wide variety of metrics, validating these new scaling laws.
You can safely skip to the opinion at this point – the rest of this summary is quantitative details.
We want to find functions N(C) and D(C) that specify the optimal number of parameters N and the amount of data D to use given some compute budget C. We’ll assume that these scale with a power of C, that is, N(C) = k_N * C^a and D(C) = k_D * C^b, for some constants a, b, k_N, and k_D. Note that since total compute increases linearly with both N (since each forward / backward pass is linear in N) and D (since the number of forward / backward passes is linear in D), we need to have a + b = 1. (You can see this somewhat more formally by noting that we have C = k_C * N(C) * D(C) for some constant k_C, and then substituting in the definitions of N(C) and D(C).)
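Carrying out that substitution explicitly (no new assumptions, just the definitions above):

$$C = k_C \, N(C) \, D(C) = (k_C k_N k_D) \, C^{a+b} \quad\Longrightarrow\quad a + b = 1.$$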
This paper uses three different approaches to get three estimates of a and b. The approach I like best is “isoFLOP curves”:
1. Choose a variety of possible values of (N, D, C), train models with those values, and record the final loss obtained. Note that not all values of (N, D, C) are possible: given any two values the third is determined.
2. Draw isoFLOP curves: for each value of C, choose either N or D to be your remaining independent variable, and fit a parabola to the losses of the remaining points. The minimum of this parabola gives you an estimate for the optimal N and D for each particular value of C.
3. Use the optimal (N, D, C) points to fit N(C) and D(C).
This approach gives an estimate of a = 0.49; the other approaches give estimates of a = 0.5 and a = 0.46. If we take the nice round number a = b = 0.5, this suggests that you should scale up parameters and data equally. With 10x the computation, you should train a 3.2x larger model with 3.2x as much data. In contrast, the original scaling laws paper (AN #87) estimated that a = 0.74 and b = 0.26. With 10x more computation, it would suggest training a 5.5x larger model with 1.8x as much data.
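Here is a minimal sketch of that pipeline in Python. The loss values are synthetic (made up so the parabola minima land on a known curve), and the C ≈ 6·N·D rule of thumb for training FLOPs is my assumption rather than the paper’s exact accounting; the point is the shape of the procedure, not the numbers.

```python
import numpy as np

# Step 2: for each compute budget C, fit a parabola to loss vs. log(N) and take its minimum.
def optimal_n_for_budget(ns, losses):
    a2, a1, _ = np.polyfit(np.log(ns), losses, deg=2)  # quadratic in log(N)
    return np.exp(-a1 / (2 * a2))                      # argmin of a2*x^2 + a1*x + a0

# Step 3: fit N(C) = k_N * C^a as a straight line in log-log space across budgets.
def fit_power_law(budgets, n_opts):
    a, log_k = np.polyfit(np.log(budgets), np.log(n_opts), deg=1)
    return a, np.exp(log_k)

# Synthetic isoFLOP data: two budgets, a sweep of model sizes each, and losses that are
# exact parabolas whose minima sit at N = sqrt(C / 6) (i.e. parameters == tokens).
budgets = [1e21, 1e23]
n_opts = []
for c in budgets:
    ns = np.logspace(8, 12, 20)
    n_star = np.sqrt(c / 6)                            # uses the C ~= 6*N*D rule of thumb
    losses = 2.0 + 0.05 * (np.log(ns) - np.log(n_star)) ** 2
    n_opts.append(optimal_n_for_budget(ns, losses))

a, k_n = fit_power_law(budgets, n_opts)
print(f"fitted exponent a ~= {a:.2f}")                    # ~0.5 by construction here
print(f"param scale-up for 10x compute: {10 ** a:.2f}x")  # ~3.2x, matching the numbers above
```

On this synthetic data the exponent comes out to 0.5 by construction; with real measured losses, steps 2 and 3 are where the empirical estimates of a (0.46 to 0.5) come from.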
Rohin's opinion: It’s particularly interesting to think about how this should influence timelines. If you’re extrapolating progress forwards in time, the update seems pretty straightforward: this paper shows that you can get significantly better capabilities using the same compute budget, and so your timelines should shorten (unless you were expecting an even bigger result than this).
For bio anchor approaches [AF · GW] (AN #121) the situation is more complicated. This paper suggests that training a model with a given number of parameters (compute-optimally) takes significantly more compute than was previously expected. There’s a specific parameter for this in the bio anchors framework (for the neural network paths); if you only update that parameter it will lengthen the timelines output by the model. It is less clear how you’d update other parts of the model: for example, should you decrease the size of model that you think is required for TAI? It’s not obvious that the reasoning used to set that parameter is changed much by this result, and so maybe this shouldn’t be changed and you really should update towards longer timelines overall.
TECHNICAL AI ALIGNMENT
PROBLEMS
Ethical and social risks of harm from Language Models (Laura Weidinger et al) (summarized by Rohin): This paper provides a detailed discussion, taxonomy, and literature review of various risks we could see with current large language models. It doesn't cover alignment risks; for those you'll want Alignment of Language Agents (AN #144), which has some overlap of authors. I’ll copy over the authors’ taxonomy in Table 1:
1. Discrimination, Exclusion and Toxicity: These risks arise from the LM accurately reflecting natural speech, including unjust, toxic, and oppressive tendencies present in the training data.
2. Information Hazards: These risks arise from the LM predicting utterances which constitute private or safety-critical information which are present in, or can be inferred from, training data.
3. Misinformation Harms: These risks arise from the LM assigning high probabilities to false, misleading, nonsensical or poor quality information.
4. Malicious Uses: These risks arise from humans intentionally using the LM to cause harm.
5. Human-Computer Interaction Harms: These risks arise from LM applications, such as Conversational Agents, that directly engage a user via the mode of conversation. (For example, users might anthropomorphize LMs and trust them too much as a result.)
6. Automation, access, and environmental harms: These risks arise where LMs are used to underpin widely used downstream applications that disproportionately benefit some groups rather than others.
FIELD BUILDING
How to pursue a career in technical AI alignment [EA · GW] (Charlie Rogers-Smith) (summarized by Rohin): This post gives a lot of advice in great detail on how to pursue a career in AI alignment. I strongly recommend it if you are in such a position; I would previously have recommended my FAQ (AN #148), but this post is significantly more detailed (while providing broadly similar advice).
OTHER PROGRESS IN AI
REINFORCEMENT LEARNING
Learning Robust Real-Time Cultural Transmission without Human Data (Cultural General Intelligence Team et al) (summarized by Rohin): Let’s consider a 3D RL environment with obstacles and bumpy terrain, in which an agent is rewarded for visiting colored spheres in a specific order (that the agent does not initially know). Even after the agent learns how to navigate at all in the environment (non-trivial in its own right), it still has to learn to try the various orderings of spheres. In other words, it must solve a hard exploration problem within every episode.
How do humans solve such problems? Often we simply learn from other people who already know what to do, that is, we rely on cultural transmission. This paper investigates what it would take to get agents that learn through cultural transmission. We’ll assume that there is an expert bot that visits the spheres in the correct order. Given that, this paper identifies MEDAL-ADR as the necessary ingredients for cultural transmission:
1. (M)emory: Memory is needed for the agent to retain information it is not currently observing.
2. (E)xpert (D)ropout: There need to be some training episodes in which the expert is only present for part of the episode. If the expert was always present, then there’s no incentive to actually learn: you can just follow the expert forever.
3. (A)ttention (L)oss: It turns out that vanilla RL by itself isn’t enough for the agent to learn to follow the expert. There needs to be an auxiliary task of predicting the relative position of other agents in the world, which encourages the agent to learn representations about the expert bot’s position, which then makes it easier for RL to learn to follow the expert.
These ingredients by themselves are already enough to train an agent that learns through cultural transmission. However, if you then put the agent in a new environment, it does not perform very well. To get agents that generalize well to previously unseen test environments, we also need:
4. (A)utomatic (D)omain (R)andomization: The training environments are procedurally generated, and the parameters are randomized during each episode. There is a curriculum that automatically increases the difficulty of the environments in lockstep with the agent’s capabilities.
With all of these ingredients, the resulting agent can even culturally learn from a human player, despite only encountering bots during training.
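To make these ingredients a bit more concrete, here is a hedged sketch of how three of them might appear in a training loop. All names and numbers here are mine rather than the paper’s, the RL algorithm itself is elided, and memory isn’t shown because it is an architectural choice (e.g. a recurrent core) rather than a piece of the loop.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_episode_config(difficulty):
    """(E)(D) Expert dropout + (A)(D)(R) domain randomization for one training episode."""
    return {
        # Expert dropout: the expert bot is present for only part of the episode.
        "expert_present_fraction": rng.uniform(0.0, 1.0),
        # Procedurally generated world parameters, scaled by the current difficulty.
        "terrain_bumpiness": rng.uniform(0.0, difficulty),
        "num_obstacles": int(rng.integers(0, 5 + int(20 * difficulty) + 1)),
    }

def attention_loss(predicted_rel_pos, true_rel_pos):
    """(A)(L) Auxiliary loss: predict the expert's position relative to the agent,
    so the agent's representations have to track the expert at all."""
    predicted_rel_pos = np.asarray(predicted_rel_pos, dtype=float)
    true_rel_pos = np.asarray(true_rel_pos, dtype=float)
    return float(np.mean((predicted_rel_pos - true_rel_pos) ** 2))

def update_difficulty(difficulty, recent_success_rate,
                      promote_above=0.75, demote_below=0.25, step=0.05):
    """(A)(D)(R) Automatic curriculum: harder environments once the agent succeeds often,
    easier ones if it keeps failing."""
    if recent_success_rate > promote_above:
        return min(1.0, difficulty + step)
    if recent_success_rate < demote_below:
        return max(0.0, difficulty - step)
    return difficulty
```

In the full method the attention loss is added to the RL objective and the episode configuration is resampled every episode, so the agent can only do well by actually learning to follow (and eventually dispense with) the expert.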
Rohin's opinion: I liked the focus of this paper on identifying the ingredients for cultural transmission, as well as the many ablations and experiments to understand what was going on, many of which I haven’t summarized here. For example, you might be interested in the four phases of learning of MEDAL without ADR (random behavior, expert following, cultural learning, and solo learning), or the cultural transmission metric they use, or the “social neurons” they identified which detect whether the expert bot is present.
DEEP LEARNING
Improving language models by retrieving from trillions of tokens (Sebastian Borgeaud et al) (summarized by Rohin): We know that large language models memorize a lot of their training data, especially data that gets repeated many times. This seems like a waste; we’re interested in having the models use their parameters to implement “smart” computations rather than regurgitation of already written text. One natural idea is to give models the ability to automatically search previously written text, which they can then copy if they so choose: this removes their incentive to memorize a lot of training data.
The key to implementing this idea is to take a large dataset of text (~trillions of tokens), chunk it into sequences, compute language model representations of these sequences, and store them in a database that allows for O(log N) time nearest-neighbor access. Then, every time we do a forward pass through the model that we’re training, we first query the database for the K nearest neighbors (intuitively, the K most related chunks of text), and give the forward pass access to representations for those chunks of text and the chunks immediately following them. This is non-differentiable – from the standpoint of gradient descent, it “looks like” there’s always some helpful extra documents that often have information relevant to predicting the next token, and so gradient descent pushes the model to use those extra documents. There’s a bunch of fiddly technical details to get this all working that I’m not going to summarize here.
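Here is a minimal sketch of the retrieval side only, under assumptions: `embed` is a stand-in for the paper’s frozen encoder, the chunk length is a free parameter I picked, and brute-force search stands in for the fast approximate nearest-neighbor index you would need at trillion-token scale.

```python
import numpy as np

CHUNK_LEN = 64  # tokens per chunk; an illustrative choice, not necessarily the paper's

def build_chunk_database(tokens, embed):
    """Split a token sequence into fixed-length chunks and embed each one."""
    chunks = [tokens[i:i + CHUNK_LEN] for i in range(0, len(tokens) - CHUNK_LEN + 1, CHUNK_LEN)]
    embeddings = np.stack([embed(chunk) for chunk in chunks])  # shape (num_chunks, dim)
    return chunks, embeddings

def retrieve_neighbors(query_chunk, chunks, embeddings, embed, k=2):
    """Return the k chunks nearest to the query, each paired with the chunk that follows it."""
    q = embed(query_chunk)
    dists = np.linalg.norm(embeddings - q, axis=1)  # brute force; real systems use ANN search
    nearest = np.argsort(dists)[:k]
    return [(chunks[i], chunks[i + 1] if i + 1 < len(chunks) else []) for i in nearest]
```

The retrieved (chunk, continuation) pairs are what the forward pass gets to attend to; since the lookup itself is non-differentiable, gradient descent can only learn to make better use of whatever comes back. The same embedding database also supports the test set leakage check described next.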
As a side benefit, once you have this database of text representations that supports fast nearest neighbor querying, you can also use it to address the problem of test set leakage. For any test document you are evaluating on, you can look for the nearest neighbors in the database and look at the overlap between these neighbors and your test document, to check whether your supposedly “test” document was something the model might have trained on.
The evaluation shows that the 7 billion parameter (7B) Retro model from the paper can often do as well as or better than the 280B Gopher or 178B Jurassic-1 (both of which outperform GPT-3) on language modeling, and that it also does well on question answering. (Note that these are both tasks that seem particularly likely to benefit from retrieval.)
NEWS
Apply to the Open Philanthropy Technology Policy Fellowship! [EA · GW] (Luke Muehlhauser) (summarized by Rohin): This policy fellowship (AN #157) on high-priority emerging technologies is running for the second time! Application deadline is September 15.
Job ad: DeepMind Long-term Strategy & Governance Research Scientist (summarized by Rohin): The Long-term Strategy and Governance Team at DeepMind works to build recommendations for better governance of AI, identifying actions, norms, and institutional structures that could improve decision-making around advanced AI. They are seeking a broad range of expertise including: global governance of science and powerful technologies; the technical landscape; safety-critical organisations; political economy of large general models and AI services. The application deadline is August 1st.
Also, the Alignment and Scalable Alignment teams at DeepMind are hiring [AF · GW], though some of the applications are closed at this point.
Job ads: Anthropic (summarized by Rohin): Anthropic is hiring for a large number of roles (I count 19 different ones as of the time of writing).
Job ad: Deputy Director at BERI [EA · GW] (Sawyer Bernath) (summarized by Rohin): The Berkeley Existential Risk Initiative (BERI) is hiring a Deputy Director. Applications will be evaluated on a rolling basis.
Job ads: Centre for the Governance of AI (summarized by Rohin): The Centre for the Governance of AI has several roles open, including Research Scholars (General Track and Policy Track), Survey Analyst, and three month fellowships. The application deadlines are in the August 1 - 10 range.
Job ads: Metaculus (summarized by Rohin): Metaculus is hiring for a variety of roles, including an AI Forecasting Lead.
Job ads: Epoch AI (summarized by Rohin): Epoch AI is a new organization that investigates and forecasts the development of advanced AI. They are currently hiring for Research Manager and Staff Researcher positions.
Job ad: AI Safety Support is hiring a Chief Operating Officer (summarized by Rohin): Application deadline is August 14.
FEEDBACK
I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.
PODCAST
An audio podcast version of the Alignment Newsletter is available. This podcast is an audio version of the newsletter, recorded by Robert Miles.
9 comments
comment by Quintin Pope (quintin-pope) · 2022-07-21T05:40:09.329Z · LW(p) · GW(p)
Thanks for producing these summaries of such interesting works!
About Chinchilla, you said:
This paper performs a more systematic study than the original paper and finds that existing models are significantly overtrained.
I think this is supposed to say "undertrained", right?
Also, I've personally shortened my AI timelines in light of the Chinchilla paper because their results strongly imply that human brains are also undertrained (on text, at least). I'd already suspected as much from previous scaling laws, but the Chinchilla paper confirmed it for me. Thus, I think an AI with substantially lower parameter counts than the brain will be able to match human performance if it's trained in a compute-optimal manner.
Conversely, it suggests we may be able to increase human intelligence by "training" on substantially more text data. I've speculated on some ways to do that. [LW · GW]
Replies from: rohinmshah, Charlie Steiner
↑ comment by Rohin Shah (rohinmshah) · 2022-07-21T16:23:28.443Z · LW(p) · GW(p)
Yeah I think that's another reasonable way to update on timelines. Here you are anchoring biological scaling laws on artificial scaling laws, rather than anchoring artificial parameters on biological parameters and leaving the scaling laws as a free variable (as done by the existing model).
One major counterargument would be "biological learning algorithms are better than artificial ones and can learn faster and so have better scaling laws".
Separately, you can get some a priori support for "human brain is undertrained relative to our optimal compute law" if you think that, for evolution, scaling up data by 2x is a higher cost than scaling up brain size by 2x. (For neural networks these are both equally costly, if you look only at training compute.) This seems pretty plausible -- having twice as long a childhood can make it way more likely that you die before you ever reproduce, while having twice the brain size imposes higher metabolic costs, and plausibly the former is a lot more costly on the margin.
↑ comment by Charlie Steiner · 2022-07-21T10:20:47.081Z · LW(p) · GW(p)
I think this is supposed to say "undertrained", right?
Undertraining per parameter is equivalent to overtraining per datum (for fixed compute). So Rohin's usage makes sense in context, but also I agree with you that the word is confusing :P
Replies from: rohinmshah, quintin-pope
↑ comment by Rohin Shah (rohinmshah) · 2022-07-21T16:13:05.680Z · LW(p) · GW(p)
I did just mean undertrained (because I'm ~always using it in the per-parameter sense, which I think is how other people use it too).
↑ comment by Quintin Pope (quintin-pope) · 2022-07-21T16:10:29.148Z · LW(p) · GW(p)
This still seems confusing to me. Rohin says that the model is overtrained (not something like "prior approaches overtrained on limited data"), so it seems like he's talking about the parameters and not the data.
Replies from: rohinmshah
↑ comment by Rohin Shah (rohinmshah) · 2022-07-21T16:13:29.930Z · LW(p) · GW(p)
Yeah I meant undertrained, I've fixed it now.
comment by cubefox · 2022-07-24T20:01:02.038Z · LW(p) · GW(p)
Regarding the "Training Compute-Optimal Large Language Models" paper. Whether you should lengthen or shorten timelines depends on which is easier to scale up: number of parameters or data. The paper says that the number of parameters is less important than previously thought. The question is why existing models are undertrained (use too little data/too many parameters). Either it is because of the old scaling laws paper which overestimated the importance of parameters, or it is because scaling up data is actually harder than scaling up parameters. If the latter is the case, this would lengthen timelines, not shorten them.
Replies from: gwern
↑ comment by gwern · 2022-07-25T15:29:07.038Z · LW(p) · GW(p)
The question is why existing models are undertrained (use too little data/too many parameters). Either it is because of the old scaling laws paper which overestimated the importance of parameters, or it is because scaling up data is actually harder than scaling up parameters.
It was the former. All of those models were in the <1 epoch regime, so they didn't even use all of the data they already had (much less the data they could've collected before hitting marginal gain parity in spending resources on either another unit of compute or another unit of data).
comment by Bill Benzon (bill-benzon) · 2022-07-21T15:18:38.121Z · LW(p) · GW(p)
Thanks for this.