DeepMind's Gopher--more powerful than GPT-3

post by hath · 2021-12-08T17:06:32.650Z · LW · GW · 26 comments

This is a link post for https://deepmind.com/blog/article/language-modelling-at-scale

comment by Quintin Pope (quintin-pope) · 2021-12-08T19:50:55.481Z · LW(p) · GW(p)

It’s good to see DeepMind addressing ethical/safety aspects of their work. The linked blog post isn’t the only thing DeepMind published about the new model. Here is a very long report about many different aspects of the model. Of particular interest is:

We separately consider a retrieval mechanism searching over the training set for relevant extracts during pre-training (Borgeaud et al., 2021), partially avoiding the need to memorise knowledge into network weights. This approach reached GPT-3-level language model performance with a 7 billion parameter model and over a 10× reduction in training compute.
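To make the retrieval idea concrete, here is a minimal, self-contained sketch of retrieval-augmented context construction. It is a toy illustration using bag-of-words cosine similarity, not DeepMind's actual RETRO implementation (which retrieves chunks from a roughly 2-trillion-token database using frozen BERT embeddings); the database contents and query below are made up for the example.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding': a token-count dictionary.
    RETRO itself uses frozen BERT embeddings; this is only a stand-in."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, database, k=2):
    """Return the k training chunks most similar to the query."""
    ranked = sorted(database, key=lambda chunk: cosine(embed(query), embed(chunk)), reverse=True)
    return ranked[:k]

# Hypothetical training chunks standing in for a huge retrieval database.
database = [
    "Gopher is a 280 billion parameter language model trained by DeepMind.",
    "GPT-3 has 175 billion parameters.",
    "The capital of France is Paris.",
]

query = "How many parameters does Gopher have?"
context = "\n".join(retrieve(query, database)) + "\n" + query
print(context)  # retrieved chunks are prepended, so the model need not memorise the fact in its weights
```

The point of the sketch is only the shape of the mechanism: relevant extracts are looked up at prediction time and placed in the context, which is how the approach partially avoids memorising knowledge into network weights.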

In More Christiano, Cotra, and Yudkowsky on AI progress [? · GW], Paul Christiano [LW · GW] said:

like I'm surprised if a clever innovation does more good than spending 4x more compute

when discussing the different predictions made by his and Yudkowsky's models of AI progress. However, Paul was specifically referring to a “clever innovation” that occurs several years in the future. If DeepMind's 10x-more-efficient claim holds up, is that a bigger jump than Paul predicted would be plausible today?

Replies from: paulfchristiano, matthew-barnett, Veedrac, LawChan
comment by paulfchristiano · 2021-12-08T22:56:34.444Z · LW(p) · GW(p)

Just to give the full quote:

And I'd reject LSTM -> transformer or MoE as an example because the quantitative effect size isn't that big.

But if something like that made the difference between "this algorithm wasn't scaling before, and now it's scaling," then I'd be surprised.

And the size of jump that surprises me is shrinking over time. So in a few years even getting the equivalent of a factor of 4 jump from some clever innovation would be very surprising to me.

The text you quoted was clarifying what "factor of 4" means in that sentence.

I'm not surprised by "googling relevant terms and then putting the results in context improves language modeling loss and performance on knowledge-loaded tasks." This looks like basically a great implementation of that idea, along with really solid LM infrastructure in general.

I don't really even have a firm quantitative prediction of how much this kind of thing will improve the LM loss in the world "in a few years" that the quote describes. Note that the effect of this result on downstream performance is almost certainly (much) less than its effect on LM loss, because for most applications you will already be doing something to get relevant information into the LM context (especially for a task that was anywhere near as knowledge-loaded as the LM task, which is usually pretty light on reasoning).

(ETA: as Veedrac points out [LW(p) · GW(p)], it also looks on a first skim like quite a lot of the difference is due to more effectively memorizing nearly-identical text that appeared in the training set, which is even less helpful for downstream performance. So sticking with "even if this is a 10x gain on LM task according to the formal specification, it's not nearly such a big deal for downstream tasks.")

My logic for making predictions about this kind of thing is roughly:

  1. In the next few years LM inference will be using large amounts of compute, with costs likely measured in hundreds of millions per year.
  2. Engineering effort to improve performance on the applications people care about is likely to be in the hundreds of millions or billions.
  3. If low-hanging fruit like this hasn't been plucked at that point, people are probably correct that it's not going to be a huge effect relative to the things they are spending money on.
  4. People will probably be in the regime where doubling model quality costs more like $100M than $10M. Sometimes you will get lucky and find a big winner, but almost never so lucky as a "clever trick that's worth 4x". (Right now I think we are maybe at $10M/doubling, though I don't really know and wouldn't want to speak specifically if I did.)
  5. (And it won't be long after that before doubling model quality for important tasks costs $1B+, e.g. semiconductors are probably in the world where it costs $100B or something to cut costs by half. You'll still sometimes have crazy great startups that spend $100M and manage to get 4x, which grow to huge valuations quickly, but it predictably gets rarer and rarer.)
Replies from: tomasz-darmetko
comment by Tomasz Darmetko (tomasz-darmetko) · 2021-12-09T08:12:10.677Z · LW(p) · GW(p)

I agree with you on tasks where there is not a lot of headroom. But on tasks like International Olympiad-level mathematics and programming, a 4x reduction in model size at constant performance will be small. I expect many 1000x and bigger improvements vs. what scaling laws would currently predict.

For example, on the MATH dataset, "(...) models would need around 10^35 parameters to achieve 40% accuracy", where 40% accuracy is roughly what a PhD student achieves and an International Olympiad participant would get close to 90%. https://arxiv.org/abs/2103.03874

With 100-trillion-parameter models (10^14) we would still be short by a factor of 10^21. So we would need to get some 21 orders of magnitude of improvement in effective model size, at the same performance, from somewhere else.
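As a back-of-the-envelope check of that gap (using the 10^35 extrapolation quoted above and a hypothetical 10^14-parameter model):

```python
import math

# Parameters the MATH paper's scaling extrapolation says would be needed for 40% accuracy.
params_needed = 1e35
# A hypothetical 100-trillion-parameter model.
params_available = 1e14

shortfall = params_needed / params_available
print(f"shortfall factor: {shortfall:.0e}")                 # 1e+21
print(f"orders of magnitude: {math.log10(shortfall):.0f}")  # 21
```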

Worth noticing the 40% vs. 90% gap between expert humans on MATH, and the similar gap on MMLU (Massive Multitask Language Understanding): 35% for an average human vs. 90% for experts. Experts don't have orders-of-magnitude bigger brains, a different architecture, or a different learning algorithm in their brains.

When replying, I also noticed that I made assumptions about what you mean by an x-factor quality improvement. I'm not sure I understood correctly. Could you clarify what you meant precisely?

Replies from: paulfchristiano
comment by paulfchristiano · 2021-12-09T15:33:18.537Z · LW(p) · GW(p)

If you have big communities working on math, I don’t think you will see improvements like 1000x model size (the bigger the community, the harder it will be to get any fixed size of advantage). And I think you will have big communities working on the problem well before it becomes a big deal economically (the bigger the economic deal, the bigger the community). Both of those are quantitative and imperfect and uncertain, but I think they are pretty important rules of thumb for making sense of what happens in the world.

Regarding the IMO disagreement, I think it's very plausible the IMO will be solved before there is a giant community. So that's more of a claim that even now, with not many people working on it, you probably aren't going to get progress that fast. I don't feel like this speaks to either of the two main disagreements with Eliezer, but it does speak to something like "How often do we see jumps that look big to Paul?" where I'm claiming that I have a better sense for what improvements are "surprisingly big."

comment by Matthew Barnett (matthew-barnett) · 2021-12-08T20:06:13.665Z · LW(p) · GW(p)

Also worth noting is that the model was trained in December 2020, a year ago. I don't know when GPT-3 was trained, but if the time-gap between the two is small, that sure looks like a substantial discontinuity in training efficiency. (Though I'd prefer to see long-run data).

Replies from: paulfchristiano, LawChan
comment by paulfchristiano · 2021-12-09T03:22:05.718Z · LW(p) · GW(p)

If two people trained language models at the same time and one was better than the other, would you call it infinitely fast progress?

Replies from: matthew-barnett
comment by Matthew Barnett (matthew-barnett) · 2021-12-09T05:22:04.424Z · LW(p) · GW(p)

I’m confused what you’re asking.

The observation that two SOTA language models trained close together in time were substantially different in measured performance provides evidence of a discontinuity, as defined in the usual sense of a large residual from prior extrapolation.

I can answer your question literally: I don’t think that would be infinitely fast progress. I am genuinely unsure what your point is though. :)

Replies from: Joe_Collman
comment by Joe_Collman · 2021-12-09T20:25:25.582Z · LW(p) · GW(p)

I think there's a significant point here: that it only makes sense to compare with the expected trend rather than with one data point.
In particular, note that if Gopher had been released one day before GPT-3, then GPT-3 wouldn't have been SOTA, and the time-to-achieve-x-progress would look a lot longer.

(FWIW, it still seems like a discontinuity to me)

comment by LawrenceC (LawChan) · 2021-12-08T20:18:13.701Z · LW(p) · GW(p)

GPT-3 appeared on arXiv in May 2020: https://arxiv.org/abs/2005.14165

Though I don't know exactly when it was trained. 

Replies from: conor-sullivan
comment by Lone Pine (conor-sullivan) · 2021-12-08T21:12:18.083Z · LW(p) · GW(p)

It was trained with internet data from October 2019. So it must have been trained between October 2019 and May 2020.

comment by Veedrac · 2021-12-08T22:57:42.673Z · LW(p) · GW(p)

Skimming the Rᴇᴛʀᴏ paper is weird because it looks like there's leakage everywhere, they admit leakage is everywhere, but then they sort of report results like it doesn't matter, even putting a result on their leakiest dataset in their conclusion?

On Wikitext103 and the Pile, Retro outperforms previous models trained on large scale datasets.

It looks to me like Figure 6 is saying the improvement is fairly modest in unleaky datasets?

Maybe someone who has gone over the paper in detail can chime in with thoughts.

Replies from: Charlie Steiner
comment by Charlie Steiner · 2021-12-09T01:34:58.816Z · LW(p) · GW(p)

Could you explain a little more about what you mean by data leakage? Do you mean that complete copies of the text sampled for the evaluation set exist in the training set? Is this one of those things where curating a good dataset is a surprising amount of the work of ML, and so a lot of people haven't done it?

Edit: Oh. I have now looked at the Retro paper. I'd still be interested in hearing your take on what makes different datasets leaky.

Replies from: Veedrac
comment by Veedrac · 2021-12-09T01:49:27.505Z · LW(p) · GW(p)

Yes, exact or near-exact copies of the data existing in the database. One can also easily imagine cases where, for example, Wikitext103 has exact copies removed from the dataset but exact translations remain, or where quotes from a Wikipedia article are interspersed throughout the internet, or some bot-generated website exposes some mangled data in a form the model figured out how to deconstruct.

In general, models will exploit leakage when available. Even non-retrieval models seem to memorize snippets of text fairly effectively, even though that seems like a somewhat difficult task for them architecturally. Datasets which amount to “basically the internet” will have pretty much all the leakage, and the paper all but proves their deduplication was not adequate. I do expect that it is difficult to curate a good dataset for evaluating a model like this.
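To make the deduplication problem concrete, here is a minimal sketch of one common leakage check: word n-gram overlap between an evaluation chunk and the training/retrieval data. This is a hypothetical toy, not the filter used in the RETRO paper, and as noted above exact-match heuristics like this miss translations, paraphrases, and mangled copies of the same text.

```python
def ngrams(text, n=8):
    """Set of word n-grams in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_leaky(eval_chunk, train_chunks, n=8, threshold=0.1):
    """Flag an evaluation chunk if a large fraction of its n-grams
    also appear somewhere in the training/retrieval data."""
    eval_grams = ngrams(eval_chunk, n)
    if not eval_grams:
        return False
    train_grams = set().union(*(ngrams(c, n) for c in train_chunks))
    overlap = len(eval_grams & train_grams) / len(eval_grams)
    return overlap > threshold

# Toy example: the evaluation text is a near-copy of a training document.
train_chunks = ["the quick brown fox jumps over the lazy dog near the river bank today"]
eval_chunk = "the quick brown fox jumps over the lazy dog near the river bank yesterday"
print(is_leaky(eval_chunk, train_chunks))  # True: most 8-grams overlap despite the edit
```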

comment by LawrenceC (LawChan) · 2021-12-08T20:10:09.492Z · LW(p) · GW(p)

However, he was specifically referring to a clever innovation that occurs several years in the future. If DeepMind's 10x-more-efficient claim holds up, is that a bigger jump than Paul Christiano predicted would be plausible today?

I'd also be interested in hearing Paul/Eliezer's takes on RETRO, though I don't think they reported compute scaling curves in their paper? 

Headline quote:

With a 2 trillion token database, our Retrieval-Enhanced Transformer (Retro) obtains comparable performance to GPT-3 and Jurassic-1 on the Pile, despite using 25× fewer parameters.

Relevant figures for performance are figure 1 and figure 3

comment by Signer · 2021-12-08T19:04:03.648Z · LW(p) · GW(p)

a study of ethical and social risks associated with large language models

And somehow nobody cares about potential ethical implications of simulating near-human quantities of neurons.

Replies from: quintin-pope
comment by Quintin Pope (quintin-pope) · 2021-12-08T19:18:45.089Z · LW(p) · GW(p)

280 billion parameters is still far less than the human brain. It’s closer to a rat’s brain. Maybe even smaller than that.

Replies from: Signer, aeviternity1
comment by Signer · 2021-12-08T19:46:12.494Z · LW(p) · GW(p)

Sure, but people do worry about harming rats too much, and, more importantly, by the time we get to actual human level it may already be too late. Like, there is no prepared procedure for stopping that whole process of scaling, no robust humanity-meters to know when you can safely proceed, and even no consensus on relevant abstract ethics.

comment by Lost Futures (aeviternity1) · 2021-12-08T19:46:38.331Z · LW(p) · GW(p)

DeepMind's recent research puts some holes in the already shaky analogy between synapses and parameters. RETRO achieved comparable performance to GPT-3 despite having 25x fewer parameters.

Replies from: paulfchristiano, quintin-pope
comment by paulfchristiano · 2021-12-08T22:59:32.451Z · LW(p) · GW(p)

A human with google also gets way better performance than a human without google on "predict the next word of this website," so I'm not sure this undermines the analogy.

comment by Quintin Pope (quintin-pope) · 2021-12-08T20:01:03.910Z · LW(p) · GW(p)

We’ve known for a while that it’s possible to get good performance with far fewer parameters than BERT/GPT architectures use, e.g., ALBERT. The key point is that Gopher is much smaller and less capable than the human brain, even if we don’t know the appropriate metric by which we should compare such systems.

Replies from: aeviternity1
comment by Lost Futures (aeviternity1) · 2021-12-08T20:21:16.103Z · LW(p) · GW(p)

Agreed; per Sam Altman's statements, improving performance without scaling is also OpenAI's plan for GPT-4. And Gopher is far less capable than a human brain. It's just the "synapses as parameters" analogy that irks me. I see it everywhere, but it isn't reliable and (despite disclaimers that the analogy isn't 1-to-1) leads people to even less reliable extrapolations. Hopefully, a better metric will be devised soon.

comment by LawrenceC (LawChan) · 2021-12-08T20:16:08.325Z · LW(p) · GW(p)

I'm interested in hearing takes from NLP people on the RETRO paper. How impressive is their new retrieval technique? How surprising is RETRO's performance given the amount of compute/tokens/parameters they used? 

Certainly the headline result - 25x fewer parameters to get comparable performance to GPT-3 - is impressive, but I don't know how to evaluate the retrieval part of the model, and I don't think they report compute costs in the paper either?

Replies from: dan-hendrycks
comment by Dan H (dan-hendrycks) · 2021-12-09T05:55:46.220Z · LW(p) · GW(p)

RE: "like I'm surprised if a clever innovation does more good than spending 4x more compute"

Earlier this year, DeBERTaV2 did better on SuperGLUE than models 10x the size and got state of the art.

Models such as DeBERTaV3 can do better on commonsense question answering tasks than models that are tens or several hundreds of times larger.

Model               Accuracy   Parameters
DeBERTaV3-large     84.6       0.4B
T5-11B              83.5       11B
Fine-tuned GPT-3    73.0       175B

https://arxiv.org/pdf/2112.03254.pdf#page=5

Bidirectional models + training ideas + better positional encoding helped more than 4x.

Replies from: paulfchristiano
comment by paulfchristiano · 2021-12-09T15:42:39.843Z · LW(p) · GW(p)

  • ETA: I'm talking about the comparison to SOTA from a new clever trick. I'm not saying that "the cumulative impact of all clever ideas is <4x"; that would be obviously insane. (I don't know how big an improvement DeBERTaV2 is over SOTA. But isn't RoBERTa from August 2019, basically contemporary with SuperGLUE, and doesn't it get 84.6% accuracy with many fewer parameters than T5? So I don't think I care at all about the comparison to T5.)
  • I said I would be surprised in a couple years, not that I would be surprised now.
  • I'm less surprised on SuperGLUE than downstream applications.
  • Much of the reason for the gap seems to be that none of the models you are comparing DeBERTaV2 against seem to be particularly optimized for SuperGLUE performance (in part because it's a new-ish benchmark that doesn't track downstream usefulness that well, so it's not going to be stable until people try on it). (ETA: actually isn't this just because you aren't comparing to SOTA? I think this was probably just a misunderstanding.)
  • Similarly, I expect people to get giant model size gains on many more recent datasets for a while at the beginning (if people try on them), but I think the gains from small projects or single ideas will be small by the time that a larger effort has been made.
Replies from: dan-hendrycks
comment by Dan H (dan-hendrycks) · 2021-12-12T07:20:06.373Z · LW(p) · GW(p)

Note I'm mainly using this as an opportunity to talk about ideas and compute in NLP.

I don't know how big an improvement DeBERTaV2 is over SoTA.

DeBERTaV2 is pretty solid and mainly got its performance from an architectural change. Note the DeBERTa paper was initially uploaded in 2020, but it was updated early this year to include DeBERTa V2. The previous main popular SOTA on SuperGLUE was T5 (which beat RoBERTa). DeBERTaV2 uses 8x fewer parameters and 4x less compute than T5. DeBERTa's high performance isn't an artifact of SuperGLUE; in downstream tasks such as some legal NLP tasks it does better too.

Compared to unidirectional models on NLU tasks, the bidirectional models do far better. On CommonsenseQA, a good task that's been around for a few years, they do far better than fine-tuned GPT-3--DeBERTaV3 differs from GPT-3 in three ideas (roughly the encoding, ELECTRA training, and bidirectionality, if I recall correctly), and it's >400x smaller.
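For readers unfamiliar with the bidirectionality point: mechanically, it comes down to the attention mask. Here is a minimal numpy sketch, an illustration only and not DeBERTa's actual attention (which also uses disentangled relative-position encodings): a causal GPT-style decoder masks out future positions, while a bidirectional BERT/DeBERTa-style encoder attends over the whole sequence.

```python
import numpy as np

seq_len = 5
scores = np.random.randn(seq_len, seq_len)  # toy attention logits, query x key

# Causal (GPT-style) mask: position i may only attend to positions <= i.
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
causal_scores = np.where(causal_mask, scores, -np.inf)

# Bidirectional (BERT/DeBERTa-style) encoder: every position sees every other position.
bidirectional_scores = scores

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

print(softmax(causal_scores)[0])         # first token attends only to itself
print(softmax(bidirectional_scores)[0])  # first token attends to the whole sequence
```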

I agree with the overall sentiment that much of the performance is from brute compute, but even in NLP, ideas can help sometimes. For vision/continuous signals, algorithmic advances continue to account for much progress; ideas move the needle substantially more frequently in vision than in NLP.

For tasks when there is less traction, ideas are even more useful. Just to use a recent example, "the use of verifiers results in approximately the same performance boost as a 30x model size increase." I think the initially proposed heuristic depends on how much progress has already been made on a task. For nearly solved tasks, the next incremental idea shouldn't help much. On new hard tasks such as some maths tasks, scaling laws are worse and ideas will be a practical necessity. Not all the first ideas are obvious "low hanging fruits" because it might take a while for the community to get oriented and find good angles of attack.