"Scaling Laws for Autoregressive Generative Modeling", Henighan et al 2020 {OA}

post by gwern · 2020-10-29T01:45:30.666Z · LW · GW · 11 comments

This is a link post for https://arxiv.org/abs/2010.14701


comment by gwern · 2020-10-30T00:10:00.798Z · LW(p) · GW(p)

My short summary so far:

GPT-3 was not a fluke nor language-specific: all modalities tested---math, video, images, text, combined---scale cleanly and in the same way where bigger models = better; the unsupervised/pretrained models then transfer to supervised learning, like image classification. GPT-3 all the things!

Replies from: daniel-kokotajlo
comment by Daniel Kokotajlo (daniel-kokotajlo) · 2020-10-30T07:20:35.794Z · LW(p) · GW(p)

I think it also mentioned that it isn't architecture-specific either; bigger LSTMs scale similarly to bigger transformers, they are just worse. IIRC.

Replies from: gwern
comment by gwern · 2020-10-30T14:18:16.210Z · LW(p) · GW(p)

Up to a certain limit; Kaplan covers this in the talk a bit with reference to the RNN scaling curves in Kaplan et al 2020 - RNNs scale similarly to Transformers, with a worse constant in terms of compute, but they make bad use of context. After a few hundred tokens, the history has vanished. This is the usual RNN problem: theoretically, the history is unlimited, but as has been observed long before, the history is de facto limited to a few hundred tokens, while Transformers make effective use of history from thousands of timesteps before.

So I interpret this as meaning that NN architectures are all 'universal' in a sense (they all scale similarly, and I'm told that CNNs do too), but what makes Transformers superior is that they are more compute-efficient on current hardware and they optimize much better because, as 'unrolled RNNs', they are equivalently powerful but they have much more direct access to the history (pace residual layers) which makes the credit assignment/learning much easier than RNNs which must squeeze it all into a hidden state rather than recalculating a function with the entire raw history available.

(Lots of potential followup questions here: can you usefully distill a trained Transformer into a parameter & compute-efficient RNN? Can that provide a training signal to meta-learn RNN algorithms which do fix their history/optimization problems? If Transformers work so well because of raw long-range access to history, are RNNs just missing some 'external memory' module which would serve the same purpose? Do RNNs likewise have general scaling curves over all modalities? Where do Mixture-of-Experts flatline and what is the MoE scaling exponent?)

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2020-10-29T09:08:33.666Z · LW(p) · GW(p)

Previously I asked where human level is for text prediction, so that I could take the graph from the GPT-3 paper and extrapolate it (Gwern did the honors). Now I ask: where is human level for the other five tasks depicted in these new graphs? Anybody know? (I searched for "human" in the paper and found nothing.)

Replies from: SDM, Lanrian
comment by Sammy Martin (SDM) · 2020-10-29T18:58:56.093Z · LW(p) · GW(p)

I'm still a bit puzzled by the link between human level on text prediction and 'human level' unconditionally. If I recall our near-bet during the forecasting tournament, our major disagreement was on whether direct scaling of GPT-like systems takes us near to AGI. I think that, because we don't have direct experience with any verbal intelligences in capability between GPT-3 and human brains, we're often impoverished when trying to think about such intelligences. I imagine that a GPT-6 that is almost 'human level on text prediction' could still be extremely deficient in other areas; it would be very weird to converse with, maybe like an amnesiac or confabulator that's very articulate and with good short-term memory.

If language models scale to near-human performance but the other milestones don't fall in the process, and my initial claim is right, that gives us very transformative AI but not AGI. I think that the situation would look something like this:

If GPT-N reaches par-human:

Breakthroughs still remaining:

discovering new action sets

managing its own mental activity

(?) cumulative learning

Milestones reached:

human-like language comprehension

perception and object recognition

efficient search over known facts

So there would be 2 (maybe 3?) breakthroughs remaining. It seems like you think just scaling up a GPT will also resolve those other milestones, rather than just giving us human-like language comprehension. Whereas if I'm right and those curves do extrapolate, what we would get at the end would be an excellent text generator, but it wouldn't be an agent, wouldn't be capable of long-term planning, and couldn't be accurately described as having a utility function over the states of the external world. I don't see any reason why trivial extensions of GPT would be able to do those things either, since they seem like problems that are just as hard as human-like language comprehension. GPT does seem to be making some progress on cumulative learning, though it might need some RL-based help with that, but none at all on managing mental activity for long-term planning or discovering new action sets.

Replies from: daniel-kokotajlo
comment by Daniel Kokotajlo (daniel-kokotajlo) · 2020-10-29T19:10:08.014Z · LW(p) · GW(p)

I agree with you that human level on text prediction is not the same as human level generally, and that we might get the former long before we get the latter. Nevertheless, I think it's possible that the two will come together -- or, more cautiously, that something human-level at text prediction would also be transformative. (Maybe the text predictor by itself wouldn't be an agent, but the text predictor could be re-trained as an agent fairly easily, or combined into a larger system that uses tree search or something and thus is an agent. Maybe it would be good enough at enough things that it would make a ridiculous amount of money automating away many jobs and inspire a huge boost in investment, which would then lead to human-level AGI. Or maybe it would be really good at persuasion... I'm going to write a post about persuasion tools someday soon.)

Replies from: SDM
comment by Sammy Martin (SDM) · 2020-10-29T20:44:59.458Z · LW(p) · GW(p)

I think that it could plausibly be quite transformative in a TAI sense and occur over the next ten years, so perhaps we don't have all that much of a disagreement on that point. I also think (just because we don't have an especially clear idea of how modular intelligence is) that it could be quite uniform and a text predictor could surprise us with humanlike planning.

"Maybe the text predictor by itself wouldn't be an agent, but the text predictor could be re-trained as an agent fairly easily, or combined into a larger system that uses tree search or something and thus is an agent."

This maybe reflects a difference in intuition about how difficult agentive behaviour is to reach compared with language understanding. I would expect a simple tree search algorithm powered by GPT-6 to be... a model with humanlike language comprehension and incredibly dumb agentive behaviour, and that it wouldn't be able to leverage the 'intelligence' of the language model in any significant way, because I see that as a separate problem requiring separate, difficult work. But I could be wrong.

I think there is a potential bias in that human-like language understanding and agentive behaviour have always gone together in human beings - we have no idea what a human-level language model that wasn't human-level intelligent would be like. Since we can't imagine it, we tend to default to imagining a human-in-a-box. I'm trying to correct for this bias by imagining that it might be quite different.

Replies from: daniel-kokotajlo
comment by Daniel Kokotajlo (daniel-kokotajlo) · 2020-10-30T07:10:04.703Z · LW(p) · GW(p)

Nice, that's probably a crux for us. I would expect tree search powered by GPT-6 to be probably pretty agentic. Isn't that how AlphaZero works? Tree search + a win probability predictor?
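As a rough illustration of the "tree search + a win probability predictor" shape, here is a minimal sketch on a toy game. It makes several assumptions that are not from the thread: the game is a tiny take-away game (take 1 to 3 stones, whoever takes the last stone wins), the search is plain depth-limited lookahead rather than the Monte Carlo tree search with policy and value networks that AlphaZero actually uses, and `toy_win_predictor` is a hand-written, hypothetical stand-in for a learned model.

```python
from typing import Callable, List

State = int  # stones remaining; the player to move takes 1-3, last stone wins


def legal_moves(stones: State) -> List[int]:
    """Moves available to the player to move."""
    return [k for k in (1, 2, 3) if k <= stones]


def toy_win_predictor(stones: State) -> float:
    """Hypothetical stand-in for a learned predictor: estimated win
    probability for the player to move. Hand-coded, not trained."""
    return 0.1 if stones % 4 == 0 else 0.9


def search_value(stones: State, depth: int,
                 predictor: Callable[[State], float]) -> float:
    """Win probability for the player to move, from depth-limited search."""
    if stones == 0:
        return 0.0  # the opponent just took the last stone
    if depth == 0:
        return predictor(stones)  # fall back on the predictor at the leaves
    # After our move the opponent is to move, so our chance is 1 - theirs.
    return max(1.0 - search_value(stones - k, depth - 1, predictor)
               for k in legal_moves(stones))


def best_move(stones: State, depth: int,
              predictor: Callable[[State], float]) -> int:
    """Pick the move whose resulting position is worst for the opponent."""
    return max(legal_moves(stones),
               key=lambda k: 1.0 - search_value(stones - k, depth - 1, predictor))


print(best_move(5, 2, toy_win_predictor))  # 1 (leaves the opponent on 4 stones)
```

The search here contributes nothing the predictor does not already know about the game, and the action set is three discrete moves; whether the same wrapper does anything useful over open-ended text actions is exactly the point under dispute.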

Replies from: SDM
comment by Sammy Martin (SDM) · 2020-10-30T14:28:12.882Z · LW(p) · GW(p)

It may well be a crux. An efficient 'tree search' or similar goal-directed wrapper around a GPT-based system, one that can play a role in real-world open-ended planning (presumably planning for an agent to effect outcomes in the real world via its text generation), would have to cover continuous action spaces and states with unknown and shifting sets of possible actions, unlike the action space of Go, which is discrete and small relative to the real universe and so is perfect for a tree search. It would also have to run (or approximate running) millions of primitive steps (individual text generations and exchanges) into the future, for long-term planning towards e.g. a multi-decade goal of the kind humans are capable of.

That sounds like a problem that's at least as hard as building the language-model 'success probability predictor' GPT-N itself (probably with reward-modelling help, so it can optimize for a specific goal with its text generation). Such a system would still be highly transformative, though, if it were human-level at prediction.

To clarify, this is Transformative not 'Radically Transformative' - transformative like Nuclear Power/Weapons, not like a new Industrial Revolution or an intelligence explosion.

"I would expect tree search powered by GPT-6 to be probably pretty agentic."

I could imagine (if you found a domain with a fairly constrained set of actions and states, but which involved text prediction somehow) that you could get agentic behaviour out of a tree search like the ones we currently have + GPT-N + an RL wrapper around the GPT-N. That might well be quite transformative; I could imagine it being very good for persuasion, for example.

comment by Lukas Finnveden (Lanrian) · 2020-10-29T13:43:20.237Z · LW(p) · GW(p)

In this case, it seems especially important whether the purported irreducible entropy is below human-level performance (in which case sufficiently scaled models would outperform humans, if the scaling laws hold up) or above it (in which case the constant loss isn't irreducible at all, but betrays some limits of the models).
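To make the two cases concrete, here is a minimal sketch assuming the "power law plus constant" form L(x) = L_inf + (x0 / x)^alpha, which is my understanding of how the linked paper fits its curves (x standing for compute, model size, or data). The function names and every number below are invented for illustration; none of them are fitted values from the paper, and no human-level loss is given there.

```python
import math
from typing import Optional


def fitted_loss(x: float, l_inf: float, x0: float, alpha: float) -> float:
    """Power law plus constant: L(x) = l_inf + (x0 / x) ** alpha."""
    return l_inf + (x0 / x) ** alpha


def scale_to_reach(target: float, l_inf: float, x0: float,
                   alpha: float) -> Optional[float]:
    """Scale x at which the fitted curve hits `target` loss.

    Returns None when the target is at or below the constant term, because
    the fitted curve only approaches l_inf and never goes under it.
    """
    if target <= l_inf:
        return None
    # Invert L = l_inf + (x0 / x) ** alpha for x.
    return x0 / (target - l_inf) ** (1.0 / alpha)


# Hypothetical parameters, chosen only to keep the arithmetic simple.
L_INF, X0, ALPHA = 2.0, 1.0, 0.5

# Case 1: assumed human-level loss sits above the constant term,
# so the extrapolated curve crosses it at a finite scale.
print(scale_to_reach(2.5, L_INF, X0, ALPHA))  # 4.0

# Case 2: assumed human-level loss sits below the constant term,
# so no amount of scaling along this curve gets there.
print(scale_to_reach(1.8, L_INF, X0, ALPHA))  # None
```

In the second case the constant is a floor for the fitted model family rather than a property of the data, which is the "limits of the models" reading.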

comment by mtaran · 2020-10-29T07:02:19.497Z · LW(p) · GW(p)

https://youtu.be/QMqPAM_knrE is a video by one of the authors presenting on this research