Updates on scaling laws for foundation models from 'Transcending Scaling Laws with 0.1% Extra Compute'

post by Nick_Greig · 2022-11-18T12:46:45.563Z

This is a question post.

Contents

  Answers
    7 Lawrence Chan
    1 Sheikh Abdur Raheem Ali

I am not sure if this paper is flying under the radar for many people, but has anyone read "Transcending Scaling Laws with 0.1% Extra Compute"? If so, how do you think it compares to the scaling laws presented in DeepMind's "An empirical analysis of compute-optimal large language model training" (the Chinchilla paper)? Does it make you rethink the importance of dataset size (again)?

Answers

answer by LawrenceC (Lawrence Chan) · 2022-11-21T00:55:54.235Z

tl;dr: The shape of the curve probably doesn't change, but compute-optimal LM training will use less data than the Chinchilla scaling law suggests.

One of the takeaways from the last two years of LM progress is that GPT-3/Chinchilla's next-token-prediction objective is not the most efficient way to use data.* Instead, objectives that require the model to infill missing tokens in the middle of a text string, like the T5 span-corruption objective or the UL2 objective, are much more efficient per unit of data.
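
To make "infilling" concrete, here is a minimal, simplified sketch of a T5-style span-corruption example. The whitespace tokenization and the <extra_id_N> sentinel strings are illustrative, not the papers' exact setup (real implementations corrupt random spans of subword tokens and append a closing sentinel to the target):

```python
def span_corrupt(tokens, spans):
    """Replace each (start, length) span with a sentinel token; the target is the
    concatenation of sentinels plus the dropped tokens. `spans` is assumed to be
    sorted and non-overlapping."""
    corrupted, target = [], []
    prev = 0
    for i, (start, length) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        corrupted += tokens[prev:start] + [sentinel]
        target += [sentinel] + tokens[start:start + length]
        prev = start + length
    corrupted += tokens[prev:]
    return corrupted, target

if __name__ == "__main__":
    text = "the quick brown fox jumps over the lazy dog".split()
    inp, tgt = span_corrupt(text, spans=[(2, 2), (6, 2)])
    print("input: ", " ".join(inp))   # the quick <extra_id_0> jumps over <extra_id_1> dog
    print("target:", " ".join(tgt))   # <extra_id_0> brown fox <extra_id_1> the lazy
```

The model sees the corrupted input and is trained to predict the dropped spans, rather than only predicting the next token left to right.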

Figure 2 of the Tay et al. UL2R paper shows that UL2R finetuning is equivalent to either a multiplicative or a constant (additive) increase in training FLOPs. Assuming that the improvement holds across the board, this means that UL2R finetuning makes models ~1.5-3x more data efficient. So if the optimal trade-off for X FLOPs used to be Y params times Z tokens, then with a better objective (or with finetuning on a better objective), we might see 1.5Y params and 0.66Z tokens.

It's worth noting that this still implies a linear relationship between the optimal param count and token count; it's just that with a better objective, the optimum shifts toward more params and fewer tokens than what the next-token-log-loss-based Chinchilla scaling laws would predict.
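
A rough numeric sketch of that shift, assuming the standard C ≈ 6ND FLOPs approximation, the Chinchilla ~20-tokens-per-parameter rule of thumb, and a hypothetical 1.5x data-efficiency factor taken from the estimate above (the factor and the way it moves the optimum are assumptions for illustration, not results from the paper):

```python
def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Compute-optimal params N and tokens D under C = 6*N*D and D = tokens_per_param * N."""
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

def shifted_optimum(compute_flops, data_efficiency=1.5, tokens_per_param=20.0):
    """Heuristic shift described above: a k-x more data-efficient objective gives
    ~k-x more params and ~1/k as many tokens at the same total training FLOPs."""
    n_params, n_tokens = chinchilla_optimal(compute_flops, tokens_per_param)
    return n_params * data_efficiency, n_tokens / data_efficiency

if __name__ == "__main__":
    C = 5.76e23  # roughly Chinchilla-70B's training compute, in FLOPs
    n, d = chinchilla_optimal(C)
    n2, d2 = shifted_optimum(C, data_efficiency=1.5)
    print(f"Chinchilla-style optimum:            {n:.2e} params, {d:.2e} tokens")
    print(f"With a 1.5x data-efficient objective: {n2:.2e} params, {d2:.2e} tokens")
    print(f"Compute check: {6*n*d:.2e} vs {6*n2*d2:.2e} FLOPs")
```

Plugging in Chinchilla's budget recovers roughly 70B params / 1.4T tokens for the baseline, and about 105B params / 0.9T tokens with the hypothetical 1.5x factor, at the same total compute.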

* Arguably, we knew this from BERT, where you'd get better finetuned performance on downstream tasks if you pretrained with bidirectional objectives, but I think the result that the next-token prediction objective is worse for text generation tasks is new. 

answer by Sheikh Abdur Raheem Ali · 2022-11-19T22:19:22.747Z

Mostly already did my updates when “Efficient Training of Language Models to Fill in the Middle” https://arxiv.org/abs/2207.14255 came out.

No comments
