Estimating training compute of Deep Learning models

post by lennart, Jsevillamol, Marius Hobbhahn (marius-hobbhahn), Tamay Besiroglu (tamay-besiroglu), anson.ho · 2022-01-20T16:12:43.497Z · LW · GW · 4 comments


by Jaime Sevilla, Lennart Heim, Marius Hobbhahn, Tamay Besiroglu, and Anson Ho

You can find the complete article here. We provide a short summary below.

In short: To estimate the compute used to train a Deep Learning model we can either: 1) directly count the number of operations needed or 2) estimate it from GPU time.

Method 1: Counting operations in the model
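As a rough illustration of the operation-counting approach, the sketch below uses the common rule of thumb that training compute for a dense transformer is about 6 FLOP per parameter per training token (roughly 2 for the forward pass and 4 for the backward pass). The exact constant depends on architecture, and the model sizes in the example are purely illustrative:

```python
def training_compute_flop(num_params: float, num_tokens: float) -> float:
    """Approximate training compute for a dense transformer.

    Uses the rule of thumb C ~ 6 * N * D: about 2 FLOP per parameter
    per token for the forward pass and 4 for the backward pass.
    """
    return 6 * num_params * num_tokens

# Illustrative numbers only: a 175e9-parameter model trained on
# 300e9 tokens (GPT-3-scale figures, used here as an example).
compute = training_compute_flop(175e9, 300e9)
print(f"{compute:.2e} FLOP")  # 3.15e+23 FLOP
```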

Method 2: GPU time

We are uncertain which utilization rate is best, but we recommend 30% for large language models and 40% for other models.
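The GPU-time method can be sketched as multiplying the number of accelerators, training time, peak throughput, and an assumed utilization rate. The hardware figures below (an A100-class peak of 312e12 FLOP/s, a 30-day run on 1,000 GPUs) are hypothetical placeholders, and the 30% utilization is the LLM figure recommended above:

```python
SECONDS_PER_DAY = 86_400

def compute_from_gpu_time(num_gpus: int, days: float,
                          peak_flops: float, utilization: float) -> float:
    """Estimate training compute from hardware usage:
    compute = #GPUs * training time * peak FLOP/s * utilization rate.
    """
    return num_gpus * days * SECONDS_PER_DAY * peak_flops * utilization

# Hypothetical run: 1,000 GPUs with 312e12 peak FLOP/s each,
# trained for 30 days at the 30% utilization suggested for LLMs.
estimate = compute_from_gpu_time(1_000, 30, 312e12, 0.30)
print(f"{estimate:.2e} FLOP")  # 2.43e+23 FLOP
```

The utilization factor is the main source of error here, which is why the post recommends fixed defaults rather than trusting peak hardware specs.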

You can read more about method 1 here and about method 2 here.

Other parts of the article that may be of interest include:

Complete Article

You can find the article here.

4 comments

Comments sorted by top scores.

comment by Edouard Harris · 2022-01-21T15:26:22.730Z · LW(p) · GW(p)

This is fantastic. Really appreciate both the detailed deep-dive in the document, and the summary here. This is also timely, given that teams working on superscale models with concerning capabilities haven't generally been too forthcoming with compute estimates. (There are exceptions.)

As you and Alex point out in the sibling thread, the biggest remaining fudge factors seem to be:

  1. Mixture models (or any kind of parameter-sharing, really) for the first method, which will cause you to systematically overestimate the "Operations per forward pass" factor; and
  2. Variable effective utilization rates of custom hardware for the second method, which will cause an unknown distribution of errors in the "utilization rate" factor.

Nonetheless, my rough guess would be that your method is pretty much guaranteed to be right within an OOM, and probably within a factor of 2 or less. That seems pretty good! It's certainly an improvement over anything I've seen previously along these lines. Congrats!

comment by A Ray (alex-ray) · 2022-01-20T20:58:49.841Z · LW(p) · GW(p)

This seems pretty well done!  Some thoughts on future research in this direction:

  • It seems like you probably could have gotten certainty about compute for at least a handful of the models studied (either because the model was open sourced, or because you have direct access to the org that trained it, like Eleuther) -- it would be interesting to see how the estimation methods compare to the exact answer in those cases.  (Probably doable with GPT-J, for example)
  • While I agree with dropout not significantly reducing computation I think two more contemporary techniques are worth considering here: structured sparsity in weights ('blocksparse'), and mixture-of-experts gating ('switch transformer').  I think the latter is more important because it changes both the training compute and inference compute.
  • Comparing custom ML hardware (e.g. Google's TPUs or Baidu's Kunlun, etc) is tricky to put on these sorts of comparisons.  For those I think the MLPerf Benchmarks are super useful.  I'd be curious to hear the authors' expectations of how this research changes in the face of more custom ML hardware.
  • In general I think it'd be good to integrate a bunch of the performance benchmarks that are publicly available (since hardware providers are usually pretty eager to show off stats that make their hardware look good) into calibrations for this method.  It's also usually pretty straightforward to compute the operations and exact utilization in these runs, since they're heavily standardized on the exact model and dataset.
Replies from: Jsevillamol, lennart
comment by Jsevillamol · 2022-01-21T00:15:54.127Z · LW(p) · GW(p)

Thank you Alex! You make some great points.

It seems like you probably could have gotten certainty about compute for at least a handful of the models studied in question

We thought so too - but in practice it has been surprisingly hard. Profilers are buggy. Our colleague Marius looked into this more in depth here [AF · GW].

Maybe we are just going about it the wrong way. If someone here figures out how to directly measure compute in e.g. a PyTorch or TensorFlow model, it would be a huge boon to us.

I think two more contemporary techniques are worth considering here: structured sparsity in weights ('blocksparse'), and mixture-of-experts gating ('switch transformer')

Great suggestions! I think those would be great caveats to look into in the future.

I'd be curious to hear the authors' expectations of how this research changes in the face of more custom ML hardware.

My naive impression is that our conclusions do not change much. You would just need to plug the effective performance of the custom hardware into the second formula.

Probably the trickiest part might be figuring out the utilization rate for the custom hardware - though this is a general problem with the second method.

In general I think it'd be good to integrate a bunch of the performance benchmarks that are publicly available (since hardware providers are usually pretty eager to show off stats that make their hardware look good) into calibrations for this method.

I think that would be nice! We started a public spreadsheet with some info on different hardware. This might be of help to someone who wants to dig deeper into the topic!

comment by lennart · 2022-01-31T14:58:55.667Z · LW(p) · GW(p)

Comparing custom ML hardware (e.g. Google's TPUs or Baidu's Kunlun, etc) is tricky to put on these sorts of comparisons. For those I think the MLPerf Benchmarks are super useful. I'd be curious to hear the authors' expectations of how this research changes in the face of more custom ML hardware.

I'd be pretty excited to see more work on this. Jaime already shared our hardware sheet, where we collect information on GPUs, but as you outline, those figures are peak performance and can be misleading.

Indeed, the MLPerf benchmarks are useful. I've already gathered their data in this sheet and would love to see someone playing around with it. Next to MLPerf, Lambda Labs also shares some standardized benchmarks.