[Preprint] The Computational Limits of Deep Learning

post by Gordon Seidoh Worley (gworley) · 2020-07-21T21:25:56.989Z · LW · GW · 4 comments

This is a link post for https://arxiv.org/abs/2007.05558


"The Computational Limits of Deep Learning" by Neil C. Thompson, Kristjan Greenewald, Keeheon Lee, and Gabriel F. Manso


NB: This is a preprint that, as best I can tell, has not been peer-reviewed or accepted for publication, so more than usual you'll have to make your own judgment about the quality of the results.

Abstract:

Deep learning's recent history has been one of achievement: from triumphing over humans in the game of Go to world-leading performance in image recognition, voice recognition, translation, and other tasks. But this progress has come with a voracious appetite for computing power. This article reports on the computational demands of Deep Learning applications in five prominent application areas and shows that progress in all five is strongly reliant on increases in computing power. Extrapolating forward this reliance reveals that progress along current lines is rapidly becoming economically, technically, and environmentally unsustainable. Thus, continued progress in these applications will require dramatically more computationally-efficient methods, which will either have to come from changes to deep learning or from moving to other machine learning methods.

A few additional details: the authors survey ML papers to see how much compute was required to achieve their results, and extrapolate the trend lines to suggest we are nearing the limits of what is economically feasible under the current regime. They believe this implies we will have to become more efficient if we want continued progress, for example through more specialized and efficient hardware or through improved algorithms. My takeaway is that they believe most of the low-hanging fruit in ML has already been picked, and that additional gains in capabilities will not come as easily as past gains.
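To make the extrapolation concrete, here is a minimal sketch of the kind of calculation involved. This is my own illustration with made-up numbers, not the authors' code or data: fit a power law to published (compute, error) datapoints in log-log space, then solve for the compute, and hence the rough dollar cost, needed to reach a target error rate.

```python
import numpy as np

# Hypothetical (compute, error) datapoints standing in for published
# results: compute in GFLOPs, top-5 error as a fraction. These numbers
# are made up for illustration; they are not the paper's data.
compute = np.array([1e6, 1e8, 1e10, 1e12])
error = np.array([0.20, 0.12, 0.07, 0.045])

# Fit a power law, error = a * compute^(-b), by linear regression in
# log-log space: log(error) = log(a) - b * log(compute).
slope, intercept = np.polyfit(np.log10(compute), np.log10(error), 1)

# Extrapolate: how much compute would a 3% error rate require?
target_error = 0.03
compute_needed = 10 ** ((np.log10(target_error) - intercept) / slope)

# Convert to a rough dollar cost with an assumed price per GFLOP
# (illustrative only; real pricing varies widely).
dollars_per_gflop = 3e-8
print(f"compute needed: {compute_needed:.2e} GFLOPs")
print(f"rough cost: ${compute_needed * dollars_per_gflop:,.0f}")
```

The paper's regressions are more careful than this, but the shape of the argument is the same: because error rates fall only slowly with added compute, each further reduction in error multiplies the required compute, and therefore the cost, by a large factor.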

The straightforward implication for safety is that, if this is true, we are less near x-risk territory than it might appear if you look only at the "numerator" of the trend lines (what we can do) without considering the "denominator" (how much it costs). Not that we are necessarily dramatically far from x-risk territory with ML, mind you, only that x-risk is not obviously very near term, since the economic realities of deploying this technology will soon naturally slow progress absent significant effort or innovation.

4 comments

Comments sorted by top scores.

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2020-07-22T00:24:55.201Z · LW(p) · GW(p)

I believe Gwern had some harsh words for this paper (see below). I'd be interested to see a response from fans of the paper.

As I mentioned on Twitter, it's amazing that they wrote an entire paper trying to estimate performance scaling with compute, and ignored what looks like the entire literature doing actual controlled highly-precise experiments on scaling up fixed architectures (no citations to any of them that I could see) in favor of grabbing random datapoints from the overall literature.
Why should anyone pay any attention to their estimates, which are so unreliable and vague? Why would you do that and ignore (mini literature review follows): Sun et al 2017, Hestness et al 2017, Shallue et al 2018, McCandlish et al 2018, Rosenfeld et al 2019, Li et al 2020, Kaplan et al 2020, Roller et al 2020, Chen et al 2020a, Chen et al 2020b, Lepikhin et al 2020, and Huggingface 2020?
And the rest is not much better, like the de rigueur 'green' CO2 estimates (as if training DL actually emitted CO2, as if 'green' approaches aren't just doomed from the start as the most efficient NNs always start from research on the very large models they would like to rule out, as if large high-performance NNs aren't used in the real world in any way and do not replace even more CO2-intensive systems like, say, humans, as if CO2 costs are even the most important cost to begin with...). This isn't a paper that needs any extensive critique, let us say.
Replies from: weverka
comment by weverka · 2022-12-02T05:02:08.854Z · LW(p) · GW(p)

Gwern asks"Why would you do that and ignore (mini literature review follows):"  

Thompson did not ignore the papers Gwern cites.  A number of them are in Thompson's tables comparing prior work on scaling.  Did Gwern tweet this criticism without even reading Thompson's paper?

Replies from: gwern
comment by gwern · 2022-12-29T21:45:51.865Z · LW(p) · GW(p)

I did read it, and he did ignore them. Do you really think I criticized a paper publicly in harsh terms for not citing 12 different papers without even checking the bibliography or Ctrl-F-ing the titles/authors? Please look at the first 2020 paper version, which I was criticizing on 16 July 2020 when I wrote that comment, and don't lazily misread the version posted 2 years later on 27 July 2022, which, not being a time traveler, I obviously could not have read or have been referring to (and which may well have included those refs because of my comments there & elsewhere).

(Not that I am impressed by the round-2 material they tacked on, but at least now they acknowledge that prior scaling research exists and try to defend their very different approach at all.)

Replies from: weverka
comment by weverka · 2023-01-03T17:30:19.967Z · LW(p) · GW(p)

I stand corrected.  Please forgive me.