Reasons compute may not drive AI capabilities growth

post by Kythe · 2018-12-19T22:13:34.474Z · score: 46 (17 votes) · LW · GW · 10 comments

Contents

  There's many ways to train more efficiently that aren't widely used
  Hyperparameter grid searches are inefficient
  The types of compute we need may not improve very quickly
    Machine learning accelerators
    CPUs
    GPU/accelerator memory
  Limited ability to exploit parallelism
  Conclusion

How long it will be before humanity is capable of creating general AI is an important factor in discussions of the importance of doing AI alignment research as well as discussions of which research avenues have the best chance of success. One frequently discussed model for estimating AI timelines is that AI capabilities progress is essentially driven by growing compute capabilities. For example, the OpenAI article on AI and Compute presents a compelling narrative, which shows a trend of well-known results in machine learning using exponentially more compute over time. This is an interesting model because if valid we can do some quantitative forecasting, due to somewhat smooth trends in compute metrics which can be extrapolated. However, I think there are a number of reasons to suspect AI progress to be driven more by engineer and researcher effort than compute.

I think there's a spectrum of models between:

My research hasn't pointed too solidly in either direction, but below I discuss a number of the reasons I've thought of that might point towards compute not being a significant driver of progress right now.

There's many ways to train more efficiently that aren't widely used

Starting October of 2017, the Stanford DAWNBench contest challenged teams to come up with the fastest and cheapest ways to train neural nets to solve certain tasks.

The most interesting was the ImageNet training time contest. The baseline entry took 10 days and cost $1112; less than one year later the best entries (all by the fast.ai team) were down to 18 minutes for $35, 19 minutes for $18 or 30 minutes for $14[^1]. This is ~800x faster and ~80x cheaper than the baseline.
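The ratios above can be checked with a few lines of arithmetic. This is only a sketch using the figures quoted in this post. Note that the fastest and the cheapest entries are different submissions, so the ~800x and ~80x figures do not describe a single run.

```python
# Figures quoted in the post: baseline vs. the best DAWNBench ImageNet entries.
baseline_minutes = 10 * 24 * 60   # baseline took 10 days
baseline_cost = 1112.0            # baseline cost in dollars

# (training minutes, cost in dollars) for the three fast.ai entries
entries = [(18, 35.0), (19, 18.0), (30, 14.0)]

# Best speedup and best cost reduction, each taken over all entries.
best_speedup = max(baseline_minutes / minutes for minutes, _ in entries)
best_cost_reduction = max(baseline_cost / cost for _, cost in entries)

print(f"fastest entry: about {best_speedup:.0f}x faster than the baseline")
print(f"cheapest entry: about {best_cost_reduction:.0f}x cheaper than the baseline")
```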

Some of this was just using more and better hardware: the winning team used 128 V100 GPUs for 18 minutes and 64 for 19 minutes, versus eight K80 GPUs for the baseline. However, substantial improvements were made even on the same hardware. The training time on a p3.16xlarge AWS instance with eight V100 GPUs went down from 15 hours to 3 hours in 4 months. The training time on a single Google Cloud TPU went down from 12 hours to 3 hours as the Google Brain team tuned their training and incorporated ideas from the fast.ai team. An even larger improvement was seen recently in the CIFAR10 contest, where times on a p3.2xlarge improved by 60x, and the accompanying blog series still mentions multiple improvements left on the table due to effort constraints. Its author also speculates that many of the optimizations would improve the ImageNet version as well.

The main techniques used for fast training were all known techniques: progressive resizing, mixed precision training, removing weight decay from batchnorms, scaling up batch size in the middle of training, and gradually warming up the learning rate. They just required engineering effort to implement and weren't already implemented in the library defaults.
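To give a sense of how small some of these techniques are in code, here is a minimal sketch of gradual learning rate warm-up. The function name, the linear ramp, and the example values are assumptions chosen for illustration. They are not taken from the fast.ai or Google Brain entries.

```python
def warmup_lr(step, base_lr, warmup_steps):
    """Linearly ramp the learning rate up to base_lr over warmup_steps steps.

    Hypothetical helper: real schedules usually follow the warm-up with
    some form of decay, which is omitted here.
    """
    if warmup_steps <= 0 or step >= warmup_steps:
        return base_lr
    return base_lr * (step + 1) / warmup_steps


# Illustrative values only: ramp to 0.1 over the first 5 steps, then hold.
schedule = [warmup_lr(step, base_lr=0.1, warmup_steps=5) for step in range(8)]
print(schedule)
```

The effort is less in writing such a function than in wiring it, together with the other techniques, into a training loop that library defaults do not provide.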

Similarly, the improvement due to scaling from eight K80s to many machines with V100s was partially hardware but also required lots of engineering effort to implement: using mixed precision fp16 training (required to take advantage of the V100 Tensor Cores), efficiently using the network to transfer data, implementing the techniques required for large batch sizes, and writing software for supervising clusters of AWS spot instances.

These results seem to show that it's possible to train much faster and cheaper by applying knowledge and sufficient engineering effort. Interestingly, not even a team at Google Brain working to show off TPUs initially had all the code and knowledge required to get the best available performance; they had to gradually work for it.

I would suspect that in a world where we were bottlenecked hard on training times, these techniques would be more widely known and applied, and implementations of them would be readily available for every major machine learning library. Interestingly, in postscripts to both of his articles on how fast.ai managed to achieve such fast times, Jeremy Howard notes that he doesn't believe large amounts of compute are required for important ML research, and that many foundational discoveries were available with little compute.

[^1]: Using spot/preemptible instance pricing instead of the on-demand pricing the benchmark page lists, due to much lower prices and the lack of need for on-demand instances given the short time. The authors of the winning solution wrote software to effectively use spot instances and actually used them for their tests. It may seem unfair to use spot prices for the winning solution but not for the baseline, but a lot of the improvement in the contest came from actually using all the techniques for faster/cheaper training available despite inconvenience, and they had to write software to easily use spot instances and had short enough training times that it was viable without fancy software to automatically transfer training to new machines.

Hyperparameter grid searches are inefficient

I've heard hyperparameter grid searches mentioned as a reason why ML research needs way more compute than it would appear based on the training time of the models used. However, I can also see the use of grid searches as evidence of an abundance of compute rather than a scarcity.

As far as I can tell it's possible to find hyperparameters much more efficiently than with a grid search; it just takes more human time and engineering implementation effort. There's a large literature on more efficient hyperparameter search methods, but as far as I can tell they aren't very popular (I've never heard of anyone using one in practice, and all the open source implementations of these kinds of things I can find have few GitHub stars).
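One of the simplest alternatives in that literature is random search. The sketch below is only an illustration, with hypothetical hyperparameter names and ranges. It shows one commonly given reason random search can beat a grid at the same budget: nine grid trials try only three distinct learning rates, while nine random trials try nine.

```python
import itertools
import math
import random


def grid_candidates(learning_rates, weight_decays):
    """Every combination of the listed values (a standard grid)."""
    return list(itertools.product(learning_rates, weight_decays))


def random_candidates(n, lr_range, wd_range, seed=0):
    """n independent draws, each log-uniform within the given (low, high) range."""
    rng = random.Random(seed)

    def log_uniform(low, high):
        return 10 ** rng.uniform(math.log10(low), math.log10(high))

    return [(log_uniform(*lr_range), log_uniform(*wd_range)) for _ in range(n)]


# Hypothetical search space, same budget of nine trials for both methods.
grid = grid_candidates([1e-3, 1e-2, 1e-1], [1e-5, 1e-4, 1e-3])
rand = random_candidates(9, lr_range=(1e-3, 1e-1), wd_range=(1e-5, 1e-3))

distinct_lrs_grid = len({lr for lr, _ in grid})
distinct_lrs_random = len({lr for lr, _ in rand})
print(distinct_lrs_grid, distinct_lrs_random)
```

If only one of the two hyperparameters matters much for a given problem, the random trials explore it far more finely for the same compute.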

Researcher Leslie Smith also has a number of papers with little-used ideas on principled approaches to choosing and searching for optimal hyperparameters with much less effort, including a fast automatic procedure for finding optimal learning rates. This suggests that it's possible to substitute hyperparameter search time for more engineering, human decision-making and research effort.

There's also likely room for improvement in how we factorize the hyperparameters we use, so that they're more amenable to separate optimization. For example, L2 regularization is usually used in place of weight decay because the two theoretically do the same thing. But this paper points out that they do not do the same thing with Adam, and that using weight decay causes Adam to surpass the more popular SGD with momentum in practice. It also argues that weight decay is the better hyperparameter, since the optimal weight decay is more independent of the learning rate than the optimal L2 regularization strength is.
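A rough way to see why the two differ under Adam is to write one update step by hand. The sketch below is a simplified scalar Adam step, not the paper's code or any library's implementation, and the argument names are made up for illustration. With a zero loss gradient, the L2 term is rescaled by Adam's normalization, so its first step is about `lr` regardless of the L2 strength, whereas decoupled weight decay shrinks the weight in proportion to its coefficient.

```python
import math


def adam_step(w, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
              l2=0.0, decoupled_wd=0.0):
    """One Adam update on a scalar weight (illustrative sketch).

    l2:           adds l2 * w to the gradient, so it passes through Adam's scaling.
    decoupled_wd: shrinks the weight directly, outside the adaptive step.
    """
    g = grad + l2 * w
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * g
    state["v"] = beta2 * state["v"] + (1 - beta2) * g * g
    m_hat = state["m"] / (1 - beta1 ** state["t"])   # bias-corrected first moment
    v_hat = state["v"] / (1 - beta2 ** state["t"])   # bias-corrected second moment
    adaptive_update = lr * m_hat / (math.sqrt(v_hat) + eps)
    return w - adaptive_update - lr * decoupled_wd * w


def first_step_shrink(**kwargs):
    """How much a weight of 1.0 shrinks in one step when the loss gradient is zero."""
    state = {"t": 0, "m": 0.0, "v": 0.0}
    return 1.0 - adam_step(1.0, 0.0, state, **kwargs)


l2_weak = first_step_shrink(l2=0.01)
l2_strong = first_step_shrink(l2=0.1)
wd_weak = first_step_shrink(decoupled_wd=0.01)
wd_strong = first_step_shrink(decoupled_wd=0.1)
print(l2_weak, l2_strong)   # nearly identical despite a 10x change in strength
print(wd_weak, wd_strong)   # differ by about 10x
```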

All of this suggests that most researchers might be operating under an abundance of cheap compute relative to their problems, which leads them to skip the effort required to optimize their hyperparameters efficiently and instead do so haphazardly or with grid searches.

The types of compute we need may not improve very quickly

Improvements in computing hardware are not uniform and there are many different hardware attributes that can be bottlenecks for different things. AI progress may rely on one or more of these that don't end up improving quickly, becoming bottlenecked on the slowest one rather than experiencing exponential growth.

Machine learning accelerators

Modern machine learning is largely composed of large operations that are either directly matrix multiplies or can be decomposed into them. It's also possible to train using much lower precision than full 32-bit floating point using some tricks. This allows the creation of specialized training hardware like Google's TPUs and Nvidia Tensor Cores. A number of other companies have also announced they're working on custom accelerators.

The first generation of specialized hardware delivered a large one-time improvement, but we can also expect continuing innovation in accelerator architecture. There will likely be sustained innovations in training with different number formats and architectural optimizations for faster and cheaper training. I expect this will be the area where our compute capability grows the most, but growth may flatten like it has for CPUs once we figure out enough of the easily discoverable improvements.

CPUs

Reinforcement learning simulations like the OpenAI Five DOTA bot, and various physics playgrounds, often use CPU-heavy serial simulations. OpenAI Five uses 128,000 CPU cores and only 256 GPUs. At current Google Cloud preemptible prices the CPUs cost 5-10x more than the GPUs in total. Improvements in machine learning training ability will still leave the large cost of the CPUs. If the use of expensive simulations that run best on CPUs becomes an important part of training advanced agents, progress may become bottlenecked on CPU cost.

Additionally, improvement in CPU compute costs may be slowing. Cloud CPU costs only decreased 45% from 2012 to 2017, and performance per dollar for buying the hardware only improved 2x. Google Cloud Compute prices only dropped 25% from 2014 to 2018, although the introduction of preemptible pricing at 30% of full price in 2016 was a big improvement, and that decreased to 20% of full price in 2017.
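To compare these figures across periods of different lengths, it can help to convert each into a constant yearly rate. The sketch below assumes smooth compounding over each period, which the underlying price data need not follow. Under that assumption the numbers work out to roughly an 11% yearly cost decline for cloud CPUs, about a 15% yearly gain in performance per dollar, and about a 7% yearly decline in Google Cloud prices.

```python
def annualized_ratio(total_ratio, years):
    """Constant yearly ratio that compounds to total_ratio over the period."""
    return total_ratio ** (1.0 / years)


# Figures quoted in the post; smooth compounding is an assumption.
cloud_cpu_cost = annualized_ratio(1 - 0.45, 2017 - 2012)   # 45% cheaper over 5 years
perf_per_dollar = annualized_ratio(2.0, 2017 - 2012)       # 2x better over 5 years
gcp_price = annualized_ratio(1 - 0.25, 2018 - 2014)        # 25% cheaper over 4 years

print(f"cloud CPU cost: {100 * (1 - cloud_cpu_cost):.1f}% cheaper per year")
print(f"hardware performance per dollar: {100 * (perf_per_dollar - 1):.1f}% better per year")
print(f"Google Cloud price: {100 * (1 - gcp_price):.1f}% cheaper per year")
```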

GPU/accelerator memory

Another scarce resource is memory on the GPU/accelerator used for training. The memory must be large enough to store all the model parameters, the input, the gradients, and other optimization parameters.
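A back-of-the-envelope estimate shows how quickly this adds up. The sketch below assumes 32-bit values and an Adam-style optimizer that keeps two extra values per parameter. It deliberately ignores activations and inputs, which often dominate in practice. The 340 million parameter count is used as an illustrative size, roughly that reported for the large BERT model.

```python
def training_state_bytes(num_params, bytes_per_value=4, optimizer_slots=2):
    """Bytes needed for weights, gradients and per-parameter optimizer state.

    Assumptions: one gradient value per parameter, and `optimizer_slots`
    extra values per parameter (2 for Adam's two moment estimates).
    Activations and input batches are not counted.
    """
    copies_per_param = 1 + 1 + optimizer_slots   # weights + gradients + optimizer
    return num_params * bytes_per_value * copies_per_param


state_bytes = training_state_bytes(340_000_000)
print(f"{state_bytes / 2**30:.1f} GiB before any activations are stored")
```

The remaining device memory has to hold the activations for the batch, which is why batch size and sequence length are usually the first things cut when a model does not fit.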

This is one of the most frequent limits I see referenced in machine learning papers nowadays. For example the new large BERT language model can only be trained properly on TPUs with their 64GB of RAM. The Glow paper needs to use gradient checkpointing and an alternative to batchnorm so that they can use gradient accumulation, because only a single sample of gradients fits on a GPU.

However, there are ways to address this limitation that aren't frequently used. Glow already uses the two best ones, gradient checkpointing and gradient accumulation, but did not implement an optimization they mentioned which would make the amount of memory the model takes constant in the number of layers instead of linear, likely because it would be difficult to engineer into existing ML frameworks. The BERT implementation uses none of these techniques because it just uses a TPU with enough memory. In fact, a reimplementation of BERT implemented 3 such techniques and got it to fit on a GPU. Thus it seems that in a world with less RAM these results might still have happened, just with more difficulty or smaller demonstration models.
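Gradient accumulation is the easier of the two to illustrate. The sketch below uses a toy one-parameter linear model with made-up data, not anything from Glow or BERT. It shows that summing size-weighted gradients over micro-batches reproduces the full-batch gradient, so only one micro-batch has to fit in memory at a time.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean squared error with respect to w for the model y = w * x."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Accumulate micro-batch gradients, weighted by size, into a full-batch gradient."""
    n = len(xs)
    total = 0.0
    for start in range(0, n, micro_batch_size):
        micro_xs = xs[start:start + micro_batch_size]
        micro_ys = ys[start:start + micro_batch_size]
        # Weight each micro-batch mean by its share of the full batch.
        total += grad_mse(w, micro_xs, micro_ys) * len(micro_xs) / n
    return total


# Toy data for illustration only.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

full = grad_mse(0.5, xs, ys)
accum = accumulated_grad(0.5, xs, ys, micro_batch_size=1)
print(full, accum)
```

The equivalence relies on each sample's gradient being independent of the other samples in the batch. Batchnorm breaks that, since its statistics depend on the whole batch, which is consistent with Glow needing an alternative to batchnorm.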

Interestingly, the maximum available RAM per device barely changed from 2014 through 2017 with the NVIDIA K80's 24GB, but then shot up in 2018 to 48GB with the RTX 8000 as well as the 64GB TPU v2 and 128GB TPU v3. This was probably due both to demand for larger device memories for machine learning training and to the availability of high-capacity HBM memory. It's unclear to me if this rapid rise will continue or if it was mostly a one-time change reflecting new demands for the largest possible memories reaching the market.

It's also possible that per-device memory will cease to be a constraint on model size due to faster hardware interconnects that allow sharing a model across the memory of multiple devices like Intel's Nervana and Tensorflow Mesh plan to do. It also seems likely that techniques for splitting models across devices to fit in memory, like the original AlexNet did, will become more popular. It may be the case that the fact that we don't split models across devices like AlexNet anymore is evidence that we're not constrained by RAM much but I'm not sure.

Limited ability to exploit parallelism

As discussed extensively in a new paper from Google Brain, there seems to be a limit on how much data parallelism in the form of larger batch sizes we can currently extract out of a given model. If this constraint isn't worked around, wall time to train models could stall even if compute power continues to grow.

However, the paper mentions that various things like model architecture and regularization affect this limit, and I think it's pretty likely that techniques to increase this limit will continue to be discovered, so that it isn't a bottleneck. A newer paper by OpenAI finds that more difficult problems also tolerate larger batch sizes. Even if the limit remains, increasing compute would allow training more models in parallel, potentially just meaning that more parameter search and evolution gets layered on top of the training. I also suspect that just using ever-larger models may allow use of more compute without increasing batch sizes.

At the moment, it seems that we know how to train effectively with batch sizes large enough to saturate large clusters, for example this paper about training ImageNet in 7 minutes with a 64k batch size. But this requires extra tuning and implementing some tricks, even just to train on mid-size clusters, so as far as I know only a small fraction of all machine learning researchers regularly train on large clusters (this is anecdotal, and I'm uncertain about it).

Conclusion

These all seem to point towards compute being abundant and ideas being the bottleneck, but not solidly. For the points about training efficiency and grid searches this could just be an inefficiency in ML research and all the major AGI progress will be made by a few well-funded teams at the boundaries of modern compute that have solved these problems internally.

Vaniver [LW · GW] commented on a draft of this post that it's interesting to consider the case where training time is the bottleneck rather than ideas, but massive engineering effort is highly effective at reducing training time. In this case an increase in investment in AI research which led to hiring more engineers to apply techniques to speed up training could lead to rapid progress. This world might also lead to more sizable differences in capabilities between organizations, if large, somewhat serial software engineering investments are required to make use of the most powerful techniques, rather than a well-funded newcomer being able to just read papers and buy all the necessary hardware.

The course of various compute hardware attributes seems uncertain both in terms of how fast they'll progress and whether or not we'll need to rely on anything other than special-purpose accelerator speed. Since the problem is complex with many unknowns, I'm still highly uncertain, but all of these points did move me to varying degrees in the direction of continuing compute growth not being a driver of dramatic progress.

Thanks to Vaniver [LW · GW] and Buck Shlegeris for discussions that led to some of the thoughts in this post.

10 comments

Comments sorted by top scores.

comment by rohinmshah · 2018-12-24T10:19:49.594Z · score: 17 (6 votes) · LW · GW

I think the evidence in the first part suggesting an abundance of compute is mostly explained by the fact that academics expect that we need ideas and algorithmic breakthroughs rather than simply scaling up existing algorithms, so you should update on that fact rather than this evidence which is a downstream effect. If we condition on AGI requiring new ideas or algorithms, I think it is uncontroversial that we do not require huge amounts of compute to test out these new ideas.

The "we are bottlenecked on compute" argument should be taken as a statement about how to advance the state of the art in big unsolved problems in a sufficiently general way (that is, without encoding too much domain knowledge). Note that ImageNet is basically solved, so it does not fall in this category. At this point, it is a "small" problem and it's reasonable to say that it has an overabundance of compute, since it requires four orders of magnitude less compute than AlphaGo (and probably Dota). For the unsolved general problems, I do expect that researchers do use efficient training tricks where they can find them, and they probably optimize hyperparameters in some smarter way. For example, AlphaGo's hyperparameters were trained via Bayesian optimization.

Particular narrow problems can be solved by adding domain knowledge, or applying an existing technique that no one had bothered to do before. Particular new ideas can be tested by building simple environments or datasets in which those ideas should work. It's not surprising that these approaches are not bottlenecked on compute.

The evidence in the first part can be explained as follows, assuming that researchers are focused on testing new ideas:

  • New ideas can often be evaluated in small, simple environments that do not require much compute.
  • Any trick that you apply makes it harder to tell what effect your idea is having (since you have to disentangle it from the effect of the trick).
  • Many tricks do not apply in the domain that the new idea is being tested in. Supervised learning has a bunch of tricks that now seem to work fairly robustly, but this is not so with reinforcement learning.
Jeremy Howard notes that he doesn't believe large amounts of compute are required for important ML research, and notes that many foundational discoveries were available with little compute.

I would assume that Jeremy Howard thinks we are bottlenecked on ideas.

For the points about training efficiency and grid searches this could just be an inefficiency in ML research and all the major AGI progress will be made by a few well-funded teams at the boundaries of modern compute that have solved these problems internally.

This seems basically right. I'd note that there can be a balance, so it's not clear that this is an "inefficiency" -- you could believe that any actual AGI will be developed by well-funded teams like you describe, but they will use some ideas that were developed by ML research that doesn't require huge amounts of compute. It still seems consistent to say "compute is a major driver of progress in AI research, and we are bottlenecked on it".

comment by waveman · 2018-12-21T08:57:40.890Z · score: 10 (5 votes) · LW · GW

Suggestion to test your theory: Look at the best AI results of the last 2 years and try to run them / test them in a reasonable time on a computer that was affordable 10 years ago.

My own opinion is that hardware capacity has been a huge constraint in the past. We are moving into an era where it is less of a problem. But, I think, still a problem. Hardware limitations infect and limit your thinking in all sorts of ways and slow you down terribly.

To take an example from my own work: I have a problem that needs about 50GB RAM to test efficiently. Otherwise it does not fit in memory and the run time is 100X slower.

I had the option to spend 6 months maybe finding a way to squeeze it into 32GB. Or, what I did: spend a few thousand on a machine with 128GB RAM. To run in 1GB RAM would have been a world of pain, maybe not doable in the time I have to work on it.

comment by abramdemski · 2018-12-27T16:38:19.289Z · score: 6 (3 votes) · LW · GW

I enjoyed the discussion. My own take is that this view is likely wrong.

  • The "many ways to train that aren't widely used" is evidence for alternatives which could substitute for a certain amount of hardware growth, but I don't see it as evidence that hardware doesn't drive growth.
  • My impression is that alternatives to grid search aren't very popular because alternatives don't really work reliably. Maybe this has changed and people haven't picked up on it yet. Or maybe alternatives take more effort than they're worth.

The fact that these things are fairly well known and still not used suggests that it is cheaper to pick up more compute rather than use them. You discuss these things as evidence that computing power is abundant. I'm not sure how to quantify that. It seems like you mean for "computing power is abundant" to be an argument against "computing power drives progress".

  • "computing power is abundant" could mean that everyone can run whatever crazy idea they want, but the hard part is specifying something which does something interesting. This is quite relative, though. Computing power is certainly abundant compared to 20 years ago. But, the fact that people pay a lot for computing power to run large experiments means that it could be even more abundant than it is now. And, we can certainly write down interesting things which we can't run, and which would produce more intelligent behavior if only we could.
  • "computing power is abundant" could mean that buying more computing power is cheaper in comparison to a lot of low-hanging-fruit optimization of what you're running. This seems like what you're providing evidence for (on my interpretation -- I'm not imagining this is what you intend to be providing evidence for). This to me sounds like an argument that computing power drives progress: when people want to purchase capability progress, they often purchase computing power.

I do think that your observations suggest that computing power can be replaced by engineering, at least to a certain extent. So, slower progress on faster/cheaper computers doesn't mean correspondingly slower AI progress; only somewhat slower.

comment by Vaniver · 2018-12-20T01:12:08.056Z · score: 5 (3 votes) · LW · GW

Elaborating on my comment (on the world where training time is the bottleneck, and engineers help):

To the extent major progress and flashy results are dependent on massive engineering efforts, this seems like it lowers the portability of advances and makes it more difficult for teams to form coalitions. [Compare to a world where you just have to glue together different conceptual advances, and so you plug one model into another and are basically done.] This also means we should think about how progress happens in other fields with lots of free parameters that are sort of optimized jointly--semiconductor manufacturing is the primary thing that comes to mind, where you have about a dozen different fields of engineering that are all constrained by each other and the joint tradeoffs are sort of nightmarish to behold or manage. [Subfield A would be much better off if we switched from silicon to germanium, but everyone else would scream--but perhaps we'll need to switch eventually anyway.] The more bloated all of these projects become, the harder it is to do fundamental reimaginings of how these things work (a favorite example of mine here is replacing matmuls in neural networks with bitshifts, also known as "you only wanted the ability to multiply by powers of 2, right?", which seems like it is ludicrously more efficient and is still pretty trainable, but requires thinking about gradient updates differently, and the more effort you've put into optimizing how you pipe gradient updates around, the harder it is to make transitions like that).

This is also possibly quite relevant to safety; if it's hard to 'tack on safety' at the end, then it's important we start with something safe and then build a mountain of small improvements for it, rather than building the mountain of improvements for something that turns out to be not safe and then starting over.

comment by Vaniver · 2018-12-20T01:21:46.079Z · score: 4 (2 votes) · LW · GW

When it comes to the 'ideas' vs. 'compute' spectrum:

It seems to me like one of the main differences (but probably not the core one?) is whether it seems predictable whether something will work. Suppose Alice thinks that it's hard to come up with something that works, but things that look like they'll work do with pretty high probability, and suppose Bob thinks it's easy to see lots of things that might work, but things that might work rarely do; I think Alice is more likely to think we're ideas-limited (since if we had a textbook from the future, we could just code it up and train it real quick) and Bob is more likely to think we're compute-limited (since our actual progress is going to look much more like ruling out all of the bad ideas that are in between us and the good ideas, and the more computational experiments we can run, the faster that process can happen).

I tend to be quite close to the end of the 'ideas' spectrum, tho the issue is pretty nuanced and mixed.

I think one of the things that's interesting to me is not how much training time can be optimized, but 'model size'--what seems important is not whether our RL algorithm can solve a double-pendulum lightning-quick but whether we can put the same basic RL architecture into an octopus's body and have it figure out how to control the tentacles quickly. If the 'exponential effort to get linear returns' story is true, even if we're currently not making the most of our hardware, gains of 100x in utilization of hardware only turn into 2 higher steps in the return space. I think the primary thing that inclines me towards the 'ideas will drive progress' view is that if there's a method that's exponential effort to linear returns and another method that's, say, polynomial effort to linear returns, the second method should blow past the exponential one pretty quickly. (Even something that reduces the base of the exponent would be a big deal for complicated tasks.)

If you go down that route, then I think you start thinking a lot about the efficiency of other things (like how good human Go players are at turning games into knowledge) and what information theory suggests about strategies, and so on. And you also start thinking about how close we are--for a lot of these things, just turning up the resources plowed into existing techniques can work (like beating DotA) and so it's not clear we need to search for "phase change" strategies first. (Even if you're interested in, say, something like curing cancer, it's not clear whether continuing improvements to current NN-based molecular dynamics predictors, causal network discovery tools, and other diagnostic and therapeutic aids will get to the finish line first as opposed to figuring out how to build robot scientists and then putting them to work on curing cancer.)

comment by paulfchristiano · 2018-12-20T01:45:40.564Z · score: 2 (1 votes) · LW · GW
Some of this was just using more and better hardware, the winning team used 128 V100 GPUs for 18 minutes and 64 for 19 minutes, versus eight K80 GPUs for the baseline. However, substantial improvements were made even on the same hardware. The training time on a p3.16xlarge AWS instance with eight V100 GPUs went down from 15 hours to 3 hours in 4 months.

Was the original 15 hour time for fp16 training, or fp32?

(A factor of 5 in a few months seems plausible, but before updating on that datapoint it would be good to know if it's just from switching to tensor cores which would be a rather different narrative.)

comment by Kythe · 2018-12-20T02:26:49.772Z · score: 3 (2 votes) · LW · GW

I just checked and it seems it was fp32. I agree this makes it less impressive; I forgot to check that originally. I still think this somewhat counts as a software win, because working fp16 training required a bunch of programmer effort to take advantage of the hardware, just like optimization to make better use of cache would.

However, there's also a different set of same-machine datapoints available in the benchmark, where training time on a single Cloud TPU v2 went down from 12 hours 30 minutes to 2 hours 44 minutes, which is a 4.5x speedup similar to the 5x achieved on the V100. The Cloud TPU was special-purpose hardware being trained with bfloat16 from the start, so that's a similar magnitude improvement more clearly due to software. The history shows incremental progress down to 6 hours and then a 2x speedup once the fast.ai team published and the Google Brain team incorporated their techniques.

comment by paulfchristiano · 2018-12-20T21:21:16.404Z · score: 11 (3 votes) · LW · GW

I think that fp32 -> fp16 should give a >5x boost on a V100, so this 5x improvement still probably hides some inefficiencies when running in fp16.

I suspect the initial 15 -> 6 hour improvement on TPUs was also mostly dealing with low hanging fruit and cleaning up various inefficiencies from porting older code to a TPU / larger batch size / etc. It seems plausible the last factor of 2 is more of a steady state improvement, but I don't know.

My take on this story would be: "Hardware has been changing rapidly, giving large speedups, and at the same time people have been scaling up to larger batch sizes in order to spend more money. Each time hardware or scale changes, old software is poorly adapted, and it requires some engineering effort to make full use of the new setup." On this reading, these speedups don't provide as much insight into whether future progress will be driven by hardware.

comment by Kythe · 2018-12-20T22:51:03.813Z · score: 1 (1 votes) · LW · GW

I went and checked and as far as I can tell they used the same 1024 batch size for the 12 and 6 hour time. The changes I noticed were better normalization, label smoothing, a somewhat tweaked input pipeline (not sure if optimization or refactoring) and updating Tensorflow a few versions (plausibly includes a bunch of hardware optimizations like you're talking about).

The things they took from fast.ai for the 2x speedup were training on progressively larger image sizes, and the better triangular learning rate schedule. Separately, for their later submissions, which don't include a single-GPU figure, fast.ai came up with better methods of cropping and augmentation that improve accuracy. I don't necessarily think the pace of 2x speedups through clever ideas is sustainable; lots of the fast.ai ideas seem to be pretty low hanging fruit.

I basically agree with the quoted part of your take, just that I don't think it explains enough of the apathy towards training speed that I see, although I think it might more fully explain the situation at OpenAI and DeepMind. I'm making more of a revealed preferences efficient markets kind of argument where I think the fact that those low hanging fruits weren't picked and aren't incorporated into the vast majority of deep learning projects suggests that researchers are sufficiently un-constrained by training times that it isn't worth their time to optimize things.

Like I say in the article though, I'm not super confident and I could be underestimating the zeal for faster training because of sampling error of what I've seen, read and thought of, or it could just be inefficient markets.

comment by Kythe · 2018-12-20T19:25:46.342Z · score: 1 (1 votes) · LW · GW

A relevant paper came out 3 days ago talking about how AlphaGo used Bayesian hyperparameter optimization and how that improved performance: https://arxiv.org/pdf/1812.06855v1.pdf

It's interesting to set the OpenAI compute article's graph to linear scale so you can see that the compute that went into AlphaGo utterly dwarfs everything else. It seems like DeepMind is definitely ahead of nearly everyone else on the engineering effort and money they've put into scaling.