We have achieved Noob Gains in AI

post by phdead · 2022-05-18T20:56:49.143Z · LW · GW · 20 comments

Contents

  What has changed in AI research in the past three years?
  Why and how has it changed?
    Hardware advances
    Software advances
    Research innovations
  How have those underlying factors changed in the past three years?
  Conclusion.

TL;DR I explain why I think AI research has been slowing down, not speeding up, in the past few years.

How have your expectations for the future of AI research changed in the past three years? Based on recent posts in this forum, it seems that results in text generation, protein folding, image synthesis, and other fields have accomplished feats beyond what was thought possible. From a bird's eye view, it seems as though the breakneck pace of AI research is already accelerating exponentially, which would make the safe bet on AI timelines quite short.

This way of thinking misses the reality on the front lines of AI research. Innovation is stalling beyond just throwing more computation at the problem, and the forces that made scaling computation cheaper or more effective are slowing. The past three years of AI results have been dominated by wealthy companies throwing very large models at novel problems. While this expands the economic impact of AI, it does not accelerate AI development.

To figure out whether AI development is actually accelerating, we need to answer a few key questions:

  1. What has changed in AI in the past three years?
  2. Why has it changed, and what factors have allowed that change?
  3. How have those underlying factors changed in the past three years?

By answering these fundamental questions, we can get a better understanding of how we should expect AI research to develop over the near future. And maybe along the way, you'll learn something about lifting weights too. We shall see.

What has changed in AI research in the past three years?

Gigantic models have achieved spectacular results on a large variety of tasks.

How large is the variety of tasks? In terms of domain area, quite varied. Advances have been made in major hard-science problems like protein folding, imaginative tasks like creating images from descriptions, and complex games like StarCraft.

How large is the variety of models used? While each model features many domain-specific components and training tricks, the core of every one of them is a giant transformer trained with a variant of gradient descent, usually Adam.
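To make "a giant transformer trained with Adam" concrete, here is a minimal, toy-scale sketch of that shared recipe in PyTorch. The sizes and data below are made up; real systems differ in almost every detail except this core loop.

    import torch
    import torch.nn as nn

    vocab_size, d_model, seq_len, batch = 1000, 128, 32, 8   # toy sizes

    # A tiny "transformer + linear head", the shared skeleton of the big models.
    model = nn.Sequential(
        nn.Embedding(vocab_size, d_model),
        nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True),
            num_layers=2,
        ),
        nn.Linear(d_model, vocab_size),
    )
    optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
    loss_fn = nn.CrossEntropyLoss()

    tokens = torch.randint(0, vocab_size, (batch, seq_len))    # fake input tokens
    targets = torch.randint(0, vocab_size, (batch, seq_len))   # fake targets

    logits = model(tokens)                                     # (batch, seq_len, vocab)
    loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                           # one Adam step
    print(f"loss after one step: {loss.item():.3f}")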

How large are these models? That depends. DALL-E 2 and AlphaFold are O(10GB), AlphaStar is O(1GB), and the current state-of-the-art few-shot NLP models (Chinchilla) are O(100GB).
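One way to read those O(...) figures: the weights alone take roughly parameters times bytes per parameter. A rough sketch of that arithmetic follows; the 16-bit precision is my assumption, and Chinchilla's ~70B parameter count is the published figure.

    # Weights alone take roughly parameters * bytes per parameter.
    def size_gb(n_params: float, bytes_per_param: int = 2) -> float:
        # 2 bytes per parameter assumes 16-bit weights (my assumption).
        return n_params * bytes_per_param / 1e9

    print(f"Chinchilla, ~70e9 params: ~{size_gb(70e9):.0f} GB")   # ~140 GB, i.e. O(100GB)
    print(f"A 5e9-param model:        ~{size_gb(5e9):.0f} GB")    # ~10 GB,  i.e. O(10GB)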

One of the most consistent findings of the past decade of AI research is that larger models trained with more data get better results, especially transformers. If all of these models are built on top of the same underlying architecture, why is there so much variation in size?

Think of training models like lifting weights. What limits your ability to lift heavy weights? Raw strength, of course, but also your form and whether the exercise lets you engage your biggest muscles at all.

Looked at this way, what has changed over the past three years? In short, we have discovered how to adapt one training method/exercise (the transformer) to a variety of use cases. This exercise lets us engage our big muscles (scalable hardware and software optimized for transformers). Sure, some of these applications are more efficient than others, but overall they are far more efficient than the approaches they were competing against. We have used this change in paradigm to "lift more weight," increasing the size and training cost of our models to achieve more impressive results.

(Think about how AlphaFold2 and DALL-E 2, despite mostly being larger versions of their predecessors, drew far more attention than those predecessors ever did. The prior work paved the way by figuring out how to use transformers to solve these problems; the attention came when the solutions were scaled up to achieve eye-popping results. In our weightlifting analogy, we are learning a variation of an exercise. The hard part is learning the form that lets you leverage the same muscles, but the impressive-looking part is adding a lot of weight.)

Why and how has it changed?

In other words: why are we only now training gigantic models and getting impressive results?

There are many reasons for this, but the most important one is that no one had the infrastructure to train models of this size efficiently before.

Hardware advances

The modern deep learning / AI craze started in 2012, when a neural network called AlexNet won the ImageNet challenge. The architecture used a convolutional neural network, a method that had been invented some 25 years prior and deemed too impractical to use. What changed?

The short answer? GPUs happened. Modern graphics applications had made specialized hardware for linear algebra cheap and commercially available. Chips had gotten almost a thousand times faster over that period, following Moore's law. Combined with myriad other computing advances in areas such as memory and network interfaces, it might not be a stretch to say that the GPUs AlexNet ran on were a million times better for convnets than what had been available when convnets were invented.

As the craze took off, NVIDIA started optimizing their GPUs more and more for deep learning workloads across the stack. GPUs were given more memory to hold larger models and more intermediate computations, faster interconnects to leverage multiple GPUs at once, and more optimized primitives through CUDA and cuDNN. This enabled much larger models, but by itself would not have allowed for the absolutely giant models we see now.

Software advances

In the old days of deep learning, linear algebra operations had to be done by hand... well, not really, but programming was a pain and the resulting programs used resources poorly. Switching to a GPU was a nightmare, and trying to use multiple GPUs would make a pope question their faith in God. Then along came Caffe, then TensorFlow, then PyTorch, and suddenly training was so easy that any moron with an internet connection (me!) could use deep learning without understanding any of the math, hardware, or programming underneath.

These days, training an ML model doesn't even require coding knowledge. If you do code, your code probably works on CPUs or GPUs, locally or on AWS/Azure/Google Cloud, and with one GPU or 32 across four different machines.

Furthermore, modern ML platforms do under-the-hood optimization to accelerate model execution. ML models are now easy to write, easy to share, readable, well optimized, and simple to scale.
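As a sketch of what "works on CPUs or GPUs" looks like in practice, here is a toy PyTorch snippet; the model and sizes are arbitrary, and the closing remark about multi-machine training points at the standard DistributedDataParallel/torchrun route rather than anything specific to this post.

    import torch
    import torch.nn as nn

    # The same script runs wherever it lands: pick whatever hardware is present.
    device = "cuda" if torch.cuda.is_available() else "cpu"

    model = nn.Linear(512, 10).to(device)          # arbitrary toy model
    batch = torch.randn(64, 512, device=device)
    out = model(batch)
    print(f"ran on {device}, output shape {tuple(out.shape)}")

    # Scaling to "32 GPUs across four machines" is then mostly configuration:
    # wrap the model in torch.nn.parallel.DistributedDataParallel and launch
    # with torchrun, rather than rewriting the model itself.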

Research innovations

There are two sets of important advances that enabled large-scale research. The first is a legion of improvements that let less computation achieve more. The second is a set of methods that let more computation be thrown at the same problem for better and faster results. Among the many advances here, three stick out: transformers, pipeline parallelism, and self-supervised learning.

Transformers are models that run really well on GPUs by leveraging very efficient matrix multiplication primitives. They were originally designed for text data, but it turns out that for models whose size is measured in gigabytes, transformers are just better than their competition for the same amount of computation.
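For the curious, the claim about efficient matrix multiplication primitives can be made concrete: the heart of a transformer's attention is two big batched matmuls plus a softmax. A minimal sketch with made-up tensor sizes:

    import torch

    batch, heads, seq_len, d_head = 8, 12, 128, 64   # made-up sizes
    q = torch.randn(batch, heads, seq_len, d_head)
    k = torch.randn(batch, heads, seq_len, d_head)
    v = torch.randn(batch, heads, seq_len, d_head)

    scores = q @ k.transpose(-2, -1) / d_head ** 0.5   # batched matmul #1
    weights = scores.softmax(dim=-1)                   # cheap elementwise work
    out = weights @ v                                  # batched matmul #2
    print(out.shape)   # torch.Size([8, 12, 128, 64])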

If we think back to our weightlifting analogy, transformers are like your leg muscles. For sufficiently large loads, they can't be beat!

(As an ML systems researcher, I get a strange joy from every new application where just throwing a big transformer at the problem beats years of custom machine learning approaches and panels of experts.)

Pipeline parallelism is a bit complicated to explain to a nontechnical audience, but the short version is that training a machine learning model requires much more GPU memory than the size of the model itself. For small models, splitting the data between GPUs is the best approach. For large models, splitting the model across GPUs is better. Pipeline parallelism is a much better way of splitting the model than prior approaches, especially for models larger than a gigabyte.

Pipeline parallelism is like having good form. For smaller lifts it's not a big deal, but it is critical for larger lifts.
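For readers who want something more concrete than the analogy, here is a toy illustration of the splitting idea (not a real pipeline engine; the layer sizes and device names are placeholders): the model is cut into stages living on different devices, and micro-batches flow through them. Real GPipe-style implementations overlap the micro-batches across stages and handle the backward pass; this sketch only shows the partitioning.

    import torch
    import torch.nn as nn

    # Use two GPUs if available; otherwise fall back to CPU so the sketch still runs.
    two_gpus = torch.cuda.device_count() >= 2
    dev0 = "cuda:0" if two_gpus else "cpu"
    dev1 = "cuda:1" if two_gpus else "cpu"

    # Two "stages" of a model, each living on its own device.
    stage1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to(dev0)
    stage2 = nn.Linear(4096, 1024).to(dev1)

    batch = torch.randn(64, 1024)
    micro_batches = batch.chunk(4)                # 4 micro-batches of 16 each

    outputs = []
    for mb in micro_batches:
        hidden = stage1(mb.to(dev0))              # stage 1 runs on device 0
        outputs.append(stage2(hidden.to(dev1)))   # stage 2 runs on device 1
    result = torch.cat(outputs)
    print(result.shape)                           # torch.Size([64, 1024])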

Self-supervised learning is like making flashcards to study for a test. Ideally, your teacher would make you a comprehensive set of practice questions, but that takes a lot of effort on the teacher's part. A self-directed student can take data that doesn't have "questions" (labels) and make up their own questions to learn the material. For example, a model trying to learn English could take a sentence, hide a bunch of words, and try to guess them. This is much cheaper than having a human "make questions".

Self supervised learning is like cooking your own food instead of hiring a personal chef for your nutritional needs. It might be worse, but it is so much cheaper, and for the same price you can make a lot more food!
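Here is a minimal sketch of the "hide words and guess them" recipe; the masking probability and the [MASK] token are arbitrary choices of mine, but the point is that (input, target) pairs come for free from unlabeled text.

    import random

    random.seed(0)
    MASK = "[MASK]"

    def make_masked_example(sentence: str, mask_prob: float = 0.3):
        """Turn one unlabeled sentence into a (masked input, targets) pair."""
        inputs, targets = [], []
        for word in sentence.split():
            if random.random() < mask_prob:
                inputs.append(MASK)
                targets.append(word)      # the model must recover this word
            else:
                inputs.append(word)
                targets.append(None)      # nothing to predict at this position
        return inputs, targets

    inp, tgt = make_masked_example("the cat sat on the mat because it was warm")
    print(inp)   # the sentence with some words replaced by [MASK]
    print(tgt)   # the hidden words to guess (None where nothing was masked)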

How have those underlying factors changed in the past three years?

TL;DR: not much. We haven't gotten stronger in the past few years; we have just done a bunch of different exercises that use the same muscles.

All of the advances I mentioned in the last section were from 2018 or earlier.

(For the purists: self-supervised learning went mainstream for vision in 2020, when it finally outperformed supervised learning.)

Chips are no longer getting twice as fast every two years like they used to (Moore's law is dying). The cost of a single training run for the largest ML models is on the order of ten million dollars. Adding more GPUs and more computation is pushing against what companies are willing to burn on research that doesn't generate revenue. Unlike in the prior four years, we cannot scale up the size of models by a thousand times again; no one is willing to spend billions of dollars on training runs yet.
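As a back-of-the-envelope illustration of where "on the order of ten million dollars" can come from (every number below is hypothetical, not a figure from any actual training run):

    # Every number here is hypothetical, chosen only to show the arithmetic.
    gpus = 2000                # accelerators running in parallel
    days = 60                  # wall-clock training time
    price_per_gpu_hour = 3.0   # rough cloud-style rate, in USD

    cost = gpus * days * 24 * price_per_gpu_hour
    print(f"~${cost / 1e6:.1f}M")   # ~$8.6M, i.e. order of ten million dollars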

From a hardware perspective, we should expect the pace of innovation to slow in the coming years.

Software advances are mixed. Using ML models is becoming easier by the day. With libraries like Hugging Face's transformers, a single line of code can run a state-of-the-art model for your particular use case. There is a lot of room for software innovations that make these tools easier for non-technical audiences, but right now very little research is bottlenecked by software.
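For example, a minimal sketch using the Hugging Face transformers library (assuming it is installed; the default pretrained model is downloaded automatically):

    from transformers import pipeline

    # The one line doing the work: downloads and wires up a pretrained model.
    classifier = pipeline("sentiment-analysis")

    print(classifier("Scaling laws made this model surprisingly good."))
    # e.g. [{'label': 'POSITIVE', 'score': 0.99...}]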

Research advances are the X factor. Lots of people are working on these problems, and it's possible there is a magic trick for intelligence at existing compute budgets. But that is, and always was, true. The most important research advances of the last few years primarily enabled us to use more GPUs for a given problem. Now that we are starting to run up against the limits of data acquisition and monetary cost, less low-hanging fruit is available.

(Side note: even Facebook has trouble training current state-of-the-art models. Here are some chronicles of them trying to train a GPT-3-sized model.)

Conclusion.

I don't think we should expect performance gains in AI to accelerate over the next few years. As a researcher in the field, I expect the next few years to bring a lot of advances in the "long tail" of use cases and less growth in the most-studied areas, because we have already harvested the easy gains from hardware and software over the past decade.

This is my first time posting to LessWrong, and I decided to post a lightly edited first draft because if I start doing heavy edits I don't stop. Every time I see a very fast AGI prediction or someone claiming Moore's law will last a few more decades, I start to write something; this time I actually finished before deciding to rewrite. As a result, this isn't an airtight argument, but more the general feeling of someone who has been at two of the top research institutions in the world.

20 comments

Comments sorted by top scores.

comment by gwern · 2022-05-19T01:07:35.644Z · LW(p) · GW(p)

Alphastar was trained by creating a league of AlphaStars which competed against each other in actual games. To continue our weightlifting analogy, this is like a higher rep range with lower weight.

I think by this point your weightlifting analogy has started to obscure much more than clarify. (Speaking as someone who just came back from doing some higher-rep exercises with lower weight, I struggle to see how that was in any sense like the AlphaStar League PBT training.)


I disagree with the claim that progress has slowed down, but I am also not too sure what you are arguing, since you are redefining 'progress' to mean something other than 'quickly making way more powerful systems like AlphaFold or GPT-3', which you do agree is happening. To rephrase this more like the past scaling discussions, I think you are arguing something along the lines of:

Recent 'AI progress' in DL is unsustainable because it was due not to fundamentals but picking low-hanging fruits, the one-time using-up of a compute overhang: it was largely driven by relatively small innovations like the Transformer which unlocked scaling, combined with far more money spent on compute to achieve that scaling - as we see in the 'AI And Compute' trend. This trend broke around when it was documented, and will not resume: PaLM is about as large as it'll get for the foreseeable future. The fundamentals remain largely unchanged, and if anything, improvement of those slowed recently as everyone was distracted picking the low-hanging fruits and applying them. Thus, the near future will be very disappointing to anyone extrapolating from the past few years, as we have returned to the regime where research ideas are the bottleneck, and not data/compute/money, and the necessary breakthrough research ideas will arrive unpredictably at their own pace.

Replies from: phdead
comment by phdead · 2022-05-19T01:44:04.371Z · LW(p) · GW(p)

The summary is spot on! I would add that the compute overhang was not just due to scaling, but also due to 30 years of Moore's law and NVIDIA starting to optimize their GPUs for DL workloads.

The rep-range idea was meant to communicate that, despite AlphaStar being a much smaller model than GPT, the training costs of the two were much closer because of the way AlphaStar was trained. Reading it now, it does seem confusing.

I meant progress in research innovations. You are right, though: from an application perspective, the plethora of low-hanging fruit will have a lot of positive effects on the world at large.

Replies from: Kaj_Sotala
comment by Kaj_Sotala · 2022-05-20T13:16:17.528Z · LW(p) · GW(p)

I'm not certain if "the fundamentals remain largely unchanged" necessarily implies "the near future will be very disappointing to anyone extrapolating from the past few years", though. Yes, it's true that if the recent results didn't depend on improvements in fundamentals, then we can't use the recent results to extrapolate further progress in fundamentals. 

But on the other hand, if the recent results didn't depend on fundamentals, then that implies that you can accomplish quite a lot without many improvements on fundamentals. This implies that if anyone managed just one advance on the fundamental side, then that could again allow for several years of continued improvement, and we wouldn't need to see lots of fundamental advances to see a lot of improvement.

So while your argument reduces the probability of us seeing a lot of fundamental progress in the near future (making further impressive results less likely), it also implies that the amount of fundamental progress that is required is less than might otherwise be expected (making further impressive results more likely). 

Replies from: gwern, phdead
comment by gwern · 2022-05-20T18:14:56.342Z · LW(p) · GW(p)

This point has also been made before: predictions of short-term stagnation without also simultaneously bumping back AGI timelines would appear to imply steep acceleration at some point, in order for the necessary amounts of progress to 'fit' in the later time periods.

comment by phdead · 2022-05-22T19:27:47.364Z · LW(p) · GW(p)

The point I was trying to make is not that there weren't fundamental advances in the past. There were decades of advances in fundamentals that rocketed development forward at an unsustainable pace. The effect of this can be seen in the sheer amount of computation being used for SOTA models. I don't foresee that same leap happening twice.

comment by Ilio · 2022-05-19T13:34:05.580Z · LW(p) · GW(p)

Pardon the half-sneering tone, but old nan can't resist: « Oh, my sweet summer child, what do you know of fearing noob gains? Fear is for AI winter, my little lord, when the vanishing gradient problem was a hundred feet deep and the ice wind came howling out of funding agencies, cutting every budget, dispersing the students, freezing the sparse spared researchers... »

Seriously, three years is just a data point, and you want to conclude something about the rate of change! I guess you would agree that 2016-2022 saw more gains than 2010-2016, and not because the latter were boring times. I disagree that finding out what big transformers could do over the last three years was not a big deal, or even that it was low-hanging fruit. I guess it was low-hanging fruit for you, because of the tools you had access to, and I interpret your post as a deep and true intuition that the next step will demand different tools (I vote for: « clever inferences from functional neuroscience & neuropsychology »). In any case, welcome to LessWrong and thanks for your precious input! (even if old nan was amazed you were expecting even faster progress!)

Replies from: phdead
comment by phdead · 2022-05-22T19:30:56.104Z · LW(p) · GW(p)

I am a young, bushy-eyed, first-year PhD. I imagine if you knew how much of a child of summer I was, you would sneer on sheer principle, and it would be justified. I have seen a lot of people expecting eternal summer, and this is why I predict a chilly fall. Not a full winter, but a slowdown as expectations come back to reality.

Replies from: Ilio
comment by Ilio · 2022-05-24T23:05:22.194Z · LW(p) · GW(p)

I wish I had been wise enough at your age to post my gut feelings on the internet so that I could better update later. Well, the internet did not exist back then, but you get the idea.

One question after gwern’s reformulation: do you agree that, in the past, technical progress in ML almost always came first (before fundamental understanding)? In other words, is the crux of your post that we should no longer hope for practical progress without truly understanding why what we do should work?

comment by the gears to ascension (lahwran) · 2022-05-19T20:55:07.455Z · LW(p) · GW(p)

half-serious: indeed, but noob gains are all you need

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2022-05-19T18:53:54.232Z · LW(p) · GW(p)

Ben Garfinkel made basically this same point in 2019.

Serious forecasts of AI timelines (such as Ajeya's and mine) already factor this in.

comment by p.b. · 2022-05-19T05:00:59.757Z · LW(p) · GW(p)

I think one can argue that the pace of innovation has been slowing down. But this is due to transformers crowding everything out. And this is due to transformers apparently being able to model pretty much anything very well. Which is exactly the realisation that makes AGI timelines shorter. 

comment by Morpheus · 2022-05-18T21:39:00.294Z · LW(p) · GW(p)

I would not have expected progress to have sped up [LW · GW]. But I agree that lots of recent progress could naively be interpreted this way. So it makes sense to keep in mind that the current deep learning paradigm might come to a halt. Though the thing that worries me is that deep learning already has enough momentum to get us to AGI even while slowing down.

Replies from: phdead
comment by phdead · 2022-05-18T23:49:43.013Z · LW(p) · GW(p)

Out of curiosity, what is your reasoning behind believing that DL has enough momentum to reach AGI?

Replies from: Morpheus
comment by Morpheus · 2022-05-19T12:30:48.242Z · LW(p) · GW(p)

Mostly abstract arguments that don't actually depend on DL in particular (or at least not to a strong degree). E.g., stupid evolution was able to do it with human brains. This spreadsheet is nice for playing with the implications of different models (I couldn't find the Ajeya report it belongs to). Though I haven't taken the time to thoroughly think through this, because plugging in reasonable values gave distributions that seemed too broad to bother.

The point I wanted to make is that you can believe things are slowing down (I am more sympathetic to the view where AI will not have a big/galactic impact until things are too late) and still be worried.

comment by ryan_b · 2022-05-20T16:46:30.149Z · LW(p) · GW(p)

I suppose the obvious follow up question is: do you think there are any interesting ideas being pursued currently? Even nascent ones?

Two that I (a layperson) find interesting are the interpretability/transparency and neural ODE angles, though both of these are less about capability than about understanding what makes capability work at all.

comment by Chris_Leong · 2022-05-19T13:18:38.704Z · LW(p) · GW(p)

A few counterpoints (please note that I'm definitely not an expert on AI, so take with a grain of salt):

  • There seems to have been a lot more progress recently. I suspect that part of this is due to DeepMind and OpenAI having parallelised their operations. Instead of one big release per year, they seem to have multiple projects producing a payoff each year.
  • Some kinds of progress become easier as you gain access to more powerful systems, and once you have a powerful enough system, some kinds of progress become relatively easy. Convolutions, as opposed to hardcoded visual features, only became viable once we had systems powerful enough to simultaneously learn the features and how to combine them. The solution was pretty much "just let gradient descent handle it automatically". My expectation is that there are all kinds of schemes for improving AI that wouldn't have worked in the past, but which are already viable or will become viable soon. Similarly, Gato seems to have pretty much been "train an agent to imitate the answers of a bunch of expert systems"; this approach likely wouldn't have worked so well in the past due to catastrophic forgetting, but once you have a powerful enough system, it seems to just work.
  • Even if we start hitting scaling limits, many of the factors that have spurred on the development of AI will remain, including: AI systems having become commercially valuable, the incredible amount of talent being drawn into the field, and the abundance of tools that make working with AI easier. So even if progress regresses somewhat, we should expect it to remain above the previous baseline.
Replies from: p.b.
comment by p.b. · 2022-05-20T16:13:34.113Z · LW(p) · GW(p)

Deepmind has hundreds of researchers and OpenAI also has several groups working on different things. That hasn't changed much.

Video generation will become viable and a dynamic visual understanding will come with it. Maybe then robotics will take off.

Yeah, I think there is so much work going on that it is not terribly unlikely that when the scaling limit is reached the next steps already exist and only have to be adopted by the big players. 

comment by Stephen McAleese (stephen-mcaleese) · 2022-05-21T23:16:27.636Z · LW(p) · GW(p)

My understanding of your argument is that AI progress will slow down in the future because the low-hanging fruit in hardware, software, and research have been exhausted.

Hardware: researchers have scaled models to the point where they cost millions of dollars. At this point, scaling them further is difficult. Moore's Law is slowing down, making it harder to scale models. 

In my opinion, it seems unlikely, but not inconceivable, that training budgets will increase further. It could happen if more useful models result in greater financial returns and investment in a positive feedback loop. Human labor is expensive, and creating an artificial replacement could still be profitable even with large training costs. Another possibility is that government investment increases AI training budgets in some kind of AI Manhattan Project, though this doesn't seem likely to me given that most progress has occurred in private companies in recent years.

I'm somewhat less pessimistic about the death of Moore's Law. Although it's getting harder to improve chip performance, there is still a strong incentive to improve it. We at least know that it's possible for improvements to continue because current technology is not near the physical limit (1).

Software: tooling has improved a lot in recent years. For example, libraries such as Hugging Face's transformers have made it much easier to use the latest models. The post argues that research is not bottlenecked by progress in software.

This point seems valid to me. However, better AI-assisted programming tools in the future could increase the rate of software development even more.

Research: transformers, pipeline parallelism, and self-supervised learning have made it possible to train large models with much better performance. The post also says that many of these innovations (e.g. the transformer) are from 2018 or earlier.

New techniques are introduced, they mature and are replaced by newer techniques. For example, progress in CPU speed stagnated and GPUs increased performance dramatically. TPUs have improved on GPUs and we'll probably see further progress. If this is true, then some of the AI techniques that will be commonplace in several years are probably already under development but not mature enough to be used in mainstream AI. Instead of a global slowdown in AI, I see AI research progress as a series of s-curves.

I can't imagine how future architectures will be different but the historical trend has always been that new and better techniques replace old ones.

As more money and talent are invested in AI research, progress should accelerate given a fixed difficulty in making progress. Even if the problems become harder to solve, increased talent and financial investment should offset the increase in difficulty. Therefore, it seems like the problems would have to become much harder for AI progress to slow down significantly which doesn't seem likely to me given how new the field is.

Given that deep learning has only been really popular for about ten years, it seems unlikely that most of the low-hanging fruit has already been picked, unlike particle physics, which has been around for decades and where particle colliders have had diminishing returns.

Overall, I'm more bullish on AI progress in the future than this post and I expect more significant progress to occur.

1: https://en.wikipedia.org/wiki/Landauer%27s_principle

comment by michael_mjd · 2022-05-21T07:11:36.226Z · LW(p) · GW(p)

As an ML engineer, I think it's plausible. I also think there are some other factors that could cushion or mitigate a slowdown. First, I think there is more low-hanging fruit available. Now that we've seen what large transformer models can do in the text domain, and in a text-to-image DALL-E model, I think the obvious next step is to ingest large quantities of video data. We often talk about the sample inefficiency of modern methods as compared with humans, but I think humans are exposed to a TON of sensory data in building their world model. This seems an obvious next step. Though if hardware really stalls, maybe there won't be enough compute or budget to train a 1T+ parameter multimodal model.

The second mitigating factor, I think, may be that funding has already been unlocked, to some extent. There is now a lot more money going around for basic research, possibly toward the next big thing. The only thing that might stop it is academic momentum in the wrong directions. Though from an x-risk standpoint, maybe that's not a bad thing, heh.

In my mental model, if the large transformer models are already good enough to do what we've shown them to be able to do, it seems possible that the remaining innovations would be more on the side of engineering the right submodules and cost functions. Maybe something along the lines of Yann LeCun's recent keynotes.

comment by Faustine Li (faustine-li) · 2022-05-19T16:43:51.064Z · LW(p) · GW(p)

I agree that most of the recent large-model gains have been due to the surplus of compute and data, and that theory and technique will have to catch up eventually ... what I'm not convinced of is why that would necessarily be slow.

I would argue there's a theory and technique overhang with self-supervised learning being just one area of popular research. We haven't needed to dip very deeply yet since training bigger transformers with more data "just works."

There's very weak evidence that we're hitting the limits of deep learning itself or even just the transformer architecture. Ultimately, that is the real limiter ... certainly data and compute are the conceptually easier problems to solve. Maybe in the short-term that's enough.