Is Deep Learning Actually Hitting a Wall? Evaluating Ilya Sutskever's Recent Claims

post by garrison · 2024-11-13T17:00:01.005Z · LW · GW · 13 comments

This is a link post for https://garrisonlovely.substack.com/p/is-deep-learning-actually-hitting

13 comments

Comments sorted by top scores.

comment by Vladimir_Nesov · 2024-11-13T18:47:23.575Z · LW(p) · GW(p)

Scaling progress is constrained by the physical training systems[1]. The scale of the training systems is constrained by funding, and funding is constrained by the scale of the tech giants and by how impressive current AI is. The largest companies backing AGI labs are spending on the order of $50 billion a year on capex (building infrastructure around the world). The 100K H100 clusters that at least OpenAI, xAI, and Meta recently got access to cost about $5 billion each. The next generation of training systems is currently being built, will cost $25-$40 billion each (at about 1 gigawatt), and will become available in late 2025 or early 2026.
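A rough back-of-envelope check of the $5 billion figure (a sketch only; the per-GPU cost and power draw below are assumed round numbers, not figures from the comment):

```python
# Back-of-envelope check of the 100K-GPU cluster figures above.
# The per-unit numbers are assumptions (round, commonly cited values),
# not figures from the comment.

NUM_GPUS = 100_000
COST_PER_GPU_USD = 30_000   # assumed all-in cost per H100, incl. share of servers/networking
POWER_PER_GPU_KW = 1.4      # assumed all-in draw per GPU (GPU + host + cooling overhead)

hardware_cost_usd = NUM_GPUS * COST_PER_GPU_USD        # ~$3B in hardware alone
total_power_mw = NUM_GPUS * POWER_PER_GPU_KW / 1_000   # ~140 MW at the campus

print(f"hardware: ~${hardware_cost_usd / 1e9:.1f}B")
print(f"power:    ~{total_power_mw:.0f} MW")
# With datacenter construction and power infrastructure on top, a total
# around $5B per cluster is plausible, consistent with the comment's figure.
```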

Without a shocking level of success, the scale of the training compute available to the leading AGI labs over the next 2-3 years is out of their hands: it's the systems they already have or the systems already being built. They need to make optimal use of this compute in order to secure funding for the generation of training systems that comes after, which will cost $100-$150 billion each (at about 5 gigawatts). The decisions about those systems will be made in the next 1-2 years, so that they might get built in 2026-2027.

Thus, paradoxically, there is no urgency for the AGI labs to use all their compute to improve their products in the next few months. What they need instead is to maximize how their technology looks in a year or two, which motivates spending more compute on research now rather than immediately going for the most scale current training systems enable. One exception might be xAI, which still needs to raise money for its $25-$40 billion training system. The same goes for even newer companies like SSI, but they don't even have $5 billion training systems with which to demonstrate their current capabilities, unless they do something sufficiently different.


  1. Training systems are currently clusters located on a single datacenter campus. This might change soon, possibly even in 2025-2026, which would let the power needs at each campus remain manageable. ↩︎

comment by Nathan Helm-Burger (nathan-helm-burger) · 2024-11-13T20:23:54.811Z · LW(p) · GW(p)

https://www.lesswrong.com/posts/NRZfxAJztvx2ES5LG/a-path-to-human-autonomy [LW · GW]

Vladimir makes an excellent point that it's simply too soon to tell whether the next generation of LLMs (e.g. gpt5) will fizzle. I do think there's reasonable evidence for suspecting that the generation AFTER that (e.g. gpt6) won't be a straightforward scale-up of gpt4. I think we're in a compute and data overhang for AGI, and that further parameter, compute, and data scaling beyond the gpt5 level would be a waste of money.

The real question is whether gpt5-generation models will be just enough more capable than current ones to substantially increase the rate of the true limiting factor: algorithmic improvement.

comment by Noosphere89 (sharmake-farah) · 2024-11-13T20:05:57.825Z · LW(p) · GW(p)

I just want to provide one important piece of information:

It turns out that Ilya Sutskever was misinterpreted: he wasn't claiming that the models are plateauing, but rather saying that other directions are working out better:

https://www.lesswrong.com/posts/wr2SxQuRvcXeDBbNZ/?commentId=JFNZ5MGZnzKRtFFMu [LW · GW]

Replies from: Vladimir_Nesov, garrison
comment by Vladimir_Nesov · 2024-11-13T21:09:26.269Z · LW(p) · GW(p)

turns out that Ilya Sutskever was misinterpreted

That's not exactly my claim [LW · GW]. If he said more to the reporters than his words quoted in the article[1], then it might've been justified to interpret him as saying that pretraining is plateauing. The article isn't clear on whether he said more. If he said nothing more, then the interpretation about plateauing doesn't follow, but could in principle still be correct.

Another point is that Sutskever left OpenAI before they trained the first 100K H100 model, and in any case one data point from a single training run isn't much evidence. The experiment that could convincingly demonstrate plateauing hasn't been performed yet [LW(p) · GW(p)]. Give it at least a few months, for multiple labs to try and fail.


  1. “The 2010s were the age of scaling, now we're back in the age of wonder and discovery once again. Everyone is looking for the next thing,” Sutskever said. “Scaling the right thing matters more now than ever.”

    ↩︎
Replies from: sharmake-farah
comment by Noosphere89 (sharmake-farah) · 2024-11-13T21:26:20.734Z · LW(p) · GW(p)

I definitely agree that people are overupdating on this training run, and that we will need to wait.

(I made this mistake myself.)

comment by garrison · 2024-11-13T22:10:58.442Z · LW(p) · GW(p)

I think this is a misunderstanding of the piece and of how journalists typically paraphrase things. The reporters wrote that Ilya told them that results from scaling up pre-training have plateaued. So he probably said something to that effect, but for readability and word-count reasons, they paraphrased it.

If a reported story from a credible outlet says something like "X told us that Y," then the reporters are sourcing claim Y to X, whether or not they include a direct quote.

The plateau claim also jibes with The Information's story about OpenAI, as well as with a few other similar claims made by people in industry.

Ilya probably spoke to the reporter(s) for at least a few minutes, so the quotes you see are a tiny fraction of everything he said.

Replies from: sharmake-farah
comment by Noosphere89 (sharmake-farah) · 2024-11-13T22:15:20.041Z · LW(p) · GW(p)

Fair enough, I'll retract my comment.

comment by Steven Byrnes (steve2152) · 2024-11-13T18:16:34.423Z · LW(p) · GW(p)

I’d bet that Noam Brown’s TED AI talk has a lot of overlap with this one that he gave in May. So you don’t have to talk about it second-hand; you can hear it straight from the source. :) In particular, the “100,000×” poker scale-up claim is right near the beginning, around 6 minutes in.

Replies from: Vladimir_Nesov
comment by Vladimir_Nesov · 2024-11-13T18:55:58.245Z · LW(p) · GW(p)

There is a newer post-o1 Noam Brown talk from Sep 2024 that covers similar ground.

Replies from: garrison
comment by garrison · 2024-11-14T16:16:32.681Z · LW(p) · GW(p)

Thanks for these!

comment by RogerDearnaley (roger-d-1) · 2024-11-17T05:24:57.031Z · LW(p) · GW(p)

If these rumors are true, it sounds like we’re already starting to hit the issue I predicted in LLMs May Find It Hard to FOOM [LW · GW]. The majority of content on the Internet isn’t written by geniuses with post-doctoral experience, so we’re starting to run out of the highest-quality training material for getting LLMs past doctoral-student performance levels. However, as I describe there, this isn’t a wall, it’s just a slowdown: we need to start using AI to generate a lot more high-quality training data. As o1 shows, that’s entirely possible, using inference-time compute scaling and then training on the results. We're having AI do the equivalent of System 2 thinking (in contexts where we can check that the results are accurate), and then attempting to train a smarter AI that can solve the same problems by System 1 thinking.
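A toy, runnable sketch of that generate-verify-collect loop (the "solver" below is just blind guessing on arithmetic problems, standing in for expensive chain-of-thought sampling; none of this is a real model or training API):

```python
import random

# Toy illustration of the "System 2 -> System 1" pipeline described above:
# spend a lot of inference-time compute on problems whose answers can be
# checked, keep only the verified solutions, and collect them as synthetic
# training data. Blind guessing stands in for long chain-of-thought sampling.

def slow_system2_solve(problem, attempts=200):
    """Expensive search standing in for inference-time compute scaling."""
    a, b = problem
    for _ in range(attempts):
        guess = random.randint(0, 2 * (a + b) + 1)
        if guess == a + b:            # external check that the answer is correct
            return guess
    return None                       # search failed; discard this problem

def build_verified_dataset(problems):
    """Keep only (problem, answer) pairs that passed the check."""
    dataset = []
    for problem in problems:
        answer = slow_system2_solve(problem)
        if answer is not None:
            dataset.append((problem, answer))
    return dataset

problems = [(random.randint(0, 20), random.randint(0, 20)) for _ in range(5)]
print(build_verified_dataset(problems))
# In the real version, a model would then be fine-tuned on this verified data
# so it can produce the answers directly ("System 1"), with much less search.
```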

However, this might be enough to render fast takeoff unlikely, which from an alignment point of view would be an excellent thing.

Now we just need to make sure all that synthetic training data we’re having the AI generate is well aligned.

Replies from: sharmake-farah
comment by Noosphere89 (sharmake-farah) · 2024-11-17T14:56:29.575Z · LW(p) · GW(p)

This is also my interpretation of the rumors, assuming they are true, which I don't put much probability on.

comment by Maxime Riché (maxime-riche) · 2024-11-13T23:10:18.109Z · LW(p) · GW(p)

People may be blind to the fact that improvements from gpt2 to gpt3 to gpt4 were driven both by scaling training compute (by ~2 OOM between each generation) and (the hidden part) by scaling test compute through longer context and CoT (by something like 1.5-2 OOM between each generation as well).

If gpt5 uses just 2 OOM more training compute than gpt4 but the same test compute, then we should not expect "similar" gains; we should expect "half".

o1 may use 2 OOM more test compute than gpt4. So gpt4 => o1+gpt5 could be expected to be similar to gpt3 => gpt4.
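Making that arithmetic explicit (a sketch that treats training-compute and test-compute OOMs as roughly additive, which is the comment's simplifying assumption; the numbers are the comment's rough figures, not measurements):

```python
# Orders of magnitude (OOM) of effective scaling between generations, using
# the figures from the comment above. Treating training-compute OOMs and
# test-compute OOMs as roughly additive is the comment's simplifying
# assumption, not an established scaling law.

train_oom_per_gen = 2.0   # gpt2 -> gpt3 -> gpt4: ~2 OOM more training compute per step
test_oom_per_gen = 1.75   # "like 1.5-2 OOM" from longer context and CoT

full_jump = train_oom_per_gen + test_oom_per_gen   # ~3.75 OOM, e.g. gpt3 -> gpt4
train_only_jump = train_oom_per_gen                # ~2 OOM: gpt5 with no test-compute gain
o1_style_jump = train_oom_per_gen + 2.0            # gpt5 plus ~2 OOM more test compute (o1-style)

print(f"full generation jump:    {full_jump:.2f} OOM")
print(f"training-only jump:      {train_only_jump:.2f} OOM  (~half of the above)")
print(f"gpt5 + o1-style compute: {o1_style_jump:.2f} OOM  (comparable to gpt3 -> gpt4)")
```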