Slow corporations as an intuition pump for AI R&D automation

post by ryan_greenblatt, elifland · 2025-05-09T14:49:38.987Z · LW · GW · 23 comments

Contents

  The intuition pump
  Clarifications
  Asymmetries
  Implications

How much should we expect AI progress to speed up after fully automating AI R&D? This post presents an intuition pump for reasoning about the level of acceleration by comparing hypothetical companies with different labor forces, amounts of serial time, and compute. Essentially, if you'd expect an AI research lab with substantially less serial time and fewer researchers than current labs (but the same cumulative compute) to make substantially less algorithmic progress, you should also expect a research lab with an army of automated researchers running at much higher serial speed to get correspondingly more done. (And if you'd expect the company with less serial time to make similar amounts of progress, the same reasoning would also imply limited acceleration.) We also discuss potential sources of asymmetry which could break this correspondence, and implications of this intuition pump.

The intuition pump

Imagine theoretical AI companies with the following properties:

| | SlowCorp | NormalCorp |
| --- | --- | --- |
| Analog to | NormalCorp with 50x slower, 5x less numerous employees, and lower ceiling on employee quality | Future frontier AI company |
| Time to work on AI R&D | 1 week | 1 year |
| Number of AI researchers and engineers | 800 | 4,000 |
| Researcher/engineer quality | Median frontier AI company researcher/engineer | Similar to current frontier AI companies if they expanded rapidly[1] |
| H100s | 500 million | 10 million |
| Cumulative H100-years | 10 million | 10 million |

NormalCorp is similar to a future frontier AI company. SlowCorp is like NormalCorp except with 50x less serial time, a 5x smaller workforce, and lacking above median researchers/engineers.[2] How much less would SlowCorp accomplish than NormalCorp, i.e. what fraction of NormalCorp's time does it take to achieve the amount of algorithmic progress that SlowCorp would get in a week?

SlowCorp has 50x less serial labor, 5x less parallel labor, as well as reduced labor quality. Intuitively, it seems like it should make much less progress than NormalCorp. My guess is that we should expect NormalCorp to achieve SlowCorp's total progress in at most roughly 1/10th of its time.

Now let's consider an additional corporation, AutomatedCorp, which is an analog for a company sped up by AI R&D automation.

| | SlowCorp | NormalCorp | AutomatedCorp |
| --- | --- | --- | --- |
| Analog to | NormalCorp with 50x slower, 5x less numerous employees, and lower ceiling on employee quality | Future frontier AI company | Future company with fully automated AI R&D |
| Time to work on AI R&D | 1 week | 1 year | 50 years |
| Number of AI researchers and engineers | 800 | 4,000 | 200,000 |
| Researcher/engineer quality | Median frontier AI company researcher/engineer | Similar to current frontier AI companies if they expanded rapidly[3] | Level of world's 100 best researchers/engineers |
| H100s | 500 million | 10 million | 200,000[4] |
| Cumulative H100-years | 10 million | 10 million | 10 million |

AutomatedCorp is like NormalCorp except with 50x more serial time, a 50x larger workforce, and only world-class researchers and engineers. The jump from NormalCorp to AutomatedCorp is like the jump from SlowCorp to NormalCorp, except the workforce grows 50x rather than 5x (i.e. 10x more employees), and the structure of the increase in labor quality is a bit different.
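None of the numbers below are new; this is just a minimal sketch (in Python) multiplying out the table entries to show that the three corps are compute-matched while differing enormously in serial and parallel labor.

```python
# Rough sanity check on the table numbers: each corp is compute-matched
# (same cumulative H100-years) but differs hugely in serial time and labor.

corps = {
    #                 (years of serial time, researchers, H100s)
    "SlowCorp":       (1 / 52,               800,          500_000_000),
    "NormalCorp":     (1,                    4_000,        10_000_000),
    "AutomatedCorp":  (50,                   200_000,      200_000),
}

for name, (years, researchers, h100s) in corps.items():
    h100_years = h100s * years               # cumulative compute
    researcher_years = researchers * years   # cumulative (serial x parallel) labor
    print(f"{name:14s} H100-years ~ {h100_years:,.0f}  researcher-years ~ {researcher_years:,.0f}")

# SlowCorp -> NormalCorp: 50x more serial time, 5x more researchers (250x researcher-years).
# NormalCorp -> AutomatedCorp: 50x more serial time, 50x more researchers (2,500x researcher-years).
```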

It seems like the speedup from NormalCorp to AutomatedCorp should be at least similar to the jump from SlowCorp to NormalCorp, i.e. at least roughly 10x. My best guess is around 20x.

AutomatedCorp is an analogy for a hypothetical AI company with AI researchers that match the best human researcher while having 200k copies that are each 50x faster than humans.[5] If you have the intuition that a downgrade to SlowCorp would be very hobbling while this level of AI R&D automation wouldn't vastly speed up progress, consider how to reconcile this.

That's the basic argument. Below I will go over some clarifications, a few reasons the jumps between the corps might be asymmetric, and the implications of high speedups from AutomatedCorp.

Clarifications

There are a few potentially important details which aren't clear in the analogy, written in the context of the jump from NormalCorp to AutomatedCorp:

Asymmetries

Why would the current regime be special such that scaling up labor (including quality and speed) is highly asymmetric from scaling it down?

Here I'll cover asymmetries between the jumps from SlowCorp to NormalCorp and NormalCorp to AutomatedCorp.

There are some reasons you might eventually see asymmetry between improving vs. degrading labor quality, speed, and quantity. In particular, in some extreme limit you might e.g. just figure out the best experiments to run from an ex-ante perspective after doing all the possibly useful theoretical work etc. But, it's very unclear where we are relative to various absolute limits and there isn't any particular reason to expect we're very close. One way to think about this is to imagine some aliens which are actually 50x slower than us and which have ML researchers/engineers only as good as our median AI researchers/engineers (while having a similar absolute amount of compute in terms of FLOP/s). These aliens could consider the exact same hypothetical, but for them, the move from NormalCorp to AutomatedCorp is very similar to our move from SlowCorp to NormalCorp. So, if we're uncertain about whether we are these slow aliens in the hypothetical, we should think the situation is symmetric and our guesses for the SlowCorp vs. NormalCorp and NormalCorp vs. AutomatedCorp multipliers should be basically the same.

(That is, unless we can do some absolute analysis of our quantity/quality/speed of labor which implies that (e.g.) returns diminish right around now, or some absolute analysis of the relationship between labor and compute. Such an analysis would presumably need to be mechanistic (aka inside view) or utilize actual experiments (like I discuss in one of the items in the list above), because analysis which just looks at reference classes (aka outside view) would apply just as well to the aliens and doesn't take into account the amount of compute we have in practice. I don't know how you'd do this mechanistic analysis reliably, though actual experiments could work.)

Implications

I've now introduced some intuition pumps with AutomatedCorp, NormalCorp, and SlowCorp. Why do I think these intuition pumps are useful? I think the biggest crux about the plausibility of substantially faster AI progress due to AI automation of AI R&D is how much acceleration you'd see in something like the AutomatedCorp scenario (relative to the NormalCorp scenario). This doesn't have to be the crux: you could think the initial acceleration is high, but that progress will very quickly slow because diminishing returns to AI R&D effort bite harder than the resulting smarter, faster, and cheaper AI researchers can accelerate things further. But I think it is somewhat hard for the returns (and other factors) to look so bad that we won't at least have the equivalent of 3 years of overall AI progress (not just algorithms) within 1 year of seeing AIs matching the description of AutomatedCorp, if we condition on these AIs yielding an AI R&D acceleration multiplier of >20x.[7]
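To make the relationship between an AI R&D multiplier and "years of progress per calendar year" concrete, here is a toy decomposition that is my own simplification, not a model from the post: assume overall progress splits into an algorithmic component and a compute-scaling component, only the former is accelerated, and returns don't diminish within the year.

```python
# Toy model (my assumptions, not the post's): overall AI progress is a weighted
# mix of algorithmic progress and compute scaling. Automation multiplies only
# the algorithmic component; compute scaling continues at its usual pace.

def years_of_progress_per_calendar_year(algo_share, algo_multiplier):
    compute_share = 1 - algo_share
    return compute_share * 1 + algo_share * algo_multiplier

# If algorithms drive ~half of recent progress and automation gives a 20x
# multiplier on that half, a naive no-diminishing-returns estimate is:
print(years_of_progress_per_calendar_year(algo_share=0.5, algo_multiplier=20))  # 10.5

# Diminishing returns on R&D effort would pull this down substantially;
# the post only argues it is hard to pull it all the way below ~3.
```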

Another potential crux for downstream implications is how big of a deal >4 years of overall AI progress is. Notably, if we see 4-year timelines (e.g. to the level of AIs I've discussed), then 4 years of AI progress will have brought us from the systems we have now (e.g. o3) to full AI R&D automation, so another 4 years of progress feels intuitively very large.[8] Also, if we see higher returns to some period of AI progress (in terms of ability to accelerate AI R&D), then this makes a super-exponential loop where smarter AIs build ever smarter AI systems faster and faster [LW · GW] more likely. Overall, shorter timelines tend to imply faster takeoff (at least evidentially; the causal story is much more complex). I think some disagreements about takeoff would be resolved by conditioning on timelines and on what the run-up to a given level of capability looks like, because the disagreement is really about the returns to a given amount of AI progress.


  1. These employees are the best that NormalCorp could find while hiring aggressively over a few years, plus a smaller core of more experienced researchers and engineers (around 300) who've worked in AI for longer. They have some number of the best employees working in AI (perhaps they have 1/5 of the best 1000 people on earth), but most of their employees are more like typical tech employees: what NormalCorp could hire in a few years with high salaries and an aim to recruit rapidly. ↩︎

  2. And below-median employees too, but removing those shouldn't have as big an effect as removing the above-median ones. ↩︎

  3. These employees are the best that NormalCorp could find while hiring aggressively over a few years, plus a smaller core of more experienced researchers and engineers (around 300) who've worked in AI for longer. They have some number of the best employees working in AI (perhaps they have 1/5 of the best 1000 people on earth), but most of their employees are more like typical tech employees: what NormalCorp could hire in a few years with high salaries and an aim to recruit rapidly. ↩︎

  4. Roughly 1.5-3x smaller than OpenAI's current computational resources. ↩︎

  5. These are basically just the estimates for the number of copies and speed at the point of superhuman AI researchers in AI 2027, but I get similar numbers if I do the estimate myself. Note that (at least for my estimates) the 50x speed includes accounting for AIs working 24/7 (a factor of 3) and being better at coordinating and sharing state with weaker models so they can easily complete some tasks faster. It's plausible that heavy inference time compute use implies that we'll initially have a smaller number of slower AI researchers, but we should still expect that quantity and speed will quickly increase after this is initially achieved. So, you can think about this scenario as being what happens after allowing for some time for costs to drop. This scenario occurring a bit after initial automation doesn't massively alter the bottom line takeaways. (That said, if inference time compute allows for greatly boosting capabilities, then at the time when we have huge numbers of fast AI researchers matching the best humans, we might also be able to run a smaller number of researchers which are substantially qualitatively superhuman.) ↩︎

  6. Interestingly, this implies that AI runtime compute use is comparable to that of humans. Producing a second of cognition from a human takes perhaps 1e14 to 1e15 FLOP [AF · GW], or between 1/10 and 1 H100-seconds. We're imagining that AI inference takes 1/5 of an H100-second to produce a second of cognition (these conversions are spelled out in a short sketch after these footnotes). While inference requirements are similar in this scenario, I'm imagining that training requirements start substantially higher than human lifetime FLOP. (I'm imagining the AI was trained for roughly 1e28 FLOP while human lifetime FLOP is more like 1e24.) This seems roughly right as I think we should expect faster inference but bigger training requirements, at least after a bit of adaptation time etc., based on how historical AI progress has gone. But this is not super clear cut. ↩︎

  7. And we condition on reaching this level of capability prior to 2032 so that it is easier to understand the relevant regime, and on the relevant AI company going full steam ahead without external blockers. ↩︎

  8. The picture is a bit messy because I expect AI progress will start slowing due to slowed compute scaling by around 2030 or so (if we don't achieve very impressive AI by this point). This is partially due to continued compute scaling requiring very extreme quantities of investment by this point [LW · GW] and partially due to fab capacity running out as ML chips eat up a larger and larger share of fab capacity. In such a regime, I expect a somewhat higher fraction of the progress will be algorithmic (rather than from scaling compute or from finding additional data), though not by that much, as algorithmic progress is itself driven by additional compute rather than additional data. Also, the rate of algorithmic progress will be slower at an absolute level. So, 20x faster algorithmic progress will yield a higher overall progress multiplier, but progress will also be generally slower. So, you'll maybe get a lower number of 2024-equivalent years of progress, but a higher number of 2031-equivalent years of progress. ↩︎
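A minimal sketch spelling out the unit conversions in footnote 6. The ~1e15 FLOP/s per H100 figure is my assumption, implied by the footnote's "1/10 to 1 H100-seconds" range; everything else is taken directly from the footnote.

```python
# Spelling out footnote 6's unit conversions. Assumption (mine): ~1e15 FLOP/s
# per H100, which is what the footnote's "1/10 to 1 H100-seconds" range implies.

H100_FLOPS = 1e15                          # assumed FLOP/s per H100

human_flop_per_second = (1e14, 1e15)       # FLOP per second of human cognition (footnote's range)
human_h100_seconds = [f / H100_FLOPS for f in human_flop_per_second]
print(human_h100_seconds)                  # [0.1, 1.0] H100-seconds per second of cognition

ai_h100_seconds = 1 / 5                    # footnote's assumption for AI inference
print(ai_h100_seconds)                     # 0.2, i.e. within the human range above

training_flop = 1e28                       # assumed AI training compute
human_lifetime_flop = 1e24                 # rough human-lifetime FLOP estimate
print(training_flop / human_lifetime_flop) # ~1e4: training starts far above human lifetime FLOP
```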

23 comments

Comments sorted by top scores.

comment by Tom Davidson · 2025-05-09T21:40:15.867Z · LW(p) · GW(p)

However, I'm quite skeptical of this type of consideration making a big difference because the ML industry has already varied the compute input massively, with over 7 OOMs of compute difference between research now (in 2025) vs at the time of AlexNet 12 years ago, (invalidating the view that there is some relatively narrow range of inputs in which neither input is bottlenecking) 

 

Seems like this is a strawman of the bottlenecks view, which would say that the number of near frontier experiments, not compute, is the bottleneck and this quantity didn't scale up over that time

ETA: for example, if the compute scale up had happened, but no one had been allowed to run experiments with more compute than AlexNet, it seems a lot more plausible that the compute would have stopped helping because there just wouldn't have been enough people to plan the experiments

Plus the claim that alg progress might have been actively enabled by the access to new hardware scales

Replies from: ryan_greenblatt, ryan_greenblatt
comment by ryan_greenblatt · 2025-05-09T22:39:18.356Z · LW(p) · GW(p)

Seems like this is a strawman of the bottlenecks view, which would say that the number of near frontier experiments, not compute, is the bottleneck and this quantity didn't scale up over that time

Hmm, I mostly feel like I don't understand this view well enough to address it. Maybe I'll try to understand it better in the future.

(Also, I think I haven't seen anyone articulate this view other than you in a comment responding to me earlier, so I didn't think this exact perspective was that important to address. Edit: maybe we talked about this view in person at some point? Not sure.)

My current low confidence takes:

This view would imply that experiments at substantially smaller (but absolutely large) scale don't generalize up to a higher scale, or at least very quickly hit diminishing returns in generalizing up to higher scale, which seems a bit implausible to me.

An alternative option is to just reduce the frontier scale with AIs: you decide what training run scale you're going to run and optimize such that you can run many experiments near that scale. Presumably it will still be strictly better to scale up the compute to the extent you can, but maybe you wouldn't be seeing the full returns of this compute because you optimized at smaller scale. So, the view would also have to be that the returns diminish fast enough that optimizing at a smaller scale doesn't resolve this issue. (Concretely, the AI researchers in AutomatedCorp could target a roughly 10^25 FLOP training run, which would mean they'd be giving up maybe 3 OOMs of training FLOP supposing timelines in the next 5 years or so. This is a bit over 4 years of algorithmic progress they'd be giving up, which doesn't seem that bad?)

I wonder what biology says about this. I'd naively guess that brain improvements on rats generalized pretty well to humans, though we did eventually saturate on these improvements? Obviously very unsure, but maybe someone knows.

Replies from: tom-davidson-1
comment by Tom Davidson (tom-davidson-1) · 2025-05-12T10:14:06.463Z · LW(p) · GW(p)

This view would imply that experiments at substantially smaller (but absolutely large) scale don't generalize up to a higher scale, or at least very quickly hit diminishing returns in generalizing up to higher scale, which seems a bit implausible to me.

 

Agree this is an implication. (It's an implication of any view where compute can be a hard bottleneck -- past a certain point you learn 10X less info by running an experiment at a 10X smaller scale.)

 

But why implausible? Could we have developed RLHF, prompting, tool-use, and reasoning models via loads of experiments on GPT-2 scale models? Does make sense to me that those models just aren't smart enough to learn any of this and your experiments have 0 signal.

 

An alternative option is to just reduce the frontier scale with AIs

Yeah I think this is a plausible strategy.  If you can make 100X faster progress at the 10^26 scale than the 10^27 scale, why not do it.

 

Also, I think I haven't seen anyone articulate this view other than you in a comment responding to me earlier, so I didn't think this exact perspective was that important to address.

Well, unfortunately the people actively defending the view that compute will be a bottleneck haven't been specific about what they think the functional form is. They've just said vague things like "compute for experiments is a bottleneck". In that post I initially gave the simplest model for concretising that claim, and you followed suit in this post when talking about "7 OOMs", but I don't think anyone's said that model represents their view better than the 'near frontier experiments' model.

comment by ryan_greenblatt · 2025-05-09T22:55:34.867Z · LW(p) · GW(p)

ETA: for example, if the compute scale up had happened, but no one had been allowed to run experiments with more compute than AlexNet, it seems a lot more plausible that the compute would have stopped helping because there just wouldn't have been enough people to plan the experiments

Hmm, I'm not sure I buy the analogy here. Can't people just run parametric experiments at smaller scale? E.g., search over a really big space, do evolution style stuff, etc?

At a more basic level, I think the relevant "frontier scale" wasn't varying over the 7 OOMs of compute difference, as algorithmic progress keeps multiplying through the relevant scales and AI companies are ultimately trying to build AGI at whatever scale it takes, right? Like, I think the view would have to be that "frontier scale" varied along with the 7 OOMs of compute difference, but I'm not sure I buy this.

Replies from: tom-davidson-1, cfoster0
comment by Tom Davidson (tom-davidson-1) · 2025-05-12T10:17:10.051Z · LW(p) · GW(p)

Hmm, I'm not sure I buy the analogy here. Can't people just run parametric experiments at smaller scale? E.g., search over a really big space, do evolution style stuff, etc?

Yeah agree parametric/evolution stuff changes things.

But if you couldn't do that stuff, do you agree cognitive labour would plausibly have been a hard bottleneck?

If so, that does seem analogous to if we scale up cognitive labour by 3 OOMs. After all, I'm not sure what the analogue of "parametric experiments" is when you have abundant cognitive labour and limited compute.

comment by cfoster0 · 2025-05-09T23:33:33.581Z · LW(p) · GW(p)

Like I think the view would have to be that "frontier scale" varied along with the 7 OOMs of compute difference, but I'm not sure I buy this.

Wait, why not? I’d expect that the compute required for frontier-relevant experimentation has scaled with larger frontier training runs.

Replies from: ryan_greenblatt
comment by ryan_greenblatt · 2025-05-09T23:42:25.603Z · LW(p) · GW(p)

What is frontier scale and why is this a property that varies over time? Like I care about algorithmic improvement relevant to milestones like automated AI R&D and beyond, so I don't see why the current amount of compute people use for training is especially relevant beyond its closeness to the ultimate level of compute.

Replies from: cfoster0
comment by cfoster0 · 2025-05-10T07:08:30.183Z · LW(p) · GW(p)

Researchers have had (and even published!) tons of ideas that looked promising for smaller tasks and smaller budgets but then failed to provide gains—or hurt more than they help—at larger scales, when combined with their existing stuff. That’s why frontier AI developers “prove out” new stuff in settings that are close to the one they actually care about. [1]

Here’s an excerpt from Dwarkesh’s interview with Sholto and Trenton, where they allude to this:

Sholto Douglas 00:40:32

So concretely, what does a day look like? I think the most important part to illustrate is this cycle of coming up with an idea, proving it out at different points in scale, and interpreting and understanding what goes wrong. I think most people would be surprised to learn just how much goes into interpreting and understanding what goes wrong.

People have long lists of ideas that they want to try. Not every idea that you think should work, will work. Trying to understand why that is is quite difficult and working out what exactly you need to do to interrogate it. So a lot of it is introspection about what's going on. It's not pumping out thousands and thousands and thousands of lines of code. It's not the difficulty in coming up with ideas. Many people have a long list of ideas that they want to try, but paring that down and shot calling, under very imperfect information, what are the right ideas to explore further is really hard.

Dwarkesh Patel 00:41:32

What do you mean by imperfect information? Are these early experiments? What is the information?

Sholto Douglas 00:41:40

Demis mentioned this in his podcast. It's like the GPT-4 paper where you have scaling law increments. You can see in the GPT-4 paper, they have a bunch of dots, right?

They say we can estimate the performance of our final model using all of these dots and there's a nice curve that flows through them. And Demis mentioned that we do this process of scaling up.

Concretely, why is that imperfect information? It’s because you never actually know if the trend will hold. For certain architectures the trend has held really well. And for certain changes, it's held really well. But that isn't always the case. And things which can help at smaller scales can actually hurt at larger scales. You have to make guesses based on what the trend lines look like and based on your intuitive feeling of what’s actually something that's going to matter, particularly for those which help with the small scale.

Dwarkesh Patel 00:42:35

That's interesting to consider. For every chart you see in a release paper or technical report that shows that smooth curve, there's a graveyard of first few runs and then it's flat.

Sholto Douglas 00:42:45

Yeah. There's all these other lines that go in different directions. You just tail off.

[…]

Sholto Douglas 00:51:13

So one of the strategic decisions that every pre-training team has to make is exactly what amount of compute do you allocate to different training runs, to your research program versus scaling the last best thing that you landed on. They're all trying to arrive at an optimal point here. One of the reasons why you need to still keep training big models is that you get information there that you don't get otherwise. So scale has all these emergent properties which you want to understand better.

Remember what I said before about not being sure what's going to fall off the curve. If you keep doing research in this regime and keep on getting more and more compute efficient, you may have actually gone off the path to actually eventually scale. So you need to constantly be investing in doing big runs too, at the frontier of what you sort of expect to work.

[1] Unfortunately, not being a frontier AI company employee, I lack first-hand evidence and concrete numbers for this. But my guess would be that new algorithms used in training are typically proved out within 2 OOM of the final compute scale.

Replies from: ryan_greenblatt
comment by ryan_greenblatt · 2025-05-11T17:02:03.587Z · LW(p) · GW(p)

Sure, but worth noting that a strong version of this view also implies that all algorithmic progress to date has no relevance to powerful AI (at least if powerful AI is trained with 1-2 OOMs more compute than current frontier models).

Like, this view must implicitly think that there is a different good being produced over time, rather than thinking there is a single good "algorithmic progress" which takes in inputs "frontier scale experiments" and "labor" (because frontier scale isn't a property that exists in isolation).

This is at least somewhat true as algorithmic progress often doesn't transfer (as you note), but presumably isn't totally true as people still use batch norm, MoE, transformers, etc.

Replies from: cfoster0
comment by cfoster0 · 2025-05-12T04:43:48.346Z · LW(p) · GW(p)

Yes, I think that what it takes to advance the AI capability frontier has changed significantly over time, and I expect this to continue. That said, I don’t think that existing algorithmic progress is irrelevant to powerful AI. The gains accumulate, even though we need increasing resources to keep them coming.

AFAICT, it is not unusual for productivity models to account for stuff like this. Jones (1995) includes it in his semi-endogenous growth model where, as useful innovations are accumulated, the rate at which each unit of R&D effort accumulates more is diminished. That paper claims that it was already known in the literature as a “fishing out” effect.
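For reference, a minimal sketch of the Jones (1995)-style law of motion being alluded to here; the notation and exact functional form are my reconstruction, not taken from the comment.

```latex
% Jones (1995)-style semi-endogenous growth: law of motion for the stock of
% ideas A(t), with research labor L_A(t). Notation is mine.
\dot{A} = \delta \, L_A^{\lambda} \, A^{\phi}, \qquad 0 < \lambda \le 1, \quad \phi < 1
% \phi < 1 is the "fishing out" effect: as A accumulates, each unit of R&D
% effort yields fewer new ideas. \lambda < 1 captures duplicated parallel effort.
```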

comment by Rohin Shah (rohinmshah) · 2025-05-12T09:57:47.311Z · LW(p) · GW(p)

You might expect the labor force of NormalCorp to be roughly in equilibrium where they gain equally from spending more on compute as they gain from spending on salaries (to get more/better employees).

[...]

However, I'm quite skeptical of this type of consideration making a big difference because the ML industry has already varied the compute input massively, with over 7 OOMs of compute difference between research now (in 2025) vs at the time of AlexNet 12 years ago, (invalidating the view that there is some relatively narrow range of inputs in which neither input is bottlenecking) and AI companies effectively can't pay more to get faster or much better employees, so we're not at a particularly privileged point in human AI R&D capabilities.

SlowCorp has 625K H100s per researcher. What do you even do with that much compute if you drop it into this world? Is every researcher just sweeping hyperparameters on the biggest pretraining runs? I'd normally say "scale up pretraining another factor of 100" and then expect that SlowCorp could plausibly outperform NormalCorp, except you've limited them to 1 week and a similar amount of total compute, so they don't even have that option (and in fact they can't even run normal pretraining runs, since those take longer than 1 week to complete).

The quality and amount of labor isn't the primary problem here. The problem is that the current practices for AI development are specialized to the current labor:compute ratio, and can't just be changed on a dime if you drastically change the ratio. Sure, the compute input has varied massively over 7 OOMs; importantly this did not happen all at once, the ecosystem adapted to it. 

SlowCorp would be in a much better position if it was in a world where AI development had evolved with these kinds of bottlenecks existing all along. Frontier pretraining runs would be massively more parallel, and would complete in a day. There would be dramatically more investment in automation of hyperparameter sweeps and scaling analyses, rather than depending on human labor to do that. The inference-time compute paradigm would have started 1-2 years earlier, and would be significantly more mature. How fast would AI progress be in that world if you are SlowCorp? I agree it would still be slower than current AI progress, but it is really hard to guess how much slower, and it's definitely drastically faster than if you just impute a SlowCorp in today's world (which mostly seems like it will flounder and die immediately).

So we can break down the impacts into two categories:

  1. SlowCorp is slower because of less access to resources. This is the opposite for AutomatedCorp, so you'd expect it to be correspondingly faster.
  2. SlowCorp is slower because AI development is specialized to the current labor:compute ratio. This is not the opposite for AutomatedCorp, if anything it will also slow down AutomatedCorp (but in practice it probably doesn't affect AutomatedCorp since there is so much serial labor for AutomatedCorp to fix the issue).

If you want to pump your intuition for what AutomatedCorp should be capable of, the relevant SlowCorp is the one that only faces the first problem, that is, you want to consider the SlowCorp that evolved in a world with those constraints in place all along, not the SlowCorp thrown into a research ecosystem not designed for the constraints it faces. Personally, once I try to imagine that I just run into a wall of "who even knows what that world looks like" and fail to have my intuition pumped.

comment by faul_sname · 2025-05-09T17:18:59.632Z · LW(p) · GW(p)

I have very different intuitions about 500M GPUs for 1 week vs 200k GPUs with 200 hours of work spread evenly across 50 years.

| | SlowCorp v1 | SlowCorp v2 | NormalCorp v1 | NormalCorp v2 | AutomatedCorp |
| --- | --- | --- | --- | --- | --- |
| Time to work on AI R&D | 50 years | 50 years | 50 years | 50 years | 50 years |
| Number of AI researchers and engineers | 800 | 800 | 4,000 | 4,000 | 200,000 |
| Researcher/engineer quality | Median frontier AI company researcher/engineer | Median frontier AI company researcher/engineer | Similar to current frontier AI companies if they expanded rapidly | Similar to current frontier AI companies if they expanded rapidly | Level of world's 100 best researchers/engineers |
| Time worked | One week of 24/7 work (or four weeks at 40h/week, but the GPUs are paused while the workers aren't working) | 50 years of one 4-hour session per year | One year of 24/7 (or four years of 40h/week, but the GPUs are paused while the workers aren't working) | 50 years of 40 hours/week for 1 month per year | 50 years of 24/7 |
| H100s | 500,000,000 | 200,000 | 10,000,000 | 200,000 | 200,000 |
| Cumulative H100-years | 10 million | 10 million | 10 million | 10 million | 10 million |

I think SlowCorp-v2 would get a lot more done than SlowCorp-v1 (though obviously still a lot less than AutomatedCorp). And also SlowCorp-v2 seems to be a closer analogy than SlowCorp-v1 - both corporations have the same amount of serial time, and my intuition is that you generally can't make a training run go 10x faster just by throwing 10x as many GPUs at it, because you'll be bottlenecked by IO.

And I know "SlowCorp is bottlenecked by IO" is not what the point of this intuition pump was supposed to be, but at least for me, it ended up being the main consideration pumping my intuition.

Replies from: ryan_greenblatt
comment by ryan_greenblatt · 2025-05-09T17:59:46.880Z · LW(p) · GW(p)

Yeah, I discuss this here:

The way I set up the analogy makes it seem like AutomatedCorp has a serial compute advantage: because they have 50 years they can run things that take many serial years while NormalCorp can't. As in, the exact analogy implies that they could use a tenth of their serial time to run a 5 year long training run on 50k H100s, while they could actually only do this if the run was sufficiently parallelizable such that it could be done on 2.5 million H100s in a tenth of a year. So, you should ignore any serial compute advantage. Similarly, you should ignore difficulties that SlowCorp might have in parallelizing things sufficiently etc.

You can also imagine that SlowCorp has 10 million magically good GPUs (and CPUs etc) which are like H100s but 50x serially faster (but still only has 1 week) while AutomatedCorp has 10 million much worse versions of H100s (and CPUs etc) which are 50x serially slower but otherwise the same (and has 50 years still).

Replies from: faul_sname
comment by faul_sname · 2025-05-09T18:29:53.389Z · LW(p) · GW(p)

Also SlowCorp has magically 50x better networking equipment than NormalCorp, and 50x higher rate limits on every site they're trying to scrape, and 50x as much sensor data from any process in the world, and 50x faster shipping on any physical components they need, etc etc (and AutomatedCorp has magically 50x worse of all of those things).

But yeah, agreed that you should ignore all of those intuitions when considering the "1 week" scenario - I just found that I couldn't actually turn all of those intuitions off when considering the scenario.

Replies from: ryan_greenblatt
comment by ryan_greenblatt · 2025-05-09T19:09:45.527Z · LW(p) · GW(p)

Yep, but my understanding is that the time associated with marginal scraping, sensor data, and physical components doesn't matter much when talking about AI progress which is on the order of a year. Or honestly, maybe marginal improvements in these sorts of components don't matter that much at all over this time scale (like freezing all these things for a year wouldn't be much of a tax if you prepped in advance). Not super sure about the situation with scraping though.

comment by Jonas Hallgren · 2025-05-10T09:35:35.613Z · LW(p) · GW(p)

I would overall be more convinced of this view if you could involve some philosophy of science or work from metascience such as Michael Nielsen's work or similar.

I get the intuition pump and I totally understand that if this frame of reference is correct, then FOOM is what you have to accept, yet I disagree with the framing itself. I'll do my best to point out why below by asking a bunch of questions in the general area:

How are you going to ensure that the collective intelligence of AI agents is doing good exploration? How are you ensuring that you're not working within a static regime that is implicitly defined by what you're not thinking of? How will this system discover its own cognitive black swans?

Are you accounting for the computational costs of actually updating your internal models based on perceived KL-divergence? What about the average level of specialization in the agents? How are you going to leverage that optimally?

How are you actually going to design this system in a way that does online learning or similar in a computationally efficient way? If you're not doing online learning, what is the paradigm that allows for novel questions to be asked? 

How will this then translate to new improvements? What is the chain to hardware improvements? Software improvements? What specific algorithms do you think will be sped up and how will they be implemented back into the system? 

Could you detail how more theoretical and conceptual exploration happens within the system itself? I guess I just don't believe that the high levels of parallelism will be easy to get to; I expect the system to be slower, choppier, and more phase-shift-like in its properties.

Of course some of these might be infohazards so you probably shouldn't answer all of the literal questions.

I will also note that I might be completely off with the above questions and that all you have to do is to run some specific algorithms that can discover organisational principles based on optimal collective learning. Yet, how do you run an algorithm to discover what algorithm to run? This seems like a philosophy of science question to me and it forms the basis for the skepticism I still have.

Replies from: ryan_greenblatt
comment by ryan_greenblatt · 2025-05-10T13:00:43.157Z · LW(p) · GW(p)

I totally understand that if this frame of reference is correct, then FOOM is what you have to accept

I'd say something a bit weaker, that if you expect large acceleration in this scenario (>15x), then software intelligence explosion looks likely. And this is one of the biggest cruxes in practice.


I would overall be more convinced of this view if you could involve some philosophy of science or work from metascience such as Michael Nielsen's work or similar.

Is there any reason why this set of objections applies more to AIs than to humans? It sounds like you are rejecting the premise of having AIs which are as good as the best human researchers/engineers. I agree that these factors slow human AI R&D progress at AI companies, but given the condition of having AIs which are this capable, I don't see why you'd expect them to bite harder for the AIs (maybe because the AI-run organization is somewhat bigger?). If anything, I'd guess that (if you accept the premise) AIs will be better at overcoming these obstacles due to better reproducibility, willingness to run methodology/metascience experiments on themselves, and better coordination.

All that said, I agree that a potentially key factor is that the AI capability profile might be importantly weaker than humans in important ways at the time of first having full automation (but these difficulties are overcome by AI advantages like narrow superhumanness, speed, vast knowledge, coordination, etc.). (Note that the scenario in the post is not necessarily talking about the exact moment you first have full automation.) I think this could result in full automation with AIs which are less generally smart, less good at noticing patterns, and less cognitively flexible. So these AIs might be differentially hit by these issues. Nonetheless, there is still the question of how hard it would be for further research to make these AI weaknesses relative to humans go away.

Replies from: Jonas Hallgren
comment by Jonas Hallgren · 2025-05-13T07:21:55.989Z · LW(p) · GW(p)

So if I'm looking for black swans in your model or hidden corners of it, this is where I would look if that makes sense? 

Is there any reason why this set of objections applies more to AIs than to humans? It sounds like you are rejecting the premise of having AIs which are as good as the best human researchers/engineers

I believe so. I believe that if you look at Michael Levin's work, he has this well-put concept of a very efficient memorization algorithm that maps all past data into a very small bandwidth, which is then mapped onto a large future cognitive lightcone. Algorithmically, the main benefit that biological systems have is very efficient re-sampling algorithms; basically, the only way to do this for a human is to be able to frame-shift, and so we have a large optimisation pressure for frame-shifting.

The way that training algorithms currently work seems to be pointing towards a direction where this capacity is a lot more weakly optimised for.

If we look at the psychology literature on creative thinking, they often divide things up into convergent and divergent thinking. We also have the division between selection and generation and I think that the specific capacity of divergent selective thinking is dependent on frame-shifting and I think this is the difficult skill of "rejecting or accepting the frame". 

I think philosophy of science agrees with this, and so I can agree with you that we will see a large speedup. But Amdahl's law and all that: if selective divergent thinking is already the bottleneck, will AI systems really speed things up that much?

(I believe that the really hard problems are within the divergent selective camp as they're often related to larger conceptual questions.)

comment by Davidmanheim · 2025-05-10T19:49:59.393Z · LW(p) · GW(p)

This seems mostly right, except that it's often hard to parallelize work and manage large projects, which seems like it slows things importantly. And, of course, some things are strongly serialized, using time that can't be sped up via more compute or more people. (See: PM hires 9 women to have a baby in one month.)

Similarly, running 1,000 AI research groups in parallel might get you the same 20 insights 50 times, rather than generating far more insights. And managing and integrating the research, and deciding where to allocate research time, plausibly gets harder at more than a linear rate with more groups.

So overall, the model seems correct, but I think the 10x speed up is more likely than the 20x speed up.

Replies from: ryan_greenblatt
comment by ryan_greenblatt · 2025-05-11T01:16:23.423Z · LW(p) · GW(p)

I agree parallelization penalties might bite hard in practice. But it's worth noting that the AIs in the AutomatedCorp hypothetical also run 50x faster and are more capable.

(A strong marginal parallelization penalty exponent of 0.4 would render the 50x additional workers equivalent to a 5x improvement in labor speed, substantially smaller than the 50x speed improvement.)
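A minimal sketch of the arithmetic in the parenthetical, assuming the standard form where N parallel workers count as N^alpha effective workers (the 0.4 exponent is the figure mentioned above; the functional form is my assumption about what's meant).

```python
# Parallelization penalty sketch: with an exponent of alpha, N parallel workers
# count as N**alpha workers' worth of effective serial labor.

alpha = 0.4                 # penalty exponent mentioned in the comment
extra_workers = 50          # AutomatedCorp has 50x NormalCorp's headcount

effective_labor_multiplier = extra_workers ** alpha
print(round(effective_labor_multiplier, 1))  # ~4.8, i.e. roughly the "5x" in the comment

# By contrast, the 50x serial speedup is not discounted by this penalty.
```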

Replies from: gwern
comment by gwern · 2025-05-11T02:20:30.752Z · LW(p) · GW(p)

Maybe it would be helpful to start using some toy models of DAGs/tech trees to get an idea of how wide/deep ratios affect the relevant speedups. It sounds like so far that much of this is just people having warring intuitions about 'no, the tree is deep and narrow and so slowing down/speeding up workers doesn't have that much effect because Amdahl's law so I handwave it at ~1x speed' vs 'no, I think it's wide and lots of work-arounds to any slow node if you can pay for the compute to bypass them and I will handwave it at 5x speed'.

Replies from: ryan_greenblatt
comment by ryan_greenblatt · 2025-05-11T16:57:16.217Z · LW(p) · GW(p)

This isn't that important, but I think the idea of using an exponential parallelization penalty is common in the economics literature. I specifically used 0.4 as around the harshest penalty I've heard of. I believe this number comes from some studies on software engineering where they found something like this.

I'm currently skeptical that toy models of DAGs/tech trees will add much value over:

  • Looking at how parallelized AI R&D is right now.
  • Looking at what people typically find in the economics literature.

(Separately AIs might be notably better at coordinating than humans are which might change things substantially. Toy models of this might be helpful.)

comment by Mis-Understandings (robert-k) · 2025-05-16T19:17:33.568Z · LW(p) · GW(p)

Random thought: AutomatedCorp has a real advantage. That is, inference and training can run on the same GPUs (to a first approximation). So for SlowCorp, if they spend a day deciding and don't commit to runs, they are wasting GPU time. But the other corps don't have this problem. It is a big problem. There is something there; the real question is: how much does thinking about the results of your test queue improve the informational value of the tests you run?