Forecasting time to automated superhuman coders [AI 2027 Timelines Forecast]

post by elifland, Nikola Jurkovic (nikolaisalreadytaken) · 2025-04-10T23:10:23.063Z

This is a link post for https://ai-2027.com/research/timelines-forecast

Contents

  Summary
  Defining a superhuman coder (SC)
  Method 1: Time horizon extension
      METR’s time horizon report
      Forecasting SC’s arrival
  Method 2: Benchmarks and gaps
    Time to RE-Bench saturation
      Why RE-Bench?
      Forecasting saturation via extrapolation
      AI progress speedups after saturation
    Time to cross gaps between RE-Bench saturation and SC
      What are the gaps in task difficulty between RE-Bench saturation and SC?
      Methodology
      How fast can the task difficulty gaps be crossed?
        Summary table
        Time horizon
    Other factors for benchmarks and gaps
      Compute scaling and algorithmic progress slowdown
      Gap between internal and external deployment
      Intermediate speedups
    Overall benchmarks and gaps forecasts
  Appendix
    Individual Forecaster Views for Benchmark-Gap Model Factors
      Engineering complexity: handling complex codebases
      Feedback loops: Working without externally provided feedback
      Parallel projects: Handling several interacting projects
      Specialization: Specializing in skills specific to frontier AI development
      Cost and speed
      Other task difficulty gaps
    Superhuman Coder (SC): time horizon and reliability requirements
    RE-Bench saturation resolution criteria

Authors: Eli Lifland,[1] Nikola Jurkovic,[2] FutureSearch[3]

This is supporting research for AI 2027. We'll be cross-posting these over the next week or so.

Assumes no large-scale catastrophes happen (e.g., a solar flare, a pandemic, nuclear war), no government or self-imposed slowdown, and no significant supply chain disruptions. All forecasts give a substantial chance of superhuman coding arriving in 2027.

Summary

We forecast when the leading AGI company will internally develop a superhuman coder (SC): an AI system that can do any coding task that the best AGI company engineer does, while being much faster and cheaper. At this point, the SC will likely speed up AI progress substantially, as explored in our takeoff forecast.

We first show Method 1: time-horizon-extension, a relatively simple model which forecasts when SC will arrive by extending the trend established by METR’s report of AIs accomplishing tasks that take humans increasing amounts of time.

We then present Method 2: benchmarks-and-gaps, a more complex model that starts by forecasting saturation of an AI R&D benchmark (RE-Bench), and then forecasts how long it will take to go from that system to one that can handle real-world tasks at the best AGI company.

Finally we provide an “all-things-considered” forecast that takes into account these two models, as well as other possible influences such as geopolitics and macroeconomics.

We also solicited forecasts from 3 professional forecasters from FutureSearch (bios here).

Each method’s results are summarized below:

| | Eli’s SC forecast (median, 80% CI) | Nikola’s SC forecast (80% CI) | FutureSearch aggregate (80% CI) (n=3) |
|---|---|---|---|
| Time-horizon-extension model | 2027 (2025 to 2039) | 2027 (2025 to 2033) | N/A |
| Benchmarks-and-gaps model | 2028 (2025 to >2050) | 2027 (2025 to 2044) | 2032 (2026 to >2050) |
| All-things-considered forecast, adjusting for factors outside these models | 2030 (2026 to >2050) | 2028 (2026 to 2040) | 2033 (2027 to >2050) |

All model-based forecasts have 2027 as one of the most likely years, which is when an SC is achieved in the AI 2027 scenario. The code for our simulation is here.

Defining a superhuman coder (SC)

Superhuman coder (SC): An AI system for which the company could run with 5%[4] of their compute budget 30x as many agents as they have human research engineers, each of which is on average accomplishing coding tasks involved in AI research (e.g. experiment implementation but not ideation/prioritization) at 30x the speed (i.e. the tasks take them 30x less time, not necessarily that they write or “think” at 30x the speed of humans) of the company’s best engineer. This includes being able to accomplish tasks that are in any human researchers’ area of expertise.

Nikola and Eli estimate that the first SC will have at least 50th percentile frontier AI researcher “research taste” as well, but that isn’t required in the definition.

Method 1: Time horizon extension

This model relies upon the extrapolation of the progression of AIs toward being superhuman coders (SCs), as measured by how long it takes humans to do the hardest tasks that the AIs can do (which we call the AIs’ “time horizon”). We heavily draw from METR’s recent report which catalogues a trend of increasing time horizon (pictured below).

We split our forecast into 2 subquestions:

  1. What time horizon and reliability level on METR’s task suite are needed for SC?
  2. When will this time horizon and reliability be reached? This is broken down into:
    1. The current doubling time of the time horizon
    2. How this would change over time, with no AI R&D automation
    3. The difficulty of making a human-cost SC 30x faster and cheaper
    4. Accounting for intermediate speedups and the internal-public gap

The results of our simulation are below.

Our distributions accounting for factors outside of this model are wider.

METR’s time horizon report

METR’s recent report measures the “time horizon” capability of AI systems, where time horizon is defined based on how long it takes a skilled human to complete tasks (more details in footnote).[5]

An AI with an R% time horizon of T time means that it has an average success rate of R% on tasks that take humans T time. We follow their definitions of time horizon and reliability in our modeling, except that we add a constraint that the AI must complete the task at least as quickly and cheaply as humans. This constraint wouldn’t change METR’s results, given that they didn’t spend up to human-parity costs on inference compute.
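To make this concrete, here is a rough sketch (not METR’s exact code) of how a time horizon can be read off from per-task results: fit a logistic curve of AI success probability against the log of human completion time, then invert it at the target reliability R. The task times and success flags below are made up for illustration.

```python
# Illustrative only: toy data, not METR's. Fits success probability vs.
# log(human completion time) and inverts the curve at a target reliability.
import numpy as np
from sklearn.linear_model import LogisticRegression

human_minutes = np.array([2, 5, 15, 30, 60, 120, 240, 480, 960])  # per-task human baselines
ai_succeeded = np.array([1, 1, 1, 1, 0, 1, 0, 0, 0])              # hypothetical AI results

clf = LogisticRegression().fit(np.log(human_minutes).reshape(-1, 1), ai_succeeded)

def time_horizon(reliability: float) -> float:
    """Human task length (minutes) at which predicted AI success equals `reliability`."""
    slope, intercept = clf.coef_[0, 0], clf.intercept_[0]
    return float(np.exp((np.log(reliability / (1 - reliability)) - intercept) / slope))

print(f"50% horizon: {time_horizon(0.5):.0f} min, 80% horizon: {time_horizon(0.8):.0f} min")
```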

The below figure illustrates the methodology:

More details about their HCAST task suite are in this paper, with the below table illustrating the distribution of tasks:

From here on I’ll refer to the METR task suite as HCAST for brevity, given that we’ll be discussing time horizons well above those that SWAA measures and RE-Bench is a small subset of the suite.

Forecasting SC’s arrival

We outline our simulation parameters below. Estimates are 80% CIs of lognormals unless stated otherwise.

Current 80% time horizon

Estimate: 15 minutes (point estimate).

Reasoning: Taken from METR’s time horizon paper. This is Claude Sonnet 3.7’s 80% time horizon.

Time horizon required for SC

Estimates: Eli: 10 years [1 month, 1200 years]. Nikola: 1.5 months [16 hours, 2 work-years (4,000 hours)].

Reasoning: Time horizon required on the real distribution of work tasks, as baselined by the best humans with strong incentives: 6 months (80% CI: [1 week, 12 years]). Time horizon and reliability required on an extrapolation of HCAST, with METR’s current baselining strategy: 10 years [1 month, 1200 years]. More reasoning in the appendix.

Time horizon doubling time as of Mar 2025 on HCAST

Estimate: 4.5 months [2.5 months, 9 months].

Reasoning: Per METR’s report, the doubling time for the 50% time horizon has been roughly:

  1. For their task suite:
    1. 2019-2025 period: 7 months (Figure 1)
    2. 2024 onward: 3.5 months (Figure 19) (few data points)
  2. For SWE-bench Verified, beginning in late 2023: 2.5 months (Figure 11)

For the 80% time horizon, the doubling time is about the same as for the 50% time horizon during 2019-2025 (7.5 rather than 7 months, Figure 6).

Weighing up the above gives us a median of about 4.5 months. The trends over longer time periods are the most robust, but the latest trends are faster.

Will doubling times speed up, slow down, or stay the same?

Probabilities:

  1. Exponential: Eli: 0.45, Nikola: 0.5
  2. Superexponential: Eli: 0.45, Nikola: 0.4
  3. Subexponential: Eli: 0.1, Nikola: 0.1

Reasoning: It’s possible that the time horizon increases superexponentially between now and the level required for SC: i.e., it takes less AI progress to go from a 1-month to a 2-month horizon than from a 1-hour to a 2-hour horizon, for example if long-horizon reasoning generalizes easily from short to long time horizons.

Reasons in favor of superexponentiality include:

  1. Empirical: The METR report finds a 3.5-month doubling time for 2024-2025, compared to a 7-month doubling time for 2019-2025. This is based on few data points. Scaling up agency training provides a potential reason for the trend, as discussed in Section 7.2.2 of the report.[6]
  2. Conceptual: It seems like for humans the gap in difficulty between 1-month and 2-month tasks is smaller than between 1-day and 2-day tasks. It’s unclear whether this will transfer to AIs though, given that thus far, relative to humans, they have solved tasks more through knowledge than through general reasoning.[7]

Therefore we assign a significant probability to growth being superexponential. We also assign a smaller weight to the trend being subexponential.[8] If the growth is superexponential, we make each successive doubling take 10% less time; if it’s subexponential, each successive doubling takes 10% more time (see the sketch below).

Cost and speed adjustment

Estimate: 4 months [0.5 months, 30 months].

Reasoning: Being an SC requires accomplishing tasks 30x faster and 30x cheaper than the best human researchers. However, in the existing METR evaluations they aren’t spending up to human cost, so our starting price point is below humans’.

Eyeballing Figure 13 from the METR report, the AIs are currently about 30x cheaper in the median case for HCAST tasks, and perhaps 5-10x cheaper on average. Analysis of their data shows that AIs are roughly 5x faster on average.

Below, we forecast in some depth how fast AIs will get 30x faster and cheaper starting at human level, based on historical trends in price decreases. Here we take that forecast of 6.9 [1, 48] months and adjust it downwards by about 50%, since we’re starting at 5-10x cheaper and faster than humans.

Gap between internal and external deployment

Estimate: [0.25 months, 6 months].

Reasoning: The current time horizon estimate is for public models, but it is possible that companies have more capable models internally. See more below.
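To make the moving pieces concrete, below is a minimal Monte Carlo sketch of the time-horizon-extension model, using roughly Nikola’s values from the parameters above. It is illustrative only; the actual simulation linked earlier is more detailed (for example, it also models intermediate AI R&D speedups).

```python
# Minimal sketch of Method 1. Parameter values approximate Nikola's entries above;
# treat everything here as illustrative rather than the official simulation.
import numpy as np

rng = np.random.default_rng(0)

def lognormal_from_80ci(low, high, size):
    """Sample a lognormal whose 10th/90th percentiles are `low`/`high`."""
    mu = (np.log(low) + np.log(high)) / 2
    sigma = (np.log(high) - np.log(low)) / (2 * 1.2816)  # z_0.9 ~= 1.2816
    return rng.lognormal(mu, sigma, size)

N = 100_000
current_horizon_min = 15.0                                          # Claude 3.7 Sonnet, 80% horizon
required_horizon_min = lognormal_from_80ci(16 * 60, 4_000 * 60, N)  # [16 hours, 2 work-years]
doubling_months = lognormal_from_80ci(2.5, 9, N)                    # doubling time as of Mar 2025
growth = rng.choice(["exp", "super", "sub"], size=N, p=[0.5, 0.4, 0.1])

n_doublings = np.log2(required_horizon_min / current_horizon_min)

def doubling_series(d, n, r):
    """Total time when each successive doubling takes r times as long as the last."""
    return d * (1 - r**n) / (1 - r)

months = np.where(growth == "exp", doubling_months * n_doublings,
         np.where(growth == "super", doubling_series(doubling_months, n_doublings, 0.9),
                  doubling_series(doubling_months, n_doublings, 1.1)))

months += lognormal_from_80ci(0.5, 30, N)   # cost and speed adjustment
months -= lognormal_from_80ci(0.25, 6, N)   # internal models lead public releases
months = np.maximum(months, 0.0)

years = 2025.25 + months / 12               # simulation clock starts around Mar 2025
print("median:", round(float(np.quantile(years, 0.5)), 1),
      "80% CI:", np.round(np.quantile(years, [0.1, 0.9]), 1))
```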

Method 2: Benchmarks and gaps

Time to RE-Bench saturation

Why RE-Bench?

RE-Bench is a set of challenging and realistic AI R&D tasks with objective scoring functions. They aim to capture the types of work that are commonly done by engineers at AGI companies (e.g., writing scripts to train ML models, optimizing PyTorch code). Since they’re continuously scored, there’s no single amount of time that they take to complete, but human baselines so far have been collected for up to 8 hours, and 8 hours is sufficient time for a competent professional to make significant improvements to their score. By having humans and AI systems complete RE-Bench tasks, we can get a sense of how capable AI systems are at tasks involved in AI R&D.

We focus on a subset of 5 of the 7 RE-Bench tasks due to issues with scoring in the remaining two, and will refer to this subset as “RE-Bench” in the rest of this report. In particular, we exclude Scaling Law Experiment because it’s easy enough for models to succeed at by luck that it’s not appropriate for best-of-k scaffolding, and we exclude Restricted MLM Architecture because Claude 3.7 Sonnet reliably cheats at this task and METR has not yet been able to prompt the model to attempt the task without cheating.

RE-Bench has a few nice properties that are hard to find in other benchmarks and which make it a uniquely good measure of how much AI is speeding up AI research:

  1. Highly relevant to frontier AI R&D.
  2. High performance ceiling. AI agents can achieve significantly above human level, though in practice it will likely be very difficult to score more than roughly 2x the human baseline solutions (i.e., above a score of about 1.5). Median human baseline scores are 0.12 for 2 hours of effort and 0.66 for 8 hours. Current state of the art (SOTA) is Claude 3.7 Sonnet with a score of roughly 0.6, using Best-of-K scaffolding in a scaffold called “modular”.
  3. Human baselines which allow for grounded comparisons between AI and human performance.

We expect that “saturation” under this definition (given below) will happen before the SC milestone is hit. The first systems that saturate RE-Bench will probably be missing a few kinds of capabilities that are needed to hit the SC milestone, as described below.

Forecasting saturation via extrapolation

How high does RE-Bench go?

This table from the RE-Bench paper gives a sense of how high the RE-Bench score could go:

The average of the midpoints of the “estimated ceiling” ranges above is 1.67.[9] To be conservative and make sure the “saturation” level is possible to achieve, we will define a score of 1.5 to mean “saturation” for RE-Bench tasks. We use the resolution criteria from the AI 2025 Forecasting Survey (see the appendix), which includes making sure the model doesn’t cost more than a human per task. A score of 1.5 means the model beats more than around 95% of human baseline runs (the 90th percentile is around 1.22 in RE-Bench Figure 4). But we estimate that 1.5 is approximately the average score of the best human; this is higher than the 95th percentile of baseline runs because of variance between tasks in how hard it is to get a high score, and variance in individuals’ performance, e.g. due to luck.

Running an extrapolation

Benchmarks have often been found to follow logistic curves, and we will assume RE-Bench follows a similar shape, fitting a logistic curve to the point estimates of the historical high score over time. We assume the lower bound of the logistic curve is 0. The upper bound of RE-Bench is not known, so we model it as a normal distribution with a mean of 2.0 and a standard deviation of 0.25. Changing the upper bound doesn’t change the forecast much: a change of 0.25 in the upper bound moves the saturation date by around half a year.

This gives this graph of peak scores allowing Best-of-K and giving most models a 16-hour time budget at the task:[10]

The 80% CI comes from uncertainty about the upper bound of the score and is not meant to represent an epistemic state.

This predicts the date of saturation to be sometime in 2026.
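As an illustration of this extrapolation procedure (not the actual fit), the sketch below fits a logistic curve with an uncertain ceiling drawn from a normal distribution with mean 2.0 and standard deviation 0.25 to made-up historical best scores, then reads off when the curve crosses the saturation score of 1.5.

```python
# Illustrative logistic extrapolation with an uncertain upper bound.
# The (date, score) points are placeholders, not METR's actual data.
import numpy as np
from scipy.optimize import curve_fit

t_obs = np.array([0.2, 0.8, 1.3, 1.7, 2.1])       # years since 2023-01-01 (hypothetical)
y_obs = np.array([0.05, 0.15, 0.30, 0.45, 0.60])  # best RE-Bench score so far (hypothetical)

def logistic(t, k, t0, upper):
    return upper / (1 + np.exp(-k * (t - t0)))

rng = np.random.default_rng(0)
saturation_years = []
for _ in range(1_000):
    upper = rng.normal(2.0, 0.25)  # uncertain ceiling of the benchmark
    if upper <= 1.5:               # this draw's ceiling is below the saturation score
        continue
    popt, _ = curve_fit(lambda t, k, t0: logistic(t, k, t0, upper),
                        t_obs, y_obs, p0=[1.0, 3.0], maxfev=10_000)
    k, t0 = popt
    t_sat = t0 - np.log(upper / 1.5 - 1) / k  # invert the logistic at score 1.5
    saturation_years.append(2023 + t_sat)

print("median saturation year:", round(float(np.median(saturation_years)), 1))
print("80% interval:", np.round(np.quantile(saturation_years, [0.1, 0.9]), 1))
```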

See also this paper which forecasts RE-Bench hitting 1 in 2027. We think the data they used likely led to overly conservative forecasts.

Overall forecasts of RE-Bench saturation

| Forecaster | Forecast |
|---|---|
| Eli, FutureSearch | Lognormal, 80% CI of [2025-09-01, 2031-01-01] |
| Nikola | Lognormal, 80% CI of [2025-08-01, 2026-11-01] |

We expect the logistic forecast to slightly overestimate the rate of progress because we now have additional information that the first quarter of 2025 has passed with no new SOTA scores on RE-Bench reported by METR.

AI progress speedups after saturation

Nikola’s current guess is that algorithmic progress is 3-30% faster with AI chatbots and copilots from the 2022-2024 period than it would be if AI researchers didn’t use them. 

Nikola expects that agents capable of saturating RE-Bench will be roughly twice as useful for productivity as 2024-era AIs, but possibly even more than that. Nikola’s best guess is that algorithmic progress will be [5%, 60%] faster when RE-Bench is first publicly saturated than it was in 2024. Nikola assumes algorithmic progress in 2024 is 50% of overall AI progress. Eli roughly agrees with these estimates.

Our best guess for what AI research capabilities look like at RE-Bench saturation is: there will exist agents that require substantial supervision when doing real-world 8-hour tasks, but which are sometimes able to do hours-long tasks with minimal human intervention. If we imagine RE-Bench saturation level AIs often doing hours-long tasks, it seems plausible that this will be a large speedup (e.g., 50% productivity increase, which after accounting for compute bottlenecks might translate to 15% algorithmic progress speedup) for many AI researchers.

Time to cross gaps between RE-Bench saturation and SC

We now turn to forecasting the time from RE-Bench being saturated to SC.

We first discuss what the main gaps are in “task difficulty” between RE-Bench saturation and SC. Then we describe our methodology for forecasting how long it will take to cross these gaps.

Then we forecast how quickly all gaps in task difficulty will be crossed.

What are the gaps in task difficulty between RE-Bench saturation and SC?

The RE-Bench paper notes four main categories of gaps between saturating RE-Bench and being able to conduct real AI R&D.

In addition to the ones in the above table, we add gaps for specialization and for cost and speed; RE-Bench tasks generally require little background context, including no familiarity with large codebases.

Methodology

In summary: we define milestones that would indicate each task difficulty gap being crossed, in such a way that they get strictly harder and therefore must be crossed sequentially, then estimate the number of months between each milestone and sum these up.

Our approach is:

  1. For each gap after RE-Bench saturation (which is the first “milestone”):
    1. Define a milestone that would indicate the gap being crossed, which is strictly harder than the previous milestone, such that the time between the gaps must be positive. These look like “Same as above, but… [the gap has been crossed, e.g. it can do all the tasks above 30x faster and cheaper]” and can be viewed in this summary table.[11] 

      1. For all task difficulty gaps except for time horizon, after they are crossed they remain at the same level for all future milestones. For the time horizon property, it is allowed to freely increase given that it’s a general difficulty measure so cannot be held constant while a property of the task gets harder.

    2. Estimate the number of months needed to cross that gap at the 2024 rate of AI progress. We have better data for time horizon increases and cost/speed improvements than we have for the other categories, so the others are estimated much less rigorously.
  2. Sum all task difficulty gaps to get the total size of the task difficulty gap, measured in months of AI progress at the 2024 rate of AI progress.
  3. Find the time to cross all task difficulty gaps by incorporating intermediate speedups of AI progress, then add other potential slowdowns (e.g. adoption lag) and account for the gap between internal and external deployment.

All of these gaps (except for the time horizon gap, which is further modeled as explained here) are modelled as lognormals specified by an 80% confidence interval, meaning that the median is always the geometric mean of the lower and upper bounds. Samples are drawn with a positive correlation, because the difficulties of achieving each of the capabilities are likely correlated.[12]
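As a concrete illustration of this sampling scheme, here is a minimal sketch that draws positively correlated lognormal gap sizes via a Gaussian copula with correlation 0.7, as described in the footnote. The 80% CIs are Eli’s values from the summary table below; the time horizon gap, which is modeled separately, is omitted.

```python
# Sketch of correlated lognormal sampling for the non-time-horizon gaps.
import numpy as np
from scipy import stats

gaps_80ci = {                      # 80% CIs in months at the 2024 rate of progress (Eli's)
    "engineering_complexity": (0.5, 18),
    "feedback_loops": (0.8, 45),
    "parallel_projects": (0.5, 4),
    "specialization": (0.5, 6),
    "cost_and_speed": (1, 48),
    "other": (1, 30),
}

rng = np.random.default_rng(0)
n_gaps, n_samples = len(gaps_80ci), 100_000
z90 = stats.norm.ppf(0.9)

# Correlated standard normals -> uniform percentiles -> lognormal quantiles.
corr = np.full((n_gaps, n_gaps), 0.7) + 0.3 * np.eye(n_gaps)
z = rng.multivariate_normal(np.zeros(n_gaps), corr, size=n_samples)
u = stats.norm.cdf(z)

samples = np.empty_like(u)
for i, (low, high) in enumerate(gaps_80ci.values()):
    mu = (np.log(low) + np.log(high)) / 2
    sigma = (np.log(high) - np.log(low)) / (2 * z90)
    samples[:, i] = stats.lognorm.ppf(u[:, i], s=sigma, scale=np.exp(mu))

total_gap = samples.sum(axis=1)    # total task-difficulty gap, in 2024-rate months
print(np.round(np.quantile(total_gap, [0.1, 0.5, 0.9]), 1))
```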

How fast can the task difficulty gaps be crossed?

Summary table

Below we summarize our gap crossing forecasts. FutureSearch below refers to the aggregate of 2 professional forecasters from FutureSearch. Detailed individual rationales for each gap are in the appendix.

| Gap name | Milestone that would indicate the gap being crossed | Predictions for gap size in months at the 2024 rate of progress (median and 80% CI) | Reasoning summary |
|---|---|---|---|
| Time horizon: Achieving tasks that take humans lots of time | Ability to develop a wide variety of software projects involved in the AI R&D process which involve modifying a maximum of 10,000 lines of code across files totaling up to 20,000 lines. Clear instructions, unit tests, and other forms of ground-truth feedback are provided. Do this for tasks that take humans about 1 month (as controlled by the “initial time horizon” parameter) with 80% reliability, and at the same cost and speed as humans. | Eli: 18 [2, 144]; Nikola: 16 [1, 125] (these aren’t lognormal as they’re simulated; see more below); FutureSearch: 12.7 [1.7, 48] | Calculated from needed horizon length and doubling time. |
| Engineering complexity: Handling complex codebases | Ability to develop a wide variety of software projects involved in the AI R&D process which involve modifying >20,000 lines of code across files totaling up to >500,000 lines. Clear instructions, unit tests, and other forms of ground-truth feedback are provided. Do this for tasks that take humans about 1 month (as controlled by the “initial time horizon” parameter) with 80% reliability, and at the same cost and speed as humans. | Eli: 3 [0.5, 18]; Nikola: 3 [0.5, 18]; FutureSearch: 11 [2.4, 33.9] | Estimated via performance trends on METR’s time horizon task suite. |
| Feedback loops: Working without externally provided feedback | Same as above, but without provided unit tests and only a vague high-level description of what the project should deliver. | Eli: 6 [0.8, 45]; Nikola: 3 [0.5, 18]; FutureSearch: 18.3 [1.7, 58] | Estimated from looking at how much removing Best-of-K sampling from RE-Bench diminishes performance. |
| Parallel projects: Handling several interacting projects | Same as above, except working on separate projects spanning multiple codebases that interface together (e.g., a large-scale training pipeline, an experiment pipeline, and a data analysis pipeline). | Eli: 1.4 [0.5, 4]; Nikola: 1.2 [0.5, 3]; FutureSearch: 2 [0.7, 5.3] | Estimated as being very small due to overlap with the engineering complexity and time horizon gaps. |
| Specialization: Specializing in skills specific to frontier AI development | Same as above, except working on the exact projects pursued within AGI companies. | Eli: 1.7 [0.5, 6]; Nikola: 0.4 [0.1, 2]; FutureSearch: 2.4 [0.5, 4.7] | Estimated from the fact that fine-tuning for specific use cases usually doesn’t take long, and the overlap between RE-Bench tasks and real-world coding is large. |
| Cost and speed | Same as above, except doing it at a cost and speed such that there are substantially more superhuman AI agents than human engineers (specifically, 30x more agents than there are humans, each one accomplishing tasks 30x faster). | Eli: 6.9 [1, 48]; Nikola: 6 [1, 36]; FutureSearch: 13.5 [4.5, 36] | Estimated from data on AI capabilities getting cheaper over time. |
| Other task difficulty gaps | SC achieved. | Eli: 5.5 [1, 30]; Nikola: 3 [0.5, 18]; FutureSearch: 14.7 [2, 58.8] | Accounting for unknown unknowns. |

Time horizon

Milestone which would indicate the gap being crossed: Ability to develop a wide variety of software projects involved in the AI R&D process which involve modifying a maximum of 10,000 lines of code across files totaling up to 20,000 lines. Clear instructions, unit tests, and other forms of ground-truth feedback are provided. Do this for tasks that take humans about 1 month (as controlled by the “initial time horizon” parameter) with 80% reliability, and at the same cost and speed as humans.

Time horizon report and definition

METR’s recent report measures the “time horizon” capability of AI systems on software engineering tasks, where time horizon is defined based on how long it takes humans to complete tasks (more details in footnote).[13]

An AI with an R% time horizon of T time means that it has an average success rate of R% on tasks that take humans T time. For more details about their methodology and their task suite HCAST, see above.

Superhuman Coder (SC): initial time horizon and reliability requirements

A superhuman coder (SC) (without speed/cost which are later taken into account) must be able to overall do as good of a job as the combination of all human programmers at an AGI company at their current work.

What time horizon and reliability level does this require? Because the time horizon will continue to increase as future gaps are crossed, we will choose an “initial time horizon" which is somewhat lower than what we think the ultimate SC time horizon will need to be.

Initial time horizon required: 1 month feels roughly right for the low end of the time requirements of difficult coding projects.[14] We’ll take into account some uncertainty here with a lognormal with 80% CI of [4 hours, 6 work-months (1000 hours)]. We roughly guess that future gap crossings will increase the time horizon to about 6 months, which seems reasonable for representing very difficult coding projects.

Reliability required: 80%, though highly uncertain. If the SC is as well-rounded as the human researcher force, this would be somewhat below 50% for a few reasons given in the footnote.[15] Currently AIs are much less well-rounded than humans though, so if they have 40% reliability within human cost/speed they likely only go up to around 45-50% if allowed to take 10x longer. So with current AIs we might need to set a 90+% reliability threshold. SC-level AIs will be much more well-rounded though, due to having very strong agency skills (planning, correcting mistakes, etc.). So we lower it to 80%, which seems roughly right and has the advantage of letting us use METR’s reported data. Uncertainty is not incorporated into our model, for simplicity and because any adjustment to reliability could also be modeled as an adjustment to time horizon instead.

Engineering complexity associated with time horizon extrapolation

We’ll measure the engineering complexity of a task via 2 proxies: (a) lines of code modified and (b) total lines of code in all modified files.

Saturating RE-Bench requires a median of 500 lines of code in modified files, and about 250 lines of code modified. The other 8-hour tasks in the METR time horizon suite require similar amounts.

Based on very rough data analysis, we estimate that each time horizon increase in the METR suite leads to a proportional increase in both proxies. Since there is a 20x increase between 8 hours and 1 month, this would mean an increase to about 10,000 lines of code modified, 20,000 lines of code in all modified files.

The small multiplier between lines modified and lines of code in all modified files reflects an emphasis by RE-Bench and the METR suite on tasks requiring little context, including little familiarity with large codebases.

Saturating RE-Bench: Time horizon and reliability level

As described above, we define RE-Bench saturation as achieving an average score of 1.5, which is about what we think the best human could get given 8 hours for each task.

What would achieving this score mean in terms of reliability at an 8 hour time horizon, relative to the best human?

I think it means more than 50%: as described above, AIs’ skillsets are currently more uneven than humans’, so the AI will perhaps have to be better than top humans at >50% of tasks in order for the average to be the same, because there’s less room to go above 1.5 on the RE-Bench tasks than below it.

This effect seems significant but not huge. My best guess is 60% reliability.

Time horizon forecasts

Estimates are 80% CIs of lognormals.

80% time horizon required for our initial milestone

Estimates: Eli: [8 hours, 6 work-months (1,000 hours)]. Nikola: [8 hours, 6 work-months (1,000 hours)].

Reasoning: Because the time horizon will continue to increase as future gaps are crossed, we choose an “initial time horizon” which is somewhat lower than what we think the ultimate SC time horizon will need to be. A median of roughly 2-4 weeks feels about right for the low end of difficult coding projects. See above for more.

80% time horizon at RE-Bench saturation

Estimates: Eli: [0.5, 15] hours. Nikola: [0.5, 12] hours.

Reasoning: See above for why it’s likely less than 8 hours. My best guess is that the RE-Bench-saturating agent would have 60% reliability at 8 hours. In METR’s report, they find that switching from a 50% to an 80% reliability threshold reduces the time horizon by 5x. So perhaps switching from 60% to 80% reduces it by ~3.5x, giving me a median of ~2.5 hours.

Time horizon doubling time as of Mar 2025 on HCAST

Estimate: [2.5 months, 9 months].

Reasoning: See our rationale in Method 1.

Doubling time at RE-Bench saturation toward our time horizon milestone, on a hypothetical task suite like HCAST but starting with only RE-Bench’s task distribution

Estimate: [0.5 months, 18 months].

Reasoning: We intuitively aggregate the below 3 adjustments to get our estimate.

Adjustment downward and more uncertain for potential trend changes: If the trend is superexponential, the doubling time will be faster than today’s by the time RE-Bench is saturated. The opposite is true if it’s subexponential, which is less likely (see below for reasoning).

Adjustment to be more uncertain based on the distribution shift from normal HCAST to HCAST starting with only RE-Bench: We widen our confidence interval based on uncertainty regarding the starting task distribution. METR’s extrapolation already includes RE-Bench, but it’s a small minority of tasks relative to HCAST.

Adjustment downward due to extrapolation overshooting our milestone: Our guess is that our extrapolation on the METR task suite would “overshoot” and lead to our time horizon milestone being eclipsed on some dimensions by the time AI reaches the required time horizon. Therefore, we make an adjustment down to shift from the METR task suite doubling time to the doubling time on a theoretical task suite for which extrapolation led exactly to the time horizon milestone as we defined it above.

We’ve done some rough extrapolations which indicate that the HCAST extrapolation would in fact lead to about 10,000 lines of code, as we defined the milestone. But our guess is that a naive extrapolation would “overshoot” our milestone with regards to feedback loop difficulty, and potentially other variables as well.

Will doubling times speed up, slow down, or stay the same?

Probabilities:

  1. Exponential: Eli: 0.45, Nikola: 0.5
  2. Superexponential: Eli: 0.45, Nikola: 0.4
  3. Subexponential: Eli: 0.1, Nikola: 0.1

Reasoning: See our rationale in Method 1.

Taking the doubling time and required level, we get these distributions as the size of the time horizon gap:

Other factors for benchmarks and gaps

We assume that no large-scale catastrophes happen (e.g., a solar flare, a pandemic, or a nuclear war), and no government or self-imposed slowdown.

Compute scaling and algorithmic progress slowdown

We assume that the rate of compute scaling is slowed by 2x beginning in 2029 due to reduced ability to increase investments, given that the rate of increase of frontier AI training costs may be difficult to continue past then without SC achieved.

Similarly, we project that if SC isn’t achieved by about 2028, the human research population will begin growing at a slower rate. For simplicity, we also model this as a 2x decrease in the human-driven rate of progress over time. To model complementarity with AI automation, we take the geometric mean of the pace of progress if AI were a fixed multiplier on the human pace (i.e. default_human_plus_ai_rate*0.5) and the pace of progress if AI were fully additive (i.e. default_human_plus_ai_rate-0.5).
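A minimal sketch of this adjustment, assuming the human-only pace of progress is normalized to 1:

```python
# Post-2028 slowdown: halve the human-driven rate, then take the geometric mean
# of the multiplicative and additive interpretations of AI's contribution.
import math

def slowed_rate(default_human_plus_ai_rate: float) -> float:
    multiplicative = default_human_plus_ai_rate * 0.5   # AI scales with the slowed humans
    additive = default_human_plus_ai_rate - 0.5         # AI contribution unaffected
    return math.sqrt(multiplicative * additive)

print(slowed_rate(2.0))  # e.g., a pace of 2x the 2024 rate slows to ~1.22x
```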

Gap between internal and external deployment

Because our models’ forecasts and extrapolations above are based on testing models which have been publicly released, we need to subtract from our forecast to get the arrival time of SC capabilities internal to the AGI developers.

We estimate that at the arrival of SC, AGI developers’ internal capabilities will be ahead of their public releases by a gap modeled as a lognormal with an 80% CI of [0.25 months, 6 months]. This is subtracted from the time-to-achieve-SC to get the time when SC is achieved internally.

Intermediate speedups

In our simulation, the rate of algorithmic progress starts at 1x the 2024 rate in 2025 and reaches [5%, 60%] faster than the 2024 rate at RE-Bench saturation.

The table below shows Eli’s and Nikola’s estimates for how much SCs will speed up algorithmic progress, i.e. the AI R&D progress multiplier (see here for a more detailed definition). These are informed by our estimates for the SC progress multiplier in our takeoff forecast.

| Quantity | Nikola’s estimate | Eli’s estimate |
|---|---|---|
| AI R&D progress multiplier from SC (median, 80% CI of lognormal)[16] | 5.5 [2.0, 20.0] | 8.5 [2.5, 40.0][17] |

We assume that in 2024 algorithmic progress represents half of AI progress, with the other half being compute. Progress might be very fast after the SC milestone: see the takeoff forecast for forecasts on the post-SC capabilities progression.

In the simulation:

  1. We get the number of months needed to bridge all task difficulty gaps at the 2024 rate of AI progress.
  2. We then progress through the total number of “2024-months” of progress, increasing the rate of AI progress according to how much of the process has been completed (in a very small subset of trajectories, the rate of AI progress goes down over time). The rate of progress goes up exponentially from the starting rate to the ending rate, as a function of how much of the total task difficulty gap has been crossed.
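Here is a minimal sketch of step 2, using illustrative single draws for the total gap size, the rate at RE-Bench saturation, and the SC progress multiplier; the real simulation also handles the split between algorithmic and compute-driven progress.

```python
# Sketch of stepping through "2024-months" of progress while the rate of AI
# progress interpolates exponentially from its value at RE-Bench saturation
# to its value at SC. All three inputs below are illustrative single draws.
total_gap_2024_months = 40.0   # one draw of the summed task-difficulty gaps
rate_at_saturation = 1.3       # e.g., 30% faster than the 2024 rate
rate_at_sc = 5.5               # one draw of the SC AI R&D progress multiplier

def rate(fraction_crossed: float) -> float:
    """Exponential (log-space) interpolation between the start and end rates."""
    return rate_at_saturation * (rate_at_sc / rate_at_saturation) ** fraction_crossed

dt = 0.1                        # calendar-month step
progress = calendar_months = 0.0
while progress < total_gap_2024_months:
    progress += rate(progress / total_gap_2024_months) * dt  # 2024-months gained this step
    calendar_months += dt

print(f"{calendar_months:.1f} calendar months from RE-Bench saturation to SC")
```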

Overall benchmarks and gaps forecasts

Running the simulation as described by the parameters we’ve laid out results in this:

With these input distributions:

Appendix

Individual Forecaster Views for Benchmark-Gap Model Factors

Engineering complexity: handling complex codebases

Milestone which would indicate the gap being crossed: Ability to develop a wide variety of software projects involved in the AI R&D process which involve modifying >20,000 lines of code across files totaling up to >500,000 lines. Clear instructions, unit tests, and other forms of ground-truth feedback are provided. Do this for tasks that take humans about 1 month (as controlled by the “initial time horizon” parameter) with 80% reliability, and at the same cost and speed as humans. 

This milestone requires a 2x scaleup in modified lines of code (LOC) and a 25x scaleup in total file LOC relative to the time horizon milestone.

Feedback loops: Working without externally provided feedback

Milestone which would indicate the gap being crossed: Same as above, but without provided unit tests and only a vague high-level description of what the project should deliver.

We recommend that future work consider using METR’s concept of “messiness” from their report in place of or in addition to this milestone. We weren’t able to explore this due to time constraints.

Parallel projects: Handling several interacting projects

Milestone which would indicate the gap being crossed: Same as above, except working on separate projects spanning multiple codebases that interface together (e.g., a large-scale training pipeline, an experiment pipeline, and a data analysis pipeline). 

Specialization: Specializing in skills specific to frontier AI development

Milestone which would indicate the gap being crossed: Same as above, except working on the exact projects pursued within AGI companies. 

Cost and speed

Milestone which would indicate the gap being crossed: Same as above, except doing it at a cost and speed such that there are substantially more superhuman AI agents than human engineers (specifically, 30x more agents than there are humans, each one accomplishing tasks 30x faster). 

Other task difficulty gaps

Milestone which would indicate the gap being crossed: SC achieved. 

Superhuman Coder (SC): time horizon and reliability requirements

A superhuman coder (SC) must be able to overall do as good of a job as the combination of all human programmers at an AGI company at their current work.[19]

What time horizon and reliability level does this require on HCAST?

Eli’s opinion:

Time horizon and reliability required on real distribution of work tasks, as baselined by the best humans with strong incentives:

  1. Time horizon: 6 months (80% CI: [1 week, 12 years]). If AIs can fairly consistently do tasks that take humans 6 months, it seems like they should be able to automate large coding projects. Anything less than 1 week seems highly unlikely to be enough. I’d like to have an even fatter right tail than a lognormal here ideally, but I expect that once we’re getting into the years the trend will likely be pretty superexponential anyway.
    1. An alternate view: Given that human baseliners only score around a 90-minute time horizon, it’s also possible AI will outperform humans at many coding tasks by the time it has a 90-minute time horizon. 10-year time horizons seem like a sensible upper bound on the length of tasks the AI needs to be able to do, but it seems likely that even at a 1-month time horizon under METR’s current definition, AI will be able to automate large parts of the AI R&D process with a small amount of input from other colleagues. We’ll take into account some uncertainty here with a lognormal with 80% CI of [16 hours, 2 work-years (4,000 hours)].
  2. Reliability: 80%. If the SC is as well-rounded as the best humans, this would be somewhat below 50% for a few reasons given in the footnote.[20] Currently AIs are much less well-rounded than humans though, so if they have 40% reliability within human cost/speed they likely only go up to around 45-50% if allowed to take 10x longer. So with current AIs we might need to set a 90+% reliability threshold. SC-level AIs will be much more well-rounded though, due to having very strong agency skills (planning, correcting mistakes, etc.). So we lower it to 80%, which seems roughly right and has the advantage of letting us use METR’s reported data.

Time horizon required on an extrapolation of HCAST (METR’s task suite), with METR’s current baselining strategy: 10 years [1 month, 1200 years].

I’ll keep reliability the same and adjust the time horizon to tune it to METR’s report (their task suite and baselining process), allowing me to forecast more straightforwardly via extrapolation of METR’s results. I make an adjustment based on the considerations below.

Reasons for raising the time horizon requirement:

  1. An extrapolation of the HCAST suite doesn’t cover gaps that will come up in the real world (poor feedback loops is my guess as to the most important gap, see here for some of the candidates that seem most prominent).
  2. The baselines for HCAST are weaker than ideal, which inflates the time horizons relative to the setup assumed above (see more in the time horizon paper and the HCAST paper): (a) they are done by fairly competent people, but not the literal best humans; (b) they aren’t always done by experts; (c) they are done by people with low context (similar to new hires, rather than people already familiar with a codebase).
    1. Baseliners were found to take 5-18x longer than repository maintainers to resolve issues in METR’s code repositories. However, for longer-horizon tasks existing familiarity wouldn’t matter as much, because there would be time to acquire context.

Reason for lowering the time horizon requirement: There might be ways in which an extrapolated HCAST is actually harder than real world tasks (i.e. the opposite of (1) above). For example, some baseline scoring functions are unrealistically unforgiving.

While 1200 years sounds high, I think it’s plausible that there are very big gaps between HCAST and the real world or there are huge gaps between HCAST baselining and SC-level baselines.

Nikola’s opinion: Given that human baseliners only score around a 90-minute time horizon, it’s also possible AI will outperform humans at many coding tasks by the time it has a 90-minute time horizon. 10-year time horizons seem like a sensible upper bound on the length of tasks the AI needs to be able to do, but it seems likely that even at a 1-month time horizon under METR’s current definition, AI will be able to automate large parts of the AI R&D process with a small amount of input from other colleagues. We’ll take into account some uncertainty here with a lognormal with 80% CI of [16 hours, 2 work-years (4,000 hours)].

RE-Bench saturation resolution criteria

Copied over from the AI 2025 Forecasting Survey

Any AI system counts if it operates within realistic deployment constraints and doesn't have unfair advantages over human baseliners.

Tool assistance, scaffolding, and any other inference-time elicitation techniques are permitted as long as:

The PASS@k elicitation technique (which automatically grades and chooses the best out of k outputs from a model) is a common example that we do accept on this benchmark because human baseliners in RE-Bench also have access to scoring metrics (e.g., loss/runtime). So PASS@k doesn't constitute a clear unfair advantage.

[...]

Human cost estimation process:

  1. Rank questions by human cost. For each question, estimate how much it would cost for humans to solve it. If humans fail on a question, factor in the additional cost required for them to succeed.
  2. Match the AI’s accuracy to a human cost total. If the AI system solves N% of questions, identify the cheapest N% of questions (by human cost) and sum those costs to determine the baseline human total.
  3. Account for unsolved questions. For each question the AI does not solve, add the maximum cost from that bottom N%. This ensures both humans and AI systems are compared under a fixed per-problem budget, without relying on humans to dynamically adjust their approach based on difficulty.
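A minimal sketch of this procedure, with hypothetical per-question human costs and a hypothetical AI solve rate:

```python
# Sketch of the human-cost comparison steps above (hypothetical numbers).
def human_cost_budget(human_costs, ai_fraction_solved):
    """Total human cost that the AI system's spending is compared against."""
    costs = sorted(human_costs)                        # 1. rank questions by human cost
    n_solved = round(ai_fraction_solved * len(costs))
    cheapest = costs[:n_solved]                        # 2. cheapest N% of questions
    budget = sum(cheapest)
    if cheapest:                                       # 3. each unsolved question adds the
        budget += (len(costs) - n_solved) * max(cheapest)  # max cost from that bottom N%
    return budget

# Hypothetical costs (in dollars) for 7 questions; the AI solves 4 of 7.
print(human_cost_budget([200, 350, 500, 800, 1200, 2000, 5000], 4 / 7))
```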

  1. ^
  2. ^

     Harvard University, part-time intern at METR

  3. ^

     futuresearch.ai, individual forecaster bios on futuresearch.ai/ai-2027

  4. ^

      The reason 5% is used is that it’s approximately the fraction that we project AI projects to be spending around the time they reach SC: see the “Research automation” row in our compute supplement.

  5. ^

     Time horizon for each task is defined as:

    For tasks without a fixed time cap, the geometric mean of the time taken by humans who completed it.

    For RE-Bench, in which humans had a fixed time cap, success is binarized based on the average score of human baseliners and the time horizon is considered to be the time cap (8 hours).

  6. ^

     The trend would likely further tilt toward superexponentiality if we took into account that the public vs. internal gap seems to have decreased over time. It’s been rumored that GPT-4 was released 7 months after pre-training was complete, while now there seem to be much smaller delays; for example, according to the announcement video, Grok 3 was released a month after pre-training was complete.

  7. ^

     Another argument for eventually getting superexponentiality is that it seems like superhuman AGIs should have infinite time horizons. However, under the definition of time horizon adapted from the METR report above, it’s not clear if infinite time horizons will ever be reached. This is because AIs are graded on their absolute task success rate, not whether they have a higher success rate than humans. As long as there’s a decreasing trend in ability to accomplish tasks as the time horizon gets longer, the time horizon won’t be infinite. This is something that has been observed with human baseliners (see Figure 16 here). Even if infinite horizons are never reached, the time horizons might get extremely large which would still lend some support to superexponentiality. Even so, it’s unclear how much evidence this is for superexponentiality in the regime we are forecasting in.

  8. ^

     Perhaps this could be the case if extending to each successive time horizon requires doing large amounts of training on tasks of that horizon.

  9. ^

     (1.315 + 2.30 + 1.195 + 1.575 + 2.335 + 1.475 + 1.465)/7 = 1.67

  10. ^

     This budget is broken into either 32 30-minute attempts or 8 2-hour attempts, from which we draw the best score, depending on whether 30-minute or 2-hour time limits constituted better elicitation for each particular model.

     

    We only have data from davinci-002 using a single 10-hour time budget. However, we manually reviewed its transcripts and are confident the model’s score is entirely due to noise in scoring rather than any improvements it made to the codebase, and it reliably fails substantially easier HCAST tasks, so we don’t think changing the scaffolding would alter the results.

     

    We also were unable to obtain more than an 8-hour time budget for GPT-4 0314, which is a minor limitation of our results.

  11. ^

     Because each gap is defined using a measurable milestone, this means that our predictions of when gaps will get crossed are empirical forecasts for when certain AI capabilities will be hit. Benchmarks that could actually resolve these predictions don’t yet exist (all relevant current benchmarks will already be saturated at that point), but it’s plausible that they will be created.

  12. ^

     Specifically, in order to model the correlation we model the CDFs of each function, and sample the percentile of each value by finding the CDF of a sample from multivariate normal with a correlation coefficient of 0.7.

  13. ^

     Time horizon for each task is defined as:

    For tasks without a fixed time cap, the geometric mean of the time taken by humans who completed it.

    For RE-Bench in which humans had a fixed time cap, success is binarized based on the average score of human baseliners and the time horizon is considered to be the time cap (8 hours).

  14. ^

     For this parameter only, we mean work-months, not calendar months. One work-month is 4 weeks * 40 hours/week of actual work, meaning a 1-month time horizon corresponds to a 160-hour time horizon as defined in the METR graphs.

  15. ^

     Recall that the time horizon is determined by taking the geometric mean of successful human completion times. A few reasons why this will lead to reliability levels below 50% for SC with the same skill level as baselined human teams, under our definition which requires the SC to solve the task at least as quickly and cheaply as the humans:

    Selecting only for successful achievements before taking the geometric mean artificially deflates the time horizons, leading to lower reliability.

    If many people of similar abilities did the same task over and over, probably the data would be somewhat right-skewed, which if the distribution were lognormal would mean the median and geometric mean are equal. However, it’s plausible that the distribution would be less skewed than lognormal so the geomean would be below the median, leading to lower reliability.

  16. ^

     The lognormal is between (lower bound - 1) and (upper bound - 1), since the lower bound on the multiplier is 1.

  17. ^

     I’m projecting that the first ARE will have roughly 80th percentile research taste, and a 20% chance that it will already be >= the best AI researcher.

  18. ^

     I’m assuming the cognitive tasks don’t involve other bottlenecks, as defined in the ARE milestone. Of course there are some non-cognitive bottlenecks in SWE like compiling, but these can likely be worked around.

  19. ^

     A full SC needs to do this faster and cheaper as well, but this will be discussed later.

  20. ^

     Recall that the time horizon is determined by taking the geometric mean of successful human completion times. A few reasons why this will lead to reliability levels below 50% for SC with the same skill level as baselined human teams, under our definition which requires the SC to solve the task at least as quickly and cheaply as the humans:

    Selecting only for successful achievements before taking the geometric mean artificially deflates the time horizons, leading to lower reliability.

    If many people of similar abilities did the same task over and over, probably the data would be somewhat right-skewed, which if the distribution were lognormal would mean the median and geometric mean are equal. However, it’s plausible that the distribution would be less skewed than lognormal so the geomean would be below the median, leading to lower reliability.
