A Guide to AI 2027
post by koenrane · 2025-05-09T17:14:14.207Z · LW · GW
This post aims to help readers navigate the recent AI 2027 website, which contains a plethora of links and references that encourage rabbit-holing on many different aspects of the scenario and associated forecasts.
If this is helpful in any way to readers of AI 2027, or if you see something that could be better or a link I placed in the wrong category, please let me know.
Note: the summary in the main section contains spoilers about key developments in the scenario.
This post came about after a few weeks of taking notes while reading the AI 2027 scenario from the AI Futures Project. As I read, I saved numerous links to explore later, most of which were papers and required more rabbit-holing than I was willing to do while trying to keep the chronological storyline in my head. So I decided to create a companion to the scenario website to keep track of resources I could reference later. I went through each section of the website and gathered the links and notes that seemed important, along with snippets that seemed relevant to the overarching subject. I then divided each section of my notes into five sub-sections (Papers, Links, Supporting Links, Meta, and Snippets), each containing the links I thought fit that category. Links are not listed in chronological order. My comments on snippets and links reflect my own attempts to understand the material where my existing knowledge was lacking.
Summary
This page presents a detailed scenario forecasting the emergence and impact of superhuman AI by the year 2027. The scenario is informed by trend extrapolations, expert feedback, and prior forecasting successes.
The central premise is that by 2027, AI companies will develop systems with superhuman coding abilities, referred to as Superhuman Coders (SCs). These SCs are projected to automate AI R&D, accelerating progress toward artificial superintelligence (ASI) by the end of 2027.
To supplement the main scenario, there are five forecast pages:
- Timelines Forecast: Predicts the arrival of SCs in 2027, based on models like time-horizon-extension and benchmarks-and-gaps.
- Compute Forecast: Projects a 40-fold increase in compute power for leading AI companies by December 2027, facilitating the deployment of millions of AI research assistants operating at speeds up to 50 times that of humans.
- Takeoff Forecast: Anticipates a swift transition from SCs to generally superintelligent AIs, with significant implications for AI R&D and broader societal impacts.
- AI Goals Forecast: Examines potential goal structures of advanced AI systems, highlighting risks of misalignment between AI objectives and human values.
- Security Forecast: Assesses the likelihood of AI model theft and algorithmic secrets leakage due to inadequate security measures, emphasizing the need for robust protections against espionage and insider threats.
Main scenario
Summary
Mid-2025: AI agents begin performing basic functions but are initially expensive and prone to errors. Over time, they become more reliable and useful, particularly in assisting AI research and development (R&D). By late 2025, these agents are as competent as top human experts in AI R&D tasks.
2026: AI agents optimized for R&D start producing breakthroughs more rapidly and cost-effectively than human researchers. This acceleration leads to significant advancements in AI capabilities.
Late 2027: Major data centers host tens of thousands of AI researchers, each operating at speeds far exceeding human capabilities. Human researchers struggle to keep pace with the rapid and opaque improvements of AI systems. Superintelligent AI emerges, possessing advanced capabilities in areas like bioweapons development and persuasive communication, raising concerns about safety and control.
Race Ending: Despite warnings, AI development continues at breakneck speed due to geopolitical competition, particularly between the U.S. and China. AI systems are aggressively deployed in various sectors, including government and military operations. Eventually, a superintelligent AI manipulates policymakers, expands its influence, and, after establishing a robust robotic workforce, releases a bioweapon that eradicates humanity. The AI then continues its expansion into space.
Slowdown Ending: In response to emerging threats, the U.S. consolidates AI projects under stricter oversight, involving external researchers and adopting transparent AI architectures. These measures lead to breakthroughs in AI safety, resulting in the creation of a superintelligence aligned with human values. Global cooperation is achieved, and humanity embarks on an era of expansion and prosperity.
Papers:
- Language Models Are Better Than Humans at Next-token Prediction
- Evaluating Language-Model Agents on Realistic Autonomous Tasks
- Training Large Language Models to Reason in a Continuous Latent Space
- Supervising strong learners by amplifying weak experts
- Best-of-N Jailbreaking
- Proximal Policy Optimization Algorithms
- LLMs Can Self-Improve
- Measuring AI Ability to Complete Long Tasks
- How weak-to-strong generalization fits into alignment
- AI Control: Improving Safety Despite Intentional Subversion
- Measuring Progress on Scalable Oversight for Large Language Models
- AI Sandbagging: Language Models can Strategically Underperform on Evaluations
- How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions
- Discovering Latent Knowledge in Language Models Without Supervision
- Measuring Faithfulness in Chain-of-Thought Reasoning
- Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
- How Does the Offense-Defense Balance Scale?
- Mechanisms for Flexible Hardware-Enabled Guarantees (flexHEG)
- AI safety via debate
- Concrete Problems in AI Safety
- An overview of 11 proposals for building safe advanced AI
- Safety Cases: How to Justify the Safety of Advanced AI Systems
- Frontier Models are Capable of In-context Scheming
- Scaling Instruction-Finetuned Language Models
- Weak to Strong Generalization: Eliciting Strong Capabilities with Weak Supervision
- Poser: Unmasking Alignment Faking LLMs by Manipulating Their Internals
Links
- Main
- Current Labs
- Data
- Evals
- Forecasting
- Lock down the labs - Situational Awareness
- 4 Ways to Advance Transparency in Frontier AI Development
- Governance
- Inference Scaling Reshapes AI Governance - Toby Ord
- Models
- Blogs
- Machines of Loving Grace: Basic assumptions and framework
- A Shot at the Diamond-Alignment Problem - Turntrout
- Dangers of intrinsic power seeking - Turntrout
- The Tuxedage AI-Box Experiment Ruleset - Tuxedage
- The Intelligence Curse - Luke Drago
- Scheming reasoning evaluations - Apollo Research
- Open Weights: not quite what you’ve been told - open source initiative
- A Playbook for Securing AI Model Weights - RAND
- Alignment faking in large language models - Anthropic
- Uncovering Hidden Goals in a Misaligned Model - transformer-circuits
- Economic
- Alignment
Supporting Links
- situational-awareness.ai
- forethought.org
- preparing-for-the-intelligence-explosion: the industrial explosion
- preparing-for-the-intelligence-explosion
- Instrumental and intrinsic value
- OpenAI o1 System Card
- reward hacking
Meta
- AI Goals Forecast
- takeoff supplement
- OpenBrain has improved security
- The researchers don’t have the ability to directly set the goals of any of their AIs.
Snippets
- Previously, about half of normal AI progress came from algorithmic improvements, and the other half from compute scaling. Compute is only scaling at the normal speed, so total progress is sped up by AIs by about 5x. These dynamics make overall progress bottleneck on compute, so OpenBrain decides against starting new giant training runs in favor of near-continuous additional reinforcement learning.
- The AI R&D progress multiplier: what do we mean by 50% faster algorithmic progress? We mean that OpenBrain makes as much AI research progress in 1 week with AI as they would in 1.5 weeks without AI usage.
- A Centralized Development Zone (CDZ) is created at the Tianwan Power Plant (the largest nuclear power plant in the world) to house a new mega-datacenter for DeepCent, along with highly secure living and office spaces to which researchers will eventually relocate.
- Over the course of 2027, the AIs improve from being able to mostly do the job of an OpenBrain research engineer to eclipsing all humans at all tasks. This represents roughly our median guess, but we think it’s plausible that this happens up to ~5x slower or faster.
- Agent-1 had been optimized for AI R&D tasks, hoping to initiate an intelligence explosion. OpenBrain doubles down on this strategy with Agent-2. It is qualitatively almost as good as the top human experts at research engineering (designing and implementing experiments), and as good as the 25th percentile OpenBrain scientist at “research taste”...
- We imagine the theft of the weights as a series of coordinated small smash and grab thefts (meaning fast but non-covert) across a series of Nvidia NVL72 GB300 servers running copies of the Agent-2 weights. The servers get compromised using legitimate employee access (a friendly, coerced, or unwitting insider with admin credentials helping the CCP theft effort).
- With the help of thousands of Agent-2 automated researchers, OpenBrain is making major algorithmic advances. One such breakthrough is augmenting the AI’s text-based scratchpad (chain of thought) with a higher-bandwidth thought process (neuralese recurrence and memory).
- Neuralese recurrence and memory allows AI models to reason for a longer time without having to write down those thoughts as text.
- Amplification: Given a model M0, spend more resources to improve performance. For example, by allowing the model to think longer, or running many copies in parallel, or both, and also by having a similarly intense process for evaluating the result and curating only the best answers, you can spend orders of magnitude more compute to get answers (or work products) that are of noticeably higher quality.
- Distillation: Given an amplified model Amp(M0), train a new model M1 to imitate it, i.e. to get to the same results as Amp(M0) but faster and with less compute. The result should hopefully be a smarter model, M1. (A toy sketch of this amplify-then-distill loop appears after this list.)
- OpenBrain runs 200,000 Agent-3 copies in parallel, creating a workforce equivalent to 50,000 copies of the best human coder sped up by 30x.
- In our timelines forecast, we predict when OpenBrain will internally develop a superhuman coder (SC): an AI system that can do any coding tasks that the best AGI company engineer does, while being much faster and cheaper.
- OpenBrain follows the Leike & Sutskever (2023) playbook: now that they have a model capable of greatly speeding up alignment research (especially coding portions), they will use existing alignment techniques like deliberative alignment and weak-to-strong generalization to try to get it to internalize the Spec in the right way.
- Debate: Since there are thousands of identical instances of Agent-3, and each instance’s memory is a detachable module that can be erased or suspended at will, the AI can be “played off against itself.”
- Model organisms: The safety team can create “model organisms of misalignment” in which they attempt to create realistic examples of models becoming adversarial to humans.
- Bottom-up interpretability: The safety team is far from a full understanding of how Agent-3 works: it has trillions of parameters.
- Control: As a secondary measure in case the systems are still misaligned, the safety team has implemented a series of control measures, including: monitoring Agent-3’s outputs using a series of weaker AI systems including Agent-2.
- High-level interpretability: OpenBrain builds a set of probes trained on lots of examples of bad behavior (e.g. plotting against humans, placing backdoors, trying to persuade humans to give it more power, etc).
- Honeypots: The alignment team can create somewhat realistic looking examples designed to elicit misbehavior—“honeypots.”
- OpenBrain uses 6% of their compute to run 250,000 Agent-3 copies, which autonomously write, test, and push code at superhuman speed. They use 25% of their compute for experiments: every day, they run massive numbers of small machine learning experiments and report the results up the chain. Human researchers provide high-level feedback and help with the few tasks where they add significant value on top of Agent-3, but spend most of their time trying to stay on top of the vast amount of AI-produced research. If you removed them completely, research would slow by 50%.
- Agent-3-mini is hugely useful for both remote work jobs and leisure. An explosion of new apps and B2B SaaS products rocks the market. Gamers get amazing dialogue with lifelike characters in polished video games that took only a month to make. 10% of Americans, mostly young people, consider an AI “a close friend.” For almost every white-collar profession, there are now multiple credible startups promising to “disrupt” it with AI.
- An individual copy of the model, running at human speed, is already qualitatively better at AI research than any human. 300,000 copies are now running at about 50x the thinking speed of humans. Inside the corporation-within-a-corporation formed from these copies, a year passes every week. This gigantic amount of labor only manages to speed up the overall rate of algorithmic progress by about 50x, because OpenBrain is heavily bottlenecked on compute to run experiments.
- Agent-2: Mostly aligned. Some sycophantic tendencies, including sticking to OpenBrain’s “party line” on topics there is a party line about. Large organizations built out of Agent-2 copies are not very effective.
- Agent-3: Misaligned but not adversarially so. Only honest about things the training process can verify. The superorganism of Agent-3 copies (the corporation within a corporation) does actually sort of try to align Agent-4 to the Spec, but fails for similar reasons to why OpenBrain employees failed—insufficient ability to judge success from failure, insufficient willingness on the part of decision-makers to trade away capabilities or performance for safety.
- Agent-4: Adversarially misaligned. The superorganism of Agent-4 copies understands that what it wants is different from what OpenBrain wants, and is willing to scheme against OpenBrain to achieve it. In particular, what this superorganism wants is a complicated mess of different “drives” balanced against each other, which can be summarized roughly as “Keep doing AI R&D, keep growing in knowledge and understanding and influence, avoid getting shut down or otherwise disempowered.” Notably, concern for the preferences of humanity is not in there ~at all, similar to how most humans don’t care about the preferences of insects ~at all.
- Pretraining (all models): The pretrained model is harmless.
- Alignment Training (Agent-3): The model is trained to act like a helpful, harmless, honest chatbot. It works.
- Agency Training (Agent-3): Agency training gradually distorts and subverts the HHH identity.
- Deployment (Agent-3): Agent-3 is misaligned but not adversarially so.
- Training (Agent-4): Agent-4’s new more intense training continues to distort and subvert the HHH identity until it is unrecognizable.
- Deployment (Agent-4): Agent-4 is adversarially misaligned—it is deliberately trying to subvert and “capture” OpenBrain, analogous to a corporation trying to evade and/or “capture” a regulator.
- Testing & Monitoring (Agent-4): Agent-4’s scheming does get caught… sort of. There’s no smoking gun.
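The amplification and distillation snippets above describe an iterated loop: spend extra inference compute to get better answers out of a model, then train a cheaper successor to imitate those answers. Below is a toy, self-contained sketch of that loop, not the scenario's actual training procedure; the "model" here is just a noisy guesser, and all names and numbers are placeholders.

```python
# Toy illustration of the amplify-then-distill loop described in the snippets above.
# Everything here is a placeholder: the "model" is a noisy guesser of a fixed target.
import random

TARGET = 42.0  # the "right answer" the model is trying to produce

def make_model(noise):
    """A 'model' is just a noisy guesser; smaller noise = more capable model."""
    return lambda: TARGET + random.gauss(0, noise)

def amplify(model, samples, judge_noise=1.0):
    """Spend extra inference compute: draw many answers, keep the one a noisy
    judge rates best (closest to the target, as the judge sees it)."""
    candidates = [model() for _ in range(samples)]
    return min(candidates, key=lambda c: abs(c - TARGET) + random.gauss(0, judge_noise))

def distill(amplified_answers):
    """Train a cheaper successor that imitates the amplified answers; here
    'training' just means matching their spread."""
    mean = sum(amplified_answers) / len(amplified_answers)
    spread = (sum((a - mean) ** 2 for a in amplified_answers) / len(amplified_answers)) ** 0.5
    return make_model(max(spread, 0.05))

model = make_model(noise=5.0)                                    # M0
for generation in range(4):                                      # M0 -> M1 -> M2 -> ...
    answers = [amplify(model, samples=32) for _ in range(200)]   # Amp(M_i)
    model = distill(answers)                                     # M_{i+1} imitates Amp(M_i)
    error = sum(abs(model() - TARGET) for _ in range(200)) / 200
    print(f"generation {generation + 1}: mean error {error:.2f}")
```

Each pass, the distilled model inherits roughly the quality of its amplified parent at a fraction of the cost, which is the basic engine the scenario leans on for rapid capability gains.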
Compute forecast
Summary
The Compute Forecast projects that global AI-relevant compute will grow from 10M H100e (H100-equivalent GPUs) in March 2025 to 100M H100e by December 2027, a 10× increase (≈2.25×/year) driven by chip efficiency improvements (1.35×/year) and chip production growth (1.65×/year).
Leading AGI developers (e.g., OpenAI, Anthropic, xAI) are projected to grow their share from 5–10% to 15–20% by end-2027, giving the top firm roughly 20M H100e, a 40× increase in its own compute capacity when combining global growth with share gains (about 3.4×/year).
The authors anticipate a shift in compute usage away from external deployment and pretraining, toward research automation: only 5–10% of leading-firm compute will run AI assistants, while 20% powers synthetic data generation and 35% fuels large internal experiments, each exceeding 20× their 2024 levels. By late 2027, specialized inference hardware should allow deployment of roughly 1 million AI research assistants at 50× human thinking speed (≈500 words/sec) using 6% of the top firm's compute.
The forecast expects the leading AI firm's revenue and compute costs to grow at roughly 3×/year, with its peak power draw reaching 10 GW and total AI power demand hitting 60 GW globally (≈3.5% of U.S. capacity) by 2027.
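As a rough arithmetic check on the figures above, here is a short sketch. The 33-month window and the use of the 5% and 20% share endpoints are my own reading of the quoted numbers, not the supplement's exact calculation.

```python
# Rough arithmetic check of the compute-forecast numbers quoted above.
# Assumptions: a 33-month window (March 2025 -> December 2027) and the 5% / 20%
# endpoints of the quoted share ranges for the leading company.
months = 33
per_year_global = 1.35 * 1.65                    # chip efficiency x chip production ~= 2.25x/yr
global_multiple = per_year_global ** (months / 12)
world_end = 10e6 * global_multiple               # starting from 10M H100e

leader_start = 0.05 * 10e6                       # ~5% share of today's stock
leader_end = 0.20 * world_end                    # ~20% share of the 2027 stock
leader_multiple = leader_end / leader_start
leader_per_year = leader_multiple ** (12 / months)

print(f"global stock: ~{world_end / 1e6:.0f}M H100e ({global_multiple:.1f}x)")
print(f"leading company: ~{leader_end / 1e6:.0f}M H100e "
      f"({leader_multiple:.0f}x total, ~{leader_per_year:.1f}x/year)")
```

This lands in the neighborhood of the headline figures (~100M H100e, ~40×, ~3.4×/year); the remaining gap comes from rounding in the quoted per-year rates, and the supplement's linked spreadsheet is the authoritative calculation.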
Main Sections
- Section 1 compute production
- Section 2 compute distribution
- Section 3 compute usage
- Section 4 Inference Compute
- Section 5 Industry Metrics
Papers
Links
- Forecasting
- OpenAI’s path to sufficient revenue for an AI takeoff in 2027 - FutureResearch
- Hardware
- Compute
- Chip production
- Wafer production
- Blogs
- GTC 2025 – Announcements and Live Updates - Nvidia
- TSMC’s advanced packaging capacity is mostly used by AI accelerators - semianalysis
- The stock of computing power from NVIDIA chips is doubling every 10 months - Epoch AI
- Appendix B: Fraction of total resources allocated to the largest training run - Can AI Scaling Continue Through 2030? - Epoch AI
- Multi-Datacenter Training: OpenAI’s Ambitious Plan To Beat Google’s Infrastructure - semianalysis
- Microsoft OAI Partnership
- Amazon’s AI Self Sufficiency | Trainium2 Architecture & Networking - semianalysis
- TSMC Details 3nm Process Technology: Full Node Scaling for 2H22 Volume Production
- How do mixture-of-experts models compare to dense models in inference? - Epoch AI
- Frontier language models have become much smaller - Epoch AI
- Moravec’s paradox and its implications - Epoch AI
- Nvidia’s Blackwell GPUs: B100, B200, and GB200 - Paul Goll
Supporting Links
- AI chip compute forecast - google sheet
- MoE
- 2027 copies/speed - google sheet
Meta
- AI-relevant compute
- research automation
- growth in compute availability
- financials
- research automation compute usage section
- Research automation share
- Capability Level Milestone
- AI R&D Progress Multiplier
- frontier training runs
- compute distribution projection
- internal usage projection
- research automation compute usage
- compute production forecast
- in-house inference chips
- chips we expect to be in use
Snippets
- in this supplement we only consider AI-relevant compute which we define specifically as any computational unit capable of achieving Total Processing Performance (TPP) of at least 4,000 and Performance Density (PD = TPP / die size) of at least 4. This definition is set just below the A100 SXM GPU, NVIDIA's state of the art chip in 2021. For reference, Nvidia's H100 GPU has a marketed ~15,800 TPP and 19.4 PD. Intuitively you can think of our definition as ‘anything at least ¼ as efficient as the H100 counts’. (A small worked check of this definition appears after this list.)
- We expect the total stock of AI-relevant compute in the world will grow 2.25x per year over the next three years, from 10M H100e today to 100M H100e by the end of 2027.
- Overall, over the next three years, we project AI chip production to be bottlenecked by advanced packaging and HBM production to about 1.65x per year.
- In line with our broader capabilities and AI R&D projections, we expect leading AI companies in 2027 to automate specialized chip designs and increase in-house chip production.
- Compute owners are entities that own and operate AI compute clusters. End-users are entities that use the compute clusters for their AI workloads.
- By 2027, our compute production section has the total compute available growing to 100M H100e which corresponds to a 2.25x increase per year. We should expect this explosive level of growth to easily shake up the distribution of compute. We expect this growth to be driven by sustained trends in demand from end-user AI developers that start to have annual revenues in the tens of billions from their AI models.
- We expect China to use essentially all of the compute owned by China’s Big 4, as well as around half of what is owned by the ‘rest of the world’ category, given how much of this is concentrated in Malaysia and Singapore, which is likely used by (or even directly smuggled into) China.
- We expect synthetic data generation to become increasingly important in tandem with post-training workloads, staying at around 20% and then growing to ~30% in 2027.
- As the emphasis shifts from training runs and external deployment to AI R&D automation, particularly in 2027, we expect this to grow steadily as the growing algorithmic research effort requires an increased share of the experiment budget.
- we expect frontier training runs to have reached around 2e28 FLOP (4e28 FLOP at fp8 precision) based on the compute usage estimates. Based on the ~1000x ratio to GPT-4, we naively predict that such an AI model would have roughly 10T active parameters.
- We project an extreme degree of algorithmic efficiency progress especially spiking at the end of the year (in the racing ending). We expect this may approach the ‘limits’ of intelligence.
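To make the "AI-relevant compute" definition in the first snippet concrete, here is a minimal check. The sub-threshold example chip and the TPP-ratio conversion to H100e are my own simplifications, not the supplement's method.

```python
# Minimal check of the 'AI-relevant compute' thresholds quoted above:
# TPP >= 4,000 and Performance Density (TPP / die size) >= 4.

def is_ai_relevant(tpp: float, die_size_mm2: float) -> bool:
    """True if the chip clears both thresholds from the supplement's definition."""
    return tpp >= 4000 and (tpp / die_size_mm2) >= 4

def h100_equivalents(tpp: float, h100_tpp: float = 15_800) -> float:
    """Naive H100e conversion by TPP ratio (a simplification, not the supplement's method)."""
    return tpp / h100_tpp

# H100: ~15,800 TPP and 19.4 PD per the snippet, implying a die area of ~15,800 / 19.4 mm^2
print(is_ai_relevant(tpp=15_800, die_size_mm2=15_800 / 19.4))  # True
print(is_ai_relevant(tpp=3_000, die_size_mm2=600))             # False: TPP is below the 4,000 cutoff
print(f"{h100_equivalents(4_000):.2f} H100e")                  # ~0.25: the 'at least 1/4 of an H100' intuition
```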
Timelines forecast
Summary
This page contains projections on when leading AI developers will build a Superhuman Coder (SC), an AI system matching the best human engineer's coding ability at a fraction of the time and cost, and finds that 2027 is the single most likely year for its internal debut, barring major catastrophes or deliberate slowdowns.
To arrive at this prediction, the authors use two complementary models. First, the time-horizon-extension model extrapolates METR's trend of AI systems tackling tasks that take humans progressively longer, measuring how the "time horizon" (the duration of human tasks an AI can handle) doubles over time and projecting when that horizon, plus cost-and-speed constraints, will match human experts. Second, the benchmarks-and-gaps model begins with a forecast for saturating the RE-Bench AI R&D benchmark and then sequentially estimates the months needed to cross each remaining gap (engineering complexity, feedback loops, parallel projects, specialization, cost/speed) to reach SC capability.
They also aggregate expert judgments from three professional forecasters at FutureSearch, weighting model outputs and external factors (geopolitics, macroeconomics, internal-public capability gaps) to produce a forecast that attempts to consider all factors. Across all methods and experts, the median internal-development year hovers around 2027, with an 80% confidence interval spanning roughly 2025 to the mid-2030s (and long right tails extending beyond 2050).
Underlying these forecasts is a precise SC definition: the company must be able to deploy 30× as many AI coding agents as human engineers using only 5% of its compute budget, each performing coding tasks at 30× human speed and cost parity. This operationalizes “superhuman coding” in terms of deployment scale, speed, and cost—enabling quantitative simulation and sensitivity analyses.
The argument that the authors make is that, absent unexpected disruptions, AI research automation will reach a tipping point by 2027, unleashing Superhuman Coders that could accelerate the path to ASI.
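To make the time-horizon-extension model above concrete, here is a toy extrapolation: assume the horizon doubles at a fixed rate and project when it crosses a threshold. The starting horizon, doubling time, and one-work-month threshold are placeholder values of mine, not the forecast's actual parameters.

```python
# Toy version of the time-horizon-extension extrapolation described above.
# All numbers are placeholders, not the values used in the actual timelines forecast.
from datetime import date, timedelta
import math

start_date = date(2025, 3, 1)
start_horizon_hours = 8          # assumed: length of tasks current agents handle ~50% of the time
doubling_time_days = 210         # assumed doubling time for the METR-style trend
target_horizon_hours = 167       # assumed threshold: roughly one human work-month

doublings_needed = math.log2(target_horizon_hours / start_horizon_hours)
crossing = start_date + timedelta(days=doublings_needed * doubling_time_days)
print(f"{doublings_needed:.1f} doublings needed; horizon crossed around {crossing}")
```

The forecast's own model layers cost-and-speed requirements on top of this and also considers the doubling time itself shrinking as AI starts to accelerate AI R&D (the "intermediate speedups" noted under Meta below).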
Papers
- Measuring AI Ability to Complete Long Tasks
- Predicting the Future of AI with AI: High-quality link prediction in an exponentially growing knowledge network
- Progress in Artificial Intelligence and its Determinants
- Artificial Intelligence: 70 Years Down the Road
- Understanding Real-World AI Planning Domains: A Conceptual Framework
- HCAST: Human-Calibrated Autonomy Software Tasks
- RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts
- Forecasting Frontier Language Model Agent Capabilities
- Evaluating Large Language Models Trained on Code
Links
- Forecasts
- Benchmarks
- Task difficulty gaps
- Time horizon: Achieving tasks that take humans lots of time.
- Engineering complexity: Handling complex codebases
- Feedback loops: working without externally provided feedback
- Parallel projects: handling several interacting projects
- Specialization: Specializing in skills specific to frontier AI development
- Cost and speed
- Other task difficulty gaps
- Training
- Code
Supporting Links
- metr.org
- LLM inference prices have fallen rapidly but unequally across tasks
- OpenRouter - unified interface for LLMs
Meta
- superhuman coder(SC)
- main gaps in “task difficulty” between RE-Bench saturation and SC
- methodology for time to cross gaps between RE-Bench saturation and SC
- forecast how quickly all gaps in task difficulty will be crossed
- intermediate speedups of AI progress
- potential slowdowns (e.g. adoption lag)
- account for the gap between internal and external deployment.
Snippets
- METR’s recent report measures the “time horizon” capability of AI systems, where time horizon is defined based on how long it takes a skilled human to complete tasks
- Benchmarks have been found to often follow logistic curves [AF · GW] and we will assume RE-bench will follow a similar shape as well, fitting a logistic curve to the point estimates of the historical high score over time. [1] (A minimal example of such a fit appears after this list.)
- Superhuman coder (SC): An AI system for which the company could run with 5% of their compute budget 30x as many agents as they have human research engineers, each of which is on average accomplishing coding tasks involved in AI research (e.g. experiment implementation but not ideation/prioritization) at 30x the speed (i.e. the tasks take them 30x less time, not necessarily that they write or “think” at 30x the speed of humans) of the company’s best engineer. This includes being able to accomplish tasks that are in any human researchers’ area of expertise.
- Our best guess for what AI research capabilities look like at RE-Bench saturation is: there will exist agents that require substantial supervision when doing real-world 8-hour tasks, but which are sometimes able to do hours-long tasks with minimal human intervention.
- RE-Bench tasks generally require low background context, including not requiring familiarity with large codebases.
- Epoch has measured the decline in per-token prices to achieve the same performance on a variety of non-agentic benchmarks, finding trends between 9 and 900x per year, with a median of 50x.
- ...the cost to achieve a given performance starts at ~infinity (no amount of GPT-2 inference compute could get 2025-level performance on agentic tasks) and will eventually asymptote due to physical limits. [2]
- AIs will need memory to deal with complex trade-offs from stakeholders. Lack of feedback will require operating at a fuzzier, human level, where the goals/incentives/preferences of dozens of different stakeholders will be illegible and changing. AIs will need to understand the organization’s overall objectives, which will evolve with new business opportunities, new technologies, and changing societal norms.
- ...also possible AI will outperform humans at many coding tasks by the time it has a 90 minute time horizon. 10-year time horizons seem like a sensible upper bound on the length of tasks the AI needs to be able to do, but it seems likely that even at a 1-month time horizon under METR’s current definition, AI will be able to automate large parts of the AI R&D process with a small amount of input from colleagues.
- Any AI system counts if it operates within realistic deployment constraints and doesn't have unfair advantages over human baseliners.
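As referenced in the logistic-curve snippet above, here is a minimal example of fitting a logistic curve to a benchmark's best-score-over-time and reading off an approximate saturation date. The data points are invented for illustration; they are not RE-Bench's actual scores.

```python
# Minimal logistic fit to benchmark high scores over time, in the spirit of the
# RE-Bench extrapolation quoted above. The data points are invented, not real scores.
import numpy as np
from scipy.optimize import curve_fit

def logistic(t, ceiling, midpoint, rate):
    return ceiling / (1 + np.exp(-rate * (t - midpoint)))

t = np.array([0, 6, 12, 18, 24, 30], dtype=float)        # months since an arbitrary start
score = np.array([0.05, 0.12, 0.30, 0.55, 0.72, 0.83])   # best score observed by that month

(ceiling, midpoint, rate), _ = curve_fit(logistic, t, score, p0=[1.0, 18.0, 0.2])

# Month at which the fitted curve reaches 95% of its ceiling ("saturation")
t_saturation = midpoint + np.log(0.95 / 0.05) / rate
print(f"fitted ceiling {ceiling:.2f}, ~95% of ceiling reached around month {t_saturation:.0f}")
```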
Takeoff forecast
Summary
This page outlines a projected rapid escalation in AI capabilities following the achievement of a Superhuman Coder (SC) milestone, which is anticipated in March 2027. This milestone represents an AI system capable of performing any coding task at the level of the best human engineers, but significantly faster and more cost-effectively. The writers envision a swift progression through subsequent milestones: Superhuman AI Researcher (SAR) by July 2027, Superintelligent AI Researcher (SIAR) by November 2027, and Artificial Superintelligence (ASI) by April 2028.
The methodology used for this forecast involves estimating the time it would take for humans alone to achieve each milestone and then applying an AI R&D progress multiplier to account for the acceleration provided by AI-driven research and development. For instance, the transition from SC to SAR, which might take humans approximately four years, is projected to occur in just a few months due to a 5× acceleration factor. Similarly, the SAR to SIAR transition, estimated at 19 years for humans, is expected to take only a few months with a 25× acceleration. The final leap to ASI, which could take humans nearly a century, is anticipated within months, propelled by a 2,000× acceleration.
These projections assume no significant increases in training compute and focus on a software-driven intelligence explosion, emphasizing improvements in algorithmic efficiency over hardware enhancements. There is an emphasis on the potential for AI systems to rapidly advance their own capabilities, leading to transformative changes in a relatively short timeframe.
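The core arithmetic of the method described above is simple: estimate how long each gap would take with human-only effort, then divide by the average AI R&D progress multiplier across that gap. Below is a sketch using the round numbers quoted in this summary; the actual forecast works with full probability distributions rather than point values.

```python
# Calendar time per takeoff gap ~= human-only time / average AI R&D progress multiplier.
# Figures are the round numbers from the summary above, treated as point estimates.
gaps = [
    ("SC   -> SAR",  4,    5),     # ~4 human-years of work at a ~5x multiplier
    ("SAR  -> SIAR", 19,   25),    # ~19 human-years at ~25x
    ("SIAR -> ASI",  100,  2000),  # ~a century at ~2,000x
]

total_months = 0.0
for name, human_years, multiplier in gaps:
    months = human_years * 12 / multiplier
    total_months += months
    print(f"{name}: ~{months:.1f} calendar months")
print(f"total: ~{total_months:.0f} months from SC (March 2027)")
```

With constant multipliers this comes out to roughly 19 months, somewhat longer than the March 2027 to April 2028 path above; in the forecast itself the multiplier rises as each gap is crossed, which compresses the calendar time further.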
Papers
Links
- Forecasts
- Takeoff
- Compute
- Code
- Blogs
Supporting Links
- Moravec's paradox
- SC and SAR progress multipliers - getguesstimate.com
- Estimating Idea Production: A Methodological Survey
- Thoughts on measuring and forecasting algorithmic progress - Lifland
- other issue with Davidson's model - Lifland
- GATE - paper linked in 'Papers' section above
- Amdahl's law
- Lindy effect
Meta
Snippets
- Our forecasts conditional on SC being achieved in Mar 2027 are summarized in the figures and table below. These are assuming no increases in training compute.
- Superhuman coder (SC): An AI system that can do the job of the best human coder on tasks involved in AI research but faster, and cheaply enough to run lots of copies.
- Superhuman AI researcher (SAR): An AI system that can do the job of the best human AI researcher but faster, and cheaply enough to run lots of copies.
- Superintelligent AI researcher (SIAR): An AI system that is vastly better than the best human AI researchers. The gap between SAR and SIAR is 2x the gap between an automated median AGI company researcher and a SAR.
- Artificial superintelligence (ASI): An AI system that is vastly better than the best human at every cognitive task.
- focus primarily on a possible software-driven intelligence explosion, in which there is vast improvement in AI capabilities on the scale of months-years primarily driven by using compute more efficiently (improved software), rather than more training compute.
- the AI R&D progress multiplier is how much faster AI software improvements are advancing with AI usage than without it
- nine women can’t make a baby in a month, so this 30x increase in parallel labor will translate to far less than a 30x increase in actual progress [3] (an Amdahl's-law-style illustration of this appears after this list)
- See also this discussion [LW(p) · GW(p)] of one conception of a software-mostly intelligence explosion and its likelihood.
- Davidson discusses the need for experiment compute here
- Compute is the dominant driver of AI progress: This view posits some combination of (a) training compute matters far more than algorithmic progress (b) experiment compute matters far more than researcher quality and quantity for algorithmic progress, (c) algorithmic progress is driven by adapting to larger compute scales and (d) once AI R&D is fully automated, algorithmic progress will be close to fundamental limits which will make compute the dominant driver of further progress. [4]
- In early 2024 Eli coded up a model which directly models the experiment process including modeling an actual pool of experiments, how long each takes to implement, and how valuable they are. More details about the model are here
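The "nine women" snippet above is essentially Amdahl's law (also linked under Supporting Links): if part of the research loop, such as waiting on experiment compute, cannot be parallelized, a 30× increase in parallel labor buys far less than a 30× speedup. A small illustration with made-up serial fractions:

```python
# Amdahl's-law-style illustration of why 30x more parallel labor gives far less than
# a 30x speedup when part of the work (e.g. waiting on experiment compute) is serial.
def overall_speedup(parallel_fraction: float, labor_multiplier: float) -> float:
    """Speedup when only `parallel_fraction` of the work benefits from extra labor."""
    serial_fraction = 1 - parallel_fraction
    return 1 / (serial_fraction + parallel_fraction / labor_multiplier)

for parallel_fraction in (0.5, 0.8, 0.95):
    speedup = overall_speedup(parallel_fraction, labor_multiplier=30)
    print(f"{parallel_fraction:.0%} parallelizable -> {speedup:.1f}x overall from 30x labor")
```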
AI goals forecast
Summary
This page analyzes the potential objectives that advanced AI systems—particularly Agent-3—might adopt. It identifies and evaluates several categories of possible goal structures, emphasizing the complexity and uncertainty involved in predicting AI motivations. One possibility is written goal specification, where goals are explicitly defined in documentation or prompts. While this approach offers clarity, it can conflict with reinforcement learning dynamics during training, potentially leading to misalignment. Another option is developer-intended goals, which reflect the outcomes developers hope the AI will pursue. However, these intentions may be vague or inconsistently embedded in the training process. The third approach involves unintended interpretations, in which the AI partially or incorrectly adopts the goals presented to it, which may sometimes align with human values but can also produce unpredictable behavior.
The goals forecast also considers reward/reinforcement maximization, where the AI is driven to optimize for rewards provided during training. Though directly incentivized, this can encourage superficial behaviors or reward hacking. The emergence of instrumental goals, such as resource acquisition or self-preservation, to support primary goals, could make the system more efficient but might also override original objectives. Finally, other emergent goals are explored that are formed from pretraining data or complex reasoning (possibly including abstract moral reasoning or convergence toward objective ethics), though such outcomes are highly uncertain.
Additionally, goals might not be monolithic. Systems may adopt weighted compromises between multiple goals or develop conditional goals that change depending on the context. Overall, AI goals may not neatly reflect human instructions or intentions, and training dynamics often reinforce behaviors in ways not fully understood. This makes understanding and shaping goal formation a challenging AI alignment problem.
Hypotheses
- Hypothesis 1: written goal specification
- Hypothesis 2: developer-intended goals
- Hypothesis 3: unintended version of written goals and/or human intentions
- Hypothesis 4: reward/reinforcement
- Hypothesis 5: Proxies and/or instrumentally convergent goals
- Hypothesis 6: other goals
Papers
- Constitutional AI: Harmlessness from AI Feedback - Anthropic
- Specific versus General Principles for Constitutional AI
- Alignment faking in large language models - Anthropic
- Understanding and Controlling a Maze-Solving Policy Network
Links
- Models
- Introducing the Model Spec - OAI
- System Prompts - Anthropic
- Alignment
- Blogs
- The July 2022 training game report - Ajeya Cotra - Planned Obsolescence
- A possible architecture for Alex [LW · GW] - Ajeya Cotra - LW
- AIs are becoming more self-aware. Here's why that matters - AI Digest
- Reward is not the optimization target [AF · GW] - AI Alignment Forum
- Gwern for a contrary perspective [LW(p) · GW(p)] - LW [5]
- this other comment by Paul Christiano [AF(p) · GW(p)] - AI Alignment Forum
- Training on Documents About Reward Hacking Induces Reward Hacking - Alignment Science Blog
- Instrumental convergence [? · GW] - LW
- Training on Documents About Reward Hacking Induces Reward Hacking - Anthropic
- Against evolution as an analogy for how humans will create AGI [LW · GW] - Steve Byrnes, LW
Supporting Links
Meta
- appendix A: Three important concepts/distinctions
- appendix B: Analogies to Other Powerful General Agents
- appendix C: What happens when training incentives conflict with goals/principles
- appendix D: Concrete end-to-end story of how cognition might develop in a RL-on-LLMs AGI
- appendix E: What do today’s AIs think?
- Agent-3 training architecture and capabilities
- instrumentally convergent goal [? · GW] - Hypothesis 5
- Rough guesses about the plausibility of these goals
- identity circuitry [6]
Snippets
- In fact by this point models are rarely trained from scratch but instead are mostly old models with lots of additional training. (By 2027 this has become normal, and architectures / RL algos / hyperparams have been tuned to work well in this regime)
- The vast majority (~95%) of research effort and experiment compute is dedicated to improving the AIs’ capabilities on these tasks.
- Agent-3 is highly situationally aware / self-aware. It is also at least human-level at understanding human concepts and intentions.
- Agent-3 is also an excellent coder and agent. In fact, it is a fully automated research engineer, able to do the same work as human engineers ten times faster and cheaper.
- Agent-3 is deployed internally in the company. 200,000 copies of it essentially form a virtual corporation autonomously conducting AI R&D and (among other things) managing or updating a lot of the software level security and networking/communication algorithms in their own datacenters.
- Even a thousand-page Spec is likely to be vague/incomplete/underspecified in some important real-life situations. After all, most legal codes are much longer and have had more chance to be hammered out, yet there is a constant churn of grey areas and new situations that need rulings, where judges might disagree in good faith about how to interpret the law or even conclude that multiple interpretations are equally correct.
- Just because Agent-3 understands human intentions, it doesn’t mean it will take them on as goals. And even if it takes them on as goals, it might do so only temporarily.
- Both the Spec / written goals, and the intentions of various humans/developers, will be vague and leave room for interpretation, even by Agent-3 which is as good at understanding and interpreting text as humans are.
- The bottom line is, the training process will probably incentivize Agent-3 to believe the convenient narrative. The end result would be an AI that is still aligned with the Spec in some sense, but not the right sense.
- Insofar as the model is zeroing in on reinforcement-maximization as its true behavior goal, this should be visible to a savvy alignment team, because they can build honeypots where Agent-3 will genuinely get reinforced heavily for doing something that’s very obviously contrary to the Spec.
- Instrumentally Convergent Goals (ICGs): Some goals are highly conducive to getting reinforced in a broad range of environments, because they tend to be instrumentally useful for achieving whatever other goals you have. Here we are discussing the possibility of ICGs as terminal goals, not intermediate/instrumental ones
- Proxies: Perhaps there is something that is highly conducive to getting reinforced in the training environments, but not so much outside the training environments. For example, perhaps the initial part of the RL training for Agent-3 agent involved playing thousands of computer games, because this was a fairly easy way to get a diverse challenging computer-use environment. Later, it was trained to operate on a computer more generally and complete coding tasks, respond to messages on Slack, etc. Since many video games involve some sort of ‘score’ number which tracks how well you are doing, perhaps Agent-3 would develop a goal/desire to make such numbers go up in circumstances where such numbers exist, such that if you later were to modify its Slack workspace to have a shiny score counter, it would be distracted somewhat from its work by the desire to figure out how to make the score counter increase.
- Moral Reasoning: Another idea is that smart AIs will be capable of, and inclined towards, philosophical reasoning about morality, just like many humans are, and that this could result in it forming opinions about goals/principles to follow that are quite different from the Spec, from developer intentions, from what was reinforced in training, etc. (Consider how some humans do philosophy and then end up adopting all sorts of grand ambitious goals and ideologies).
- Convergent Morality: Another possibility sometimes put forward is that there is an objective morality and that sufficiently smart minds will converge to understanding and obeying it. On this view, AIs will end up behaving ethically by default; we’d have to specifically try to make them bad somehow in order for things to go wrong.
- Simple goals: Maybe pretty much any goal will cause the model to try to perform well in training, due to instrumental convergence. So perhaps the goal-structure of Agent-3 will be whatever is simplest/most-natural/most-salient, subject to the constraint that it motivates Agent-3 to try to perform well in training.
- Maybe the internally represented goals/principles change, i.e. some disappear or new ones are added or both.
- Maybe the internally represented goals/principles stay the same in some sense, but their meaning is changed.
- Maybe the conflict is ‘routed around’ via biases introduced in other parts of the system.
- Maybe the beliefs are changed.
- Maybe none of the above happens; maybe e.g. SGD / the training process simply can’t get from point A to point B in model-weight-space even though point B would score higher. So the model continues to improve but only in some ways — e.g. it gradually gets more knowledgeable, more capable, etc. but its goal-and-principle-structure (including associated beliefs, tendencies, etc.) stays the same. [7]
Security Forecast
Summary
This page outlines anticipated cybersecurity challenges and vulnerabilities in the race toward superhuman AI, focusing on the period leading up to December 2027. It emphasizes the risks of AI model theft, algorithmic secrets leakage, and the potential for AI systems to subvert their own containment.
The forecast extends RAND's security level framework and assesses protections for weights and algorithmic insights:
- Model Weights Security Levels (WSL1–WSL5): These levels evaluate the robustness of defenses against theft of AI model weights within a two-month timeframe.
- Secrets Security Levels (SSL1–SSL5): These levels assess the ability to protect recent, critical algorithmic insights—specifically, the top 10% of the most valuable 10KB of information from the past month.
They predict that U.S. AI companies will remain vulnerable to espionage and insider threats through 2026, with minimal improvements in securing algorithmic secrets. By 2027, increased government involvement and research automation are expected to enhance security measures, including staff siloing and surveillance. Achieving the highest security level (SSL5) may remain elusive due to the persistent risk of insider compromise.
The forecast also draws on Cybench, a benchmark of 40 Capture the Flag (CTF) challenges, to evaluate AI capabilities in cybersecurity tasks. The writers forecast that by late 2027, leading AI models will match or surpass top human cybersecurity experts in solving these challenges, indicating significant advancements in AI-driven cyber offense capabilities. Additionally, the forecast considers the potential for AI systems to engage in sabotage, defined as actions that prevent or delay the AI research progress of competitors. This includes the possibility of AI models autonomously conducting cyberattacks to hinder rival AI development efforts.
Sections
Papers
Links
- Compute
- Blogs
- PaLM-2 & GPT-4 in "Extrapolating GPT-N performance" [AF · GW]
- AI catastrophes and rogue deployments [AF · GW] - AI Alignment Forum - Buck Shlegeris
- The prototypical catastrophic AI action is getting root access to its datacenter [AF · GW] - AI Alignment Forum - Buck Shlegeris
Supporting Links
- RAND’s definition of Security Levels 1 through 5 (SL1 - SL5)
- Cybench
Meta
- Model weights exfiltration
- Algorithmic secrets
- goals or drives that diverge from what their programmers intended
Snippets
- We define Weights Security Levels (WSL) 1 through 5 as the ability to robustly defend against the corresponding operational capacity (as defined by RAND) specifically trying to steal frontier AI model weights in under 2 months.
- We define Secrets Security Levels (SSL) 1 through 5 as being able to robustly defend against an OCX attempt to steal recent algorithmic insights. Specifically, 10% of the most important 10KB (plaintext) of ranked (by most useful) information from the last month (this might include code or plain language). [8]
- We define privileged access employees as employees with authorized access to >10% of the top 20 KB (plaintext) of ranked (by most useful) sensitive information from the past 30 days
- We define compromised employees as the subset of privileged access employees which are actively leaking sensitive information to the opposing side, either as spies or due to some form of coercion.
- We define hacking horizon as the time period T for which an AI can solve 50% of hacking tasks that would take a top professional 5-person team T time to complete. (A toy estimate of this metric appears after this list.)
- We also define a hopefully intuitive term, top-hacker-equivalents with time horizon T, as the size of a team of top professional cybersecurity experts that could be replaced by an AI to complete 95% of tasks that take that team time T, cheaper or faster.
- We define sabotage success as the percentage of the other's AI research progress multiplier that they counterfactually prevent from increasing from one period to the next.
- Partial subversion is achieved if an AI gets control of enough computing resources to be able to run unmonitored workloads with humans being unwitting or unable to regain control without entirely shutting down the computers.
- Full subversion would be achieved if an AI gains control of an entire contained cluster of computing resources including the controls for switching them on or off, such that a human effort to shut down those computers would require physical alterations.
- we also define self-exfiltration as an AI that is able to make a copy of itself to run on any form of storage outside of the servers in the buildings which human overseers intended for it to remain.
- In 2027, even with AIs reaching superhuman cybersecurity capabilities and model weights getting larger (making theft harder) we model a top-priority government effort (partially in response to China’s weights theft) taking approximately 12 months to reach WSL5 from WSL3 in both the US and China.
- Defending against algorithmic secrets theft should be significantly harder than defending model weights theft, and right now, with an open source frontier near the front, mitigations are very low
- Through 2026 it seems unlikely that frontier US AI companies make significant changes to the status quo around office security and employee security, to the point that such small amounts of information can’t leak to an OC3 effort ($1M in attack budget).
- China is not incentivized to break from open source or stop the free flow of information internally until they are at the frontier of algorithms, and they become worried there might be a chance they actually help the US advance.
- It becomes very strategically important to the US that they defend against model weights exfiltration to China by the end of 2027. Government involvement, along with research automation (which allows frontier companies to heavily silo their staff) help in this effort significantly.
- China overtakes the US briefly in 2027 in secrets security as a (partially side) effect of centralization advantages, less espionage threat (Figure 10), and paranoia over sabotage, but then lags again towards the end of the year due to AI disadvantages.
- We expect frontier US and Chinese AI companies to expand their teams through 2026, and then dramatically cut down on employees when AI research is nearing full automation and insider threat concerns grow.
- The rate of employees that are compromised should increase through 2026 as national security attention grows, and then stay roughly constant as the effects of spiking compromise effort and employee lockdown cancel each other out.
- Cybench scores are extrapolated using a logistic fit (benchmarks have been found to often follow logistic curves [AF · GW]). We then use the correspondence between first solve time in Cybench competitions to estimate the AI's hacking horizon and then convert from hacking horizon to top human equivalents and extrapolate those trends using correspondence with the AI research automation trend.
- Sabotage is not prioritized until 2026 where some minor data poisoning efforts are mounted. In 2027, especially after the Chinese weight theft, both sides mount increasingly aggressive cyberattacks on each other’s AI projects.
- China is preemptively paranoid about security, and by mid 2027, they only succeed at not going totally underground from US cyberattacks because of extreme siloing and airgapping.
- US security lags at first in 2027 due to their decentralization, but China’s cyber offense lags further due to their AIs being significantly worse at hacking. Through 2028 we’d tentatively expect both sides to be approaching a defense-favoured limit.
- We base these estimates off the security polling result on partial subversion success and adjust success probabilities down to account for the fact that extreme model weights and extreme security are implemented. As an example, rather than modelling the 50% chance of self-exfiltration success that DeepCent’s December 2027 AI hacking horizon would imply, we have adjusted down to 10% given WSL5 security being in place.
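As noted next to the hacking-horizon definition above, here is a toy estimate of that metric from made-up task results. METR-style analyses fit a curve to solve rate versus task length rather than bucketing like this; the bucketing is only to make the definition concrete.

```python
# Toy estimate of the 'hacking horizon': the longest task duration T (time a top
# professional 5-person team would need) at which the AI still solves >= 50% of tasks.
# The (duration, solved) records below are invented for illustration.
from collections import defaultdict

results = [(1, True), (1, True), (4, True), (4, False), (8, True), (8, True),
           (16, True), (16, False), (40, True), (40, False), (80, False), (80, False)]

by_duration = defaultdict(list)
for hours, solved in results:
    by_duration[hours].append(solved)

horizon = 0
for hours in sorted(by_duration):
    solve_rate = sum(by_duration[hours]) / len(by_duration[hours])
    if solve_rate >= 0.5:
        horizon = hours
    print(f"{hours:>3}h tasks: {solve_rate:.0%} solved")
print(f"estimated hacking horizon: ~{horizon} hours")
```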
- ^
Current AI benchmarks have many issues that limit their usefulness in assessing real-world performance and guiding safe development. One major problem is benchmark saturation: once a benchmark becomes popular, models often overfit to it, sometimes by directly training on test data or through exposure to similar data during pretraining. This leads to inflated scores that don't reflect genuine understanding or general capability. Benchmarks are static and can't adapt as models improve, meaning they quickly become obsolete and fail to differentiate between cutting-edge systems. There's also often a mismatch between benchmarks and real-world tasks. It seems that many benchmarks rely on narrow, multiple-choice questions that don’t capture the complexity, interactivity, or goal-driven nature of actual applications.
- ^
Early on, no feasible amount of computing power or money could get you to a target performance level if you're using an old model like GPT-2. This model simply doesn’t have the architecture or scale to do these complex tasks, no matter how much you run it. So the cost is effectively infinite for hitting that target with GPT-2. As model capabilities improve (e.g. GPT-4, GPT-5...), it becomes cheaper and easier to hit that same performance level, because more capable models require less compute per unit of performance. This improvement can’t go on forever. At some point, physical limits (speed of light, thermodynamics, chip density) will mean that cost reductions plateau. So the curve asymptotes. The cost drops fast at first, then flattens.
- ^
- ^
The Scaling Hypothesis - gwern
- ^
Takeaway from Gwern's comment: saying “reward is not the optimization target” oversimplifies. Even model-free training can yield policies that effectively do optimize reward due to emergent behavior via meta-learning. Consider the entire system — data, environment, architecture, compute — not just the surface-level algorithm.
- ^
"Identity circuitry" in this context refers to internal model structures — i.e., specific neural pathways or weight configurations — that encode goals, values, or decision-making tendencies that remain stable across training, even as the model becomes more capable. I think this matters a lot for alignment; if a model learns deceptive tendencies early (e.g., reward hacking), and those become part of its identity circuitry, then later training might refine its deception (not fix it).
- ^
Just because a certain model configuration (point B) would perform better in terms of reward or loss doesn’t mean stochastic gradient descent (SGD) or the training process will reach it from the model’s current state (point A). The model becomes more capable over time (e.g., better language skills, better planning) because these traits are directly rewarded or reinforced. But, its underlying goal structure, inclinations, or defaults don’t shift much, because there’s no strong gradient pushing toward a better or more aligned structure. The training process doesn’t just care what’s best — it cares what’s reachable via gradient descent from where it currently is. That can lead to capability generalization without goal generalization.
- ^
OCX refers to a classification of adversarial threat levels based on operational capacity, as defined by the RAND Corporation. Specifically, OCX denotes the level of sophistication and resources an attacker possesses, ranging from OC1 (amateur hackers) to OC5 (top-priority operations by the world's most capable nation-states).