What’s the short timeline plan?
post by Marius Hobbhahn (marius-hobbhahn) · 2025-01-02T14:59:20.026Z · LW · GW · 30 commentsContents
Short timelines are plausible What do we need to achieve at a minimum? Making conservative assumptions for safety progress So what's the plan? Layer 1 Keep a paradigm with faithful and human-legible CoT Significantly better (CoT, action & white-box) monitoring Control (that doesn’t assume human-legible CoT) Much deeper understanding of scheming Evals Security Layer 2 Improved near-term alignment strategies Continued work on interpretability, scalable oversight, superalignment & co Reasoning transparency Safety first culture Known limitations and open questions None 30 comments
This is a low-effort post. I mostly want to get other people’s takes and express concern about the lack of detailed and publicly available plans so far. This post reflects my personal opinion and not necessarily that of other members of Apollo Research. I’d like to thank Ryan Greenblatt, Bronson Schoen, Josh Clymer, Buck Shlegeris, Dan Braun, Mikita Balesni, Jérémy Scheurer, and Cody Rushing for comments and discussion.
I think short timelines, e.g. AIs that can replace a top researcher at an AGI lab without losses in capabilities by 2027, are plausible. Some people have posted ideas on what a reasonable plan to reduce AI risk for such timelines might look like (e.g. Sam Bowman’s checklist, or Holden Karnofsky’s list in his 2022 nearcast [LW · GW]), but I find them insufficient for the magnitude of the stakes (to be clear, I don’t think these example lists were intended to be an extensive plan).
If we take AGI seriously, I feel like the AGI companies and the rest of the world should be significantly more prepared, and I think we’re now getting into the territory where models are capable enough that acting without a clear plan is irresponsible.
In this post, I want to ask what such a short timeline plan could look like. Intuitively, if an AGI lab came to me today and told me, “We really fully believe that we will build AGI by 2027, and we will enact your plan, but we aren’t willing to take more than a 3-month delay,” I want to be able to give the best possible answer. I list some suggestions but I don’t think they are anywhere near sufficient. I’d love to see more people provide their answers. If a funder is interested in funding this, I’d also love to see some sort of “best short-timeline plan prize” where people can win money for the best plan as judged by an expert panel.
In particular, I think the AGI companies should publish their detailed plans (minus secret information) so that governments, academics, and civil society can criticize and improve them. I think RSPs were a great step in the right direction and did improve their reasoning transparency and preparedness. However, I think RSPs (at least their current versions) are not sufficiently detailed and would like to see more fine-grained plans.
In this post, I generally make fairly conservative assumptions. I assume we will not make any major breakthroughs in most alignment techniques and will roughly work with the tools we currently have. This post is primarily about preventing the worst possible worlds.
Short timelines are plausible
By short timelines, I broadly mean the timelines that Daniel Kokotajlo has talked about for years (see e.g. this summary post [LW · GW]; I also found Eli Lifland’s argument [LW(p) · GW(p)] good; and liked this and this [LW · GW]). Concretely, something like
- 2024: AIs can reliably do ML engineering tasks that take humans ~30 minutes fairly reliably and 2-to-4-hour tasks with strong elicitation.
- 2025: AIs can reliably do 2-to-4-hour ML engineering tasks and sometimes medium-quality incremental research (e.g. conference workshop paper) with strong elicitation.
- 2026: AIs can reliably do 8-hour ML-engineering tasks and sometimes do high-quality novel research (e.g. autonomous research that would get accepted at a top-tier ML conference) with strong elicitation.
- 2027: We will have an AI that can replace a top researcher at an AI lab without any losses in capabilities.
- 2028: AI companies have 10k-1M automated AI researchers. Software improvements go through the roof. Algorithmic improvements have no hard limit and increase super-exponentially. Approximately every knowledge-based job can be automated.
- These are roughly the “geniuses in datacenters” Dario Amodei refers to in his essay “Machines of Loving Grace.”
- There are still some limits to scaling due to hardware bottlenecks.
- 2029: New research has made robotics much better. The physical world doesn’t pose any meaningful limits for AI anymore. More than 95% of economically valuable tasks in 2024 can be fully automated without any loss in capabilities.
- 2030: Billions of AIs with superhuman general capabilities are integrated into every part of society, including politics, military, social aspects, and more.
I think there are some plausible counterarguments to these timelines, e.g. Tamay Besiroglu argues that hardware limitations might throttle the recursive self-improvement of AIs. I think this is possible, but currently think software-only recursive self-improvements are plausible. Thus, the timelines above are my median timelines. I think there could be even faster scenarios, e.g. we could already have an AI that replaces a top researcher at an AGI lab by the end of 2025 or in 2026.
There are a couple of reasons to believe in such short timelines:
- The last decade of progress: Since GPT-2, progress in AI has probably been significantly faster than most people anticipated. Most benchmarks get saturated quickly. And so far, I have not seen strong evidence of reduced benefits of scale. If progress continues roughly at this pace, I would intuitively assume the timelines above.
- AGI companies’ stated timelines: Multiple AI companies have repeatedly stated that they expect roughly the timelines I describe above. While they have some incentive to overplay their capabilities, they are closest to the source and know about advances earlier than the general public, so I think it’s correct to weigh their opinions highly. I’d also note that multiple people who have decided to leave these companies and thus have a reduced incentive to overplay capabilities still continue to argue for short timelines (e.g. Daniel Kokotajlo or Miles Brundage).
- No further major blockers to AGI: I think the recipe of big pre-training run + agent scaffolding + extensive fine-tuning (including RL) will, by default, scale to a system that can automate top AI researchers. There will be engineering challenges, and there might be smaller blockers, e.g. how these systems can track information over longer periods of time (memory), but I don’t expect any of these to be fundamental. I previously expected that effective inference-time compute utilization might pose a blocker (~30%), but o1 and o3 convinced me that it likely won’t.
In conclusion, I can understand why people would have longer median timelines, but I think they should consider the timelines described above at least as a plausible scenario (e.g. >10% likely). Thus, there should be a plan.
What do we need to achieve at a minimum?
There is a long list of things we would love to achieve. However, I think there are two things we need to achieve at a minimum or the default scenario is catastrophic.
- Model weights (and IP) are secure: By this, I mean SL4 or higher, as defined in the report on securing model weights at the time we reach the capability level where an AI can replace a top researcher. If the model weights aren’t secure, either a bad actor could steal them or the model itself could self-exfiltrate. In both cases, most other alignment efforts would be drastically less impactful because there is no way of guaranteeing that they are actually being used. Model security more broadly includes scenarios where a malicious actor can do meaningfully large fine-tuning runs within the AI developer, not just exfiltration. It also includes securing algorithmic secrets to prevent malicious actors from catching up on their own (this likely requires a somewhat different set of interventions than securing model weights).
- The first AI that significantly speeds up alignment research isn’t successfully scheming: At some point, there will be an AI system that significantly (e.g. 10x) speeds up any kind of research. This AI system will likely/hopefully be used for AI safety research. If this AI system is scheming and we’re unable to detect it, or we’re unable to prevent the scheming plans from succeeding (Control [? · GW]), I expect the default outcome to be bad. Thus, a large part of this post is concerned with ensuring that we can trust the first sufficiently powerful AI system and thus “bootstrap” alignment plans for more capable systems.
The default scenario I have in mind here is broadly the following: There is one or a small number of AGI endeavors, almost certainly in the US. This project is meaningfully protected by the US government and military both for physical and cyber security (perhaps not at the maximal level of protection, but it’s a clear priority for the US government). Their most advanced models are not accessible to the public. This model is largely used for alignment and other safety research, e.g. it would compress 100 years of human AI safety research into less than a year. This research is then used to ensure that the next system is sufficiently aligned/controlled and that the model weights continue to be secure.
This feels like a minimal stable solution. Of course, I would prefer if we were more prepared. I also think there are other paths than building more aligned & powerful successors that AI developers could take with human-level AI systems. However, for this post, I will make the conservative assumption that nobody is willing to stop racing. Thus, conditional on a capability race, I’d want them to use their first AGIs to do lots of alignment, control, and security research.
There is a very long list of further assumptions and desiderata that I don’t elaborate on in this post, e.g.
- Leaders in the labs are trustworthy enough and would actually use this first AI system for alignment and other safety research.
- There is no war in the run-up to AGI that would derail the project, e.g. by necessitating that most resources be used for capabilities instead of safety research.
- There is sufficiently good international AI governance.
- The training runs are not sabotaged by external actors, e.g. other governments.
- …
I want as many of those assumptions to hold. However, in this post, I want to primarily focus on the most minimal version.
Making conservative assumptions for safety progress
If we only have 2-3 years until AGI, I think it’s reasonable to make conservative assumptions about AI safety progress. Concretely:’
- No major scientific breakthroughs in interpretability, scalable oversight, or super alignment. We will make continued improvements for all of these techniques, but none of them will be strong enough that they provide sufficient confidence on their own. They are, at best, part of a larger defense-in-depth strategy as supporting arguments.
- We roughly have ++ versions of the tools we currently have, e.g. evals, control, basic alignment training techniques, monitoring, etc.
- There are no major changes in governance. Maybe some of the voluntary commitments will be mandated and third-party auditing is required but there are no major international treaties or bigger legislation in the US and UK.
- There are no major paradigm changes. AGI is based on a scaffolded transformer that is trained with a lot of RL. Some new bells and whistles, such as improved memory techniques, are added, but there is no change in the core architecture or training regime. It is possible that the CoT is not human-legible or faithful anymore after these changes.
I think it is plausible that we will make a lot of alignment progress if we get to the aligned-enough human-level AI systems (i.e. roughly 2028 in my timeline above). This post is mostly concerned with how we get to this point so that we are set up for success and don’t create irreversible catastrophic harm on the way.
So what's the plan?
The following is my best guess for what we should do AT LEAST, but I haven’t thought about it deeply and would hope that the current plans of everyone who’s trying to build AGI are significantly better. This is very much a “throw the kitchen sink at the problem” approach with almost no theoretical grounding. I think a better version of this approach would do a few things differently, e.g.
- Explicit assumptions: It should make more concrete explicit assumptions about where the risk comes from, how much risk we’d expect, what our safety budget is, and what risks we’d be willing to take.
- Defense in depth: The current version is mostly “let’s try different strategies and stack them to reduce risks.” A better version of the plan would choose these different efforts as an explicit defense-in-depth approach where we have strong evidence that the measures are uncorrelated or even anti-correlated such that they cover each other’s weaknesses.
- Consistency: I think the outlined plan is inconsistent in a bunch of ways, e.g. if we assume that the AI company is not willing to pause for more than 3 months, they might also not be willing to accept higher costs for transparent CoT.
The plan is divided into two layers, where the first layer seems absolutely required to me, i.e. any plan that doesn’t include these would very likely yield catastrophically bad results. The second layer is still very important and reduces risk substantially, but is a lower priority than the first layer, in my opinion. Note that this is already very triaged, i.e. there is a third, fourth, and fifth layer that I think a great plan would contain that I don’t even mention here because the post explicitly makes very conservative assumptions about the success of various alignment strategies.
Intuitively, if we were building a car, this plan is the “could we at least add breaks”-plan that doesn’t even consider techniques like airbags, ABS, ESC, seatbelts, or crumple zones.
Layer 1
Everything in this section seems very important to me. In short,
- We should try hard to keep a paradigm with faithful and human-legible CoT
- If we have a paradigm with faithful and legible CoT (where the internal and external reasoning align), our main defense should be based on very detailed monitoring of reasoning and actions.
- If we have a paradigm without faithful and legible CoT, our main defense should be based on control (that doesn’t assume human-legible CoT).
- In both worlds, we should use the other, i.e. control/monitoring, as a second line of defense.
- In all worlds, we should understand why, how, and when models scheme significantly better than we do now.
- For all of the above, we need strong evals. Both in order to gather evidence whether our monitoring and control techniques catch the most important cases, as well as to get a better sense of models’ capabilities and propensities more broadly.
- Under all circumstances, physical and cyber security needs to be high.
Keep a paradigm with faithful and human-legible CoT
In the grand scheme of things, we got very lucky that a large part of the reasoning of current AI systems is directly visible and understandable in plain sight (assuming that the hidden CoT in o1 and o3 is human-legible).
By faithful and human-legible, I mean that the model’s reasoning is done in a way that is directly understandable by humans and accurately reflects the reasons for the model’s actions. It also excludes situations where the model does a lot of internal reasoning that differs from what the CoT states.
At a minimum, I think we should
- All agree to keep this an important consideration for further development, e.g. be willing to take relevant capability hits (e.g. 2x higher cost for normal applications and up to 5x for high-stakes applications) if they lead to more understandable CoT.
- Focus more effort on research on legible and faithful CoT. I feel like there are still a lot of low-hanging fruits to pick, e.g. building a suite of hundreds of evaluations to test whether we’d consider any particular model’s CoT as faithful or to understand which side effects directly optimizing for legibility has (such as hiding the undesired reasoning instead of removing it).
- Deprioritize efforts to remove the visible CoT from our architecture, such as Meta’s COCONUT.
Assuming the legibility tax isn’t that large (e.g. less than 2x), I feel like we’d be shooting ourselves in the foot if we were to build AGI that doesn’t use human-legible CoT from the point where we are now even under these conservative assumptions.
I think my view would be significantly changed if there was evidence that high-compute RL, as used in o1 and o3, leads to non-legible CoT, e.g. where the model develops its own language that is not trivially comprehensible to us. In such a world, I would want
- The AI developers to share this information as soon as possible. I think the message “my model is a really good reasoner AND we don’t understand its reasoning” is significantly scarier than “my model is a really good reasoner BUT we understand its reasoning.”
- This information should change what research and governance strategies should be prioritised. For example, I would want to put signficant effort into decoding whatever language the model has learned to use.
Relatedly, Daniel Kokotajlo has recently written about a concrete proposal in this direction in “Why Don't We Just... Shoggoth+Face+Paraphraser? [LW · GW]”
Significantly better (CoT, action & white-box) monitoring
In the world where we have models with human-legible CoT, my main line of defense would go through really detailed monitoring (and control that doesn’t assume human-legible CoT would be the second line of defense).
Better monitoring seems like an incredibly cost-effective yet neglected approach. Clearly, it would be desirable to have multiple simple monitors analyzing model outputs for a number of generic failure modes and bad behaviors at all times. The kinds of behaviors we could be tracking could, for example, include:
- Different types of scheming, e.g. whether the model is hiding a particular goal, attempting to reduce oversight, taking actions that it knows to be not in the user's interest, and more.
- Different types of power-seeking, e.g. whether they try to acquire resources, try to reduce their access restrictions, try to persuade others in some systematic way, and more.
- Strong broad goal-directedness, e.g. when models appear to strongly pursue a specific goal across many context windows. This requires a comparison between contexts.
There are many possible ways these could be implemented, e.g. we could have black-box-only monitors such as smaller-but-faster models that constantly run in parallel (like Gemini-flash to monitor Gemini) or simple white-box probes trained to detect these problematic behaviors (e.g. linear probes or SAE probes). I would also want to both monitor the actions itself (e.g. tool calls) as well as the reasoning that led to them.
I expect that some model providers are already using online black box monitors for jailbreaks, so there already is a proof of concept. Simple linear probes are incredibly cost-effective (just a simple vector product) and we could run thousands of probes in parallel at almost zero additional overhead.
While these seem like simple approaches, I think nobody has really tried very hard to make them work in practice. Furthermore, while they are conceptually simple, they still require a lot of iteration and tuning, e.g. we would need to create good datasets to train and evaluate our probes and calibrate the thresholds such that they trigger in the correct cases but don’t provide too many false positives. Especially in the case where it is hard to build really good monitoring systems, we should know that as soon as possible so that AI developers can put more effort into them.
I think every AI lab should have a full team just to build these monitors, make sure they work as intended, analyze their outcomes, make them efficient to use, and more. I would prioritize efforts to build high-sensitivity monitors for high-stakes situations, e.g. developer-internal agentic deployments, over good monitors for average-stakes deployments, e.g. most chatbot use-cases.
My general feeling is that these monitors would both directly prevent harm but also unlock a lot of other benefits, e.g. we could understand the effectiveness of our alignment techniques faster and more thoroughly.
Control (that doesn’t assume human-legible CoT)
In the world where we have models without human-legible CoT, my main line of defense would go through control that doesn’t assume human-legible CoT (and monitoring would be the second line of defense).
I specifically mean the Redwood Research definition of control, where we make the worst-case assumption that the model is scheming and it is not necessarily visible to us in CoT. We then attempt to develop protocols that maximize the usefulness of these models while minimizing risks.
While CoT monitoring might be useful for some control approaches, I specifically want to see further development of techniques that don’t assume that the CoT is legible to humans, as done by Redwood Research.
In general, I think the kind of work required for control is something like what Redwood has been doing just 100x more and applied to frontier systems by the AGI labs. Some concrete suggestions include:
- Agents: We want control to work on frontier AI agents. Prior work has focused on minimally agentic systems or agents on smaller tasks. I think labs should aim to work on applying their control techniques on their most advanced systems as fast as possible to start getting practical experience with it.
- Realism: Apply control in the kind of situations where we expect that we will need it the most, e.g. frontier LM agents doing AI R&D and whatever else frontier AI companies are using their AI systems internally.
- Rules for blue & red teams: I think there are still a lot of open questions around what the blue and red teams are allowed to do for control protocols. Under some variants, the red team is completely overpowered and can almost certainly find a winning strategy, e.g. when it has access to all the knowledge of the blue team and is allowed to train against the blue team methods. While very conservative assumptions might be totally justified in relevant cases, I think we should be much more precise with what reasonable assumptions for the teams look like across different cases. My guess is that “reasonableness” can be derived from deployment assumptions, i.e. the red team is allowed to use anything that the model would have access to in a realistic deployment. In the case where the most conservative assumptions are correct in almost all cases, we should know this as soon as possible since this likely makes defense significantly harder.
- Relaxed assumptions: The assumptions of control are intentionally conservative, e.g. it assumes that the model is constantly scheming. We might be able to define tighter bounds that provide similar safety but add fewer constraints (for more, see Ryan Greenblatt’s writing, e.g. here [LW(p) · GW(p)]).
This is just a small subset of directions in control. Afaik there will be a significantly more detailed list of control projects and open problems soon.
Much deeper understanding of scheming
By the time we have AI systems that drastically speed up or replace core researchers at AI lab, I think we should have a much better understanding of scheming. Concretely, we should
- Detection tools: Have significantly better tools to detect scheming in the first place, e.g. a much larger suite of black-box evaluations and better white-box techniques.
- How scheming develops: Have a much better understanding of when and how scheming develops. Concretely, we should understand where power-seeking tendencies come from, how models learn non-myopic tendencies, and when models would be willing to scheme for these tendencies.
- Real-world propensities: Understand how often scheming-related behavior appears in practice. For example, I think it would be very insightful to analyze 100 million real model responses, especially in agentic settings, and see how often models do something like scheming in them. This seems like a low-effort, high-insight project (but requires access to a large amount of real deployment data).
- Hidden scheming: Understand which training methods lead to less scheming vs. less detectible scheming. For example, punishing your model every time it shows scheming behavior may lead to less scheming, or it might just get better at hiding it. My guess is that we could say much more about which situations prioritize one over the other if we studied it more.
- Prepare responses: Think about what we would do in response to different scheming-related evidence. For example, under which circumstances would we do nothing, retrain the model, or throw away the model entirely? Buck Shlegeris has written about this, e.g. here [LW · GW], here [LW · GW], and here [LW · GW].
- More capability evals: Understand how capable models are when they are scheming, e.g. how easy it is for them to subvert oversight or mess with their training in undesired and hidden ways. An example of this would be extended work on the sleeper agents model organisms.
I expect that a lot of this work is on model organisms, i.e. where researchers actively study one particular part of scheming in isolation similar to Sleeper Agents or Ryan Greenblatt+Anthropic’s work on Alignment faking in large language models.
Evals
Evals are a cornerstone of a lot of frontier safety commitments, and I think they are a useful evidence-gathering tool that can inform other strategies. Within this plan, evals fulfill two important roles.
- Support other strategies: Evals are the default tool to determine whether our monitoring or control approaches are working. My guess is that building the evals for these approaches is at least as much work as coming up with the approach itself. Given the stakes at hand, we would want to have a lot of confidence in these evaluations and have a large range of evals with broad coverage.
- General information gathering: Beyond the support that evals provide for the other components of this plan, they are very useful to give a general overview of the capabilities and propensities that the models have. Concretely, I think we should make progress on the following efforts:
- Capability evaluations: We should continue to build more and better capability evaluations, especially for AI R&D, scheming, CBRN, cyber, and persuasion. Having a detailed understanding of what current models are capable of continues to seem very important to me.
- Elicitation: How do we elicit maximal capabilities from a model such that we can make statements about approximate upper bounds? How can we quantify “how hard we tried” or how close we are to maximal elicitation?
- Predictive evals: What is the correct way to predict capabilities ahead of time? How can we know when we would start seeing signal on a capability benchmark that all models currently get 0 performance on? I’m optimistic about various ideas related to observational scaling laws, but I think there are many other good things to try.
- Alignment evals: So far, most efforts have been focused on capability evaluations. I think we now have to start shifting significantly more effort on alignment / propensity evaluations. More and more capability benchmarks show meaningful signal, so we need to start tracking propensities. By default, I would encourage people to build adversarial alignment evals, e.g. honeypots for power-seeking, scheming, non-myopic preferences, and more. Broadly speaking, we should be able to build a detailed “psychological profile” of any new model for their default tendencies, e.g. similar to the MACHIAVELLI benchmark but in much more detail.
- Demos: The information the researchers find has to be communicated to a range of people, e.g. other scientists, decision-makers within AI companies, and policymakers. For some of these, the correct form of communication is a rigorous paper that carefully considers all possible alternative explanations. For others, it is a salient demonstration in a setting that they (or their constituents) intuitively understand and care about. Demos can be the difference between a problem being perceived as theoretical and distant versus real and urgent.
Overall, I think there should be significantly more resources in evals, see e.g. the evals gap (other safety efforts should also be funded more).
Security
Improved information and physical security seems like an important layer of defense for many failure modes. We need to ensure that model weights and algorithmic secrets cannot get stolen, that models cannot exfiltrate themselves, and that no internal employees can run large, unmonitored finetuning runs. I think the report on securing model weights is great and we should basically treat its recommendations as a good default (though it’s plausible that more thinking would suggest a slightly different strategy). Buck Shlegeris’ post, “Access to powerful AI might make computer security radically easier,” also makes important recommendations.
I guess that the core challenge under short timelines will be correct prioritization. It is likely impossible to implement everything that would be suggested in the SL5 recommendations of the “securing model weights” report within the next 2-3 years. Thus, triaging seems necessary. I could also imagine that a lot of the security strategies that are important right around the point when an AI company automates their AI R&D have a few years of development time and might not feel immediately required right now, so they will only exist if an AI developer is acting now with a lot of foresight.
Finally, I’d have hoped that the internal infosec teams at frontier AI companies get access to frontier models ~3 months before external deployment both so they can defensively harden their infosec as well as conduct offensive testing. My guess is that this is not the norm, but in a sane world, an AI company would do something like using their own models to improve their own defensive capabilities before they release it. I expect this is largely blocked by inertia and company-internal politics. So, making this a norm as soon as possible would likely go a long way.
Unfortunately, I’m not an expert in security and would welcome concrete suggestions for a short-timeline security strategy from someone who is.
Layer 2
Everything in this layer still seems quite important to me, but it is less important than those in layer 1.
Improved near-term alignment strategies
I think RLHF and RLAIF empirically work better than I would have expected in ~2020. For example, they make the model more robustly behave as I want it to, the model can infer my intentions better, and the HHH training generalizes further than I had anticipated.
For example, I did not expect that the models would internalize HHH strongly enough to be willing to scheme for it, e.g. in Alignment faking in large language models and in our follow-up experiments on in-context scheming [LW · GW]. While I think this should be considered evidence that current alignment techniques go “deeper” than some people thought, it also opens up new problems.
Since our existing techniques can already introduce non-myopic and non-corrigible tendencies into the models, I think we should spend significantly more effort ensuring that these techniques keep the models myopic and corrigible. Furthermore, I think we should understand which kind of goals the models get in a lot more detail, e.g. in our follow-up experiments [LW · GW], it seemed that “the models just want to be maximally helpful” was not the best description of the goals revealed preferences and at least to some extent they also showed some self-preservation without this being instrumental to their helpfulness. I would like to understand this in much more detail and build fine-grained “cognitive profiles” that we closely track for various alignment techniques.
I also expect that the strategies for internal and external use should differ meaningfully. For example, making models jailbreak-resistant could decrease their corrigibility. However, if you intend to deploy the model only to highly trusted internal staff, you don’t have to train it against jailbreaks. I’m currently unsure which training techniques you would want to use for agentic long-horizon internal deployments, e.g. because jailbreaks might be less relevant but taking non-power-seeking actions is much more relevant.
Likely, the main bottleneck to success is building tests and environments that are stable enough that we’d be happy to iterate against them (not necessarily train against them, but at least regularly evaluate on them).
Continued work on interpretability, scalable oversight, superalignment & co
I like efforts going into improving fundamental research in interpretability, scalable oversight, superalignment & co (this is broadly the category for empirical research that is not covered in layer 1) and I want people to continue to make progress on the fundamentals. However, I think in a short-timeline world, figuring out the fundamentals first is too risky of a bet and I deprioritize them in this plan.
There are three types of efforts I would want to see in particular:
- Continued fundamental research: Continuous refinement of conceptual clarity, methods, and evaluation techniques but not on the use cases that we ultimately want them to use for. For example, “Debating with More Persuasive LLMs Leads to More Truthful Answers” uses a dataset that has ~zero safety relevance but is a good way to test and showcase the method.
- Horizontal applied bets: Use state-of-the-art tools in these fields and apply them to a use-case that strongly resembles the use-case we expect to find when there is human-level automated R&D. For example, this could be something like “train an SAE on our frontier model and try to make the best possible ‘flag-bad-behavior’-monitor we can build,” or “use the techniques from the debate paper on the environments in Sabotage Evaluations for Frontier Models.”
- Preparing for human-level AI R&D: I think it’s possible that short timeline logic changes how to prioritize research efforts. For example, it might be logical to almost exclusively do conceptual work and write 100 project proposals that can then be executed by the AIs instead of doing a lot of empirical research yourself (I haven’t thought this through in detail, so the counterarguments might be overwhelming).
My current guess is that none of these techniques are good enough that we would want to put a lot of confidence in them but I think they could be used as a supporting argument in an overall defense-in-depth approach. Of course, I’d love to be proven wrong and find out that any of them could be fully load-bearing.
Reasoning transparency
I think AI companies should make their reasoning around safety plans for human-level AGI more transparent. For example, voluntary commitments like safety frameworks were a good step in the right direction. They allowed externals to understand the broad strokes of these plans and criticize them. I think not every criticism was accurate or helpful, but I expect it to be highly net positive. I also liked Anthropic's “Three Sketches of ASL-4 Safety Case Components” and would like to see more efforts in this direction.
Realistically, I think the priorities should be roughly:
- Communicate the plan internally: Whatever the best current version of the plan is should be communicated with all important decision-makers within the organization. If the plan is not very developed, that should be communicated as well. In the best case, I think the internal plan should be fully transparent and explicit about assumptions like the safety budget, risk the organization is implicitly willing to take, clear red lines, etc. For example, if the risk tolerance of the leadership is much higher than that of the average employee, then employees should be aware of that, in my opinion.
- Communicate the plan with external experts: I think there are big benefits to communicating these plans with external experts, e.g. from independent organizations, academia, or AI safety institutes. This has many reasons, e.g. specific expertise, uncorrelated thinking, different incentives, and more. Again, I think it would be sensible to be fully transparent with the external experts and e.g. have NDAs with them.
- Communicate the plan with the general public: Morally speaking, I think companies should share their plans in quite a lot of detail with the public. If your organization states that they are building a technology as powerful as AGI that affects approximately everyone, I think it would be reasonable for approximately everyone to be able to understand why they should or shouldn’t expect this technology to be safe. In practice, I understand that PR is typically risk-averse and thus, I don’t expect this to happen in the scope I’d like it to. Even under PR constraints, I think it is entirely possible for an AI developer to publish something significantly more detailed than the Anthropic RSP or safety case sketch, and would like to see that.
Safety first culture
I think an organization that aims to build AGI should ideally have internalized a safety-first approach at all levels. Everyone, from leadership to interns, has to understand the magnitude of what they’re working on and the implications if they make mistakes. The kind of culture I have in mind can likely only be found in militaries today, if at all. I think we’re currently very far away from that culture and not on track to reach it.
A few examples of my desired culture include:
- Safety has to be deeply integrated into the entire development process. I think some AI companies often treat these as separate processes where they first have a “capabilities phase” where they build the most capable model and then a “safety phase” where they apply their safety procedures afterward. With more capable models, this is incredibly risky since the most capable models might already cause significant harm during development, e.g. attempt to exfiltrate their weights. At a minimum, I think AI companies should move toward continuous safety testing during development, such as running evals at intermediary checkpoints to track behavior over time.
- I think there should be an obvious understanding within the organization that new capability advances are first used for safety, e.g. for cybersecurity hardening. These norms should be established now, not only when we reach these capabilities.
- Treat safety as an asset. The more parts of the AI development process can be automated, the more important a safety mindset becomes. Since capability gains are less reliant on humans, integrity will be a very valuable resource for most human researchers. At the point where we have reliable automated researchers and engineers, some seemingly unconventional actions might suddenly become rational. For example, it might be reasonable to fire the N% least trustworthy employees and instead give the remaining (100-N)% more access to compute. The risks of reckless actions just become too big at that point.
I think it’s plausible that the question of whether AI will be the best or worst technology that humanity has built may come down to a few high-stakes decisions that are made with little time and under high pressure. I find it hard to see a world where we make the best decisions unless safety is top of mind for all people involved.
Assuming short timelines, I think it’s likely impossible to reach my desired levels of safety culture. This implies that
- We should triage: If I had to prioritize, I think the most important part of the culture should be that the leadership is in agreement on what evidence they would need to see to pause any further development and actively seek that evidence. Furthermore, they should encourage a culture where pointing out potential safety flaws is seen as expected & positive rather than hindering & negative.
- Think about plans that assume that the safety culture is insufficient: In practice, this likely looks like significant restrictions on most employees since the default assumption is that you cannot trust their decision-making. This is likely bad for employee happiness but might be required in the absence of a military-level safety culture.
Known limitations and open questions
I already list some of the limitations above, but want to give a bit more context here.
- Generic improvements: As I said in the beginning, this plan should really be seen more as an encouragement to come up with a better plan than a particularly good plan in itself. I’d be particularly interested in missing pieces of layer 1 or people who would prioritize substantially differently, assuming short timelines. I’m less interested in feedback like “If we had substantially more time, I’d do X.”
- Explicit assumptions: A better version of this post would make more concrete explicit assumptions about where the risk comes from, how much risk we’d expect, what our safety budget is, and what risks the AI developer would be willing to take. I think a better plan would have a much stronger connection between these high-level assumptions and the concrete actions taken. For example, I think it would be interesting for someone to do the exercise of “assume our safety budget is X and our risk is Y, which actions seem feasible, and how much risk reduction do we expect to get from this budget compared to not doing anything at all.” This will be hard to quantify but even basic guesstimates would already be informative.
- Defense in depth: The current version is mostly “let’s try different strategies and stack them to reduce risks.” A better version of the plan would choose these different efforts as an explicit defense-in-depth approach where we have strong evidence that the measures are uncorrelated or even anti-correlated such that they cover each other’s weaknesses. I’d be very interested in empirical experiments to compare the failure modes of different approaches in practice and see where and how they are correlated. For example, I’d be keen to know whether the failures of black-box monitors and simple white-box probes are correlated or not.
- Details: The proposed plan is just a high-level sketch. Every single section could be its own 50-page report. I’d love to see really detailed write-ups for each section. For example, I’d be interested to see a really detailed write-up about which monitors to build, how to prioritize them, how to measure their efficacy, how to make them more efficient, and all the other little details that make things hard in practice.
30 comments
Comments sorted by top scores.
comment by Akash (akash-wasil) · 2025-01-02T20:35:53.105Z · LW(p) · GW(p)
Big fan of this post. One thing worth highlighting IMO: The post assumes that governments will not react in time, so it's mostly up to the labs (and researchers who can influence the labs) to figure out how to make this go well.
TBC, I think it's a plausible and reasonable assumption to make. But I think this assumption ends up meaning that "the plan" excludes a lot of the work that could make the USG (a) more likely to get involved or (b) more likely to do good and useful things conditional on them deciding to get involved.
Here's an alternative frame: I would call the plan described in Marius's post something like the "short timelines plan assuming that governments do not get involved and assuming that technical tools (namely control/AI-automated AI R&D) are the only/main tools we can use to achieve good outcomes."
You could imagine an alternative plan described as something like the "short timelines plan assuming that technical tools in the current AGI development race/paradigm are not sufficient and governance tools (namely getting the USG to provide considerably more oversight into AGI development, curb race dynamics, make major improvements to security) are the only/main tools we can use to achieve good outcomes." This kind of plan would involve a very different focus.
Here are some examples of things that I think would be featured in a "government-focused" short timelines plan:
- Demos of dangerous capabilities
- Explanations of misalignment risks to senior policymakers. Identifying specific people who would be best-suited to provide those explanations, having those people practice giving explanations and addressing counterarguments, etc.
- Plans for what the "trailing labs" should do if the leading lab appears to have an insurmountable lead (e.g., OpenAI develops a model that is automating AI R&D. It becomes clear that DeepMind and Anthropic are substantially behind OpenAI. At this point, do the labs merge and assist? Do they try to do a big, coordinated, costly [LW(p) · GW(p)] push to get governments to take AI risks more seriously?)
- Emergency preparedness– getting governments to be more likely to detect and appropriately respond to time-sensitive risks.
- Preparing plans for what to do if governments become considerably more concerned about risks (e.g., preparing concrete Manhattan Project or CERN-for-AI style proposals, identifying and developing verification methods for domestic or international AI regulation.)
One possible counter is that under short timelines, the USG is super unlikely to get involved. Personally, I think we should have a lot of uncertainty RE how the USG will react. Examples of factors here: (a) new Administration, (b) uncertainty over whether AI will produce real-world incidents, (c) uncertainty over how compelling demos will be, (d) chatGPT being an illustrative example of a big increase in USG involvement that lots of folks didn't see coming, and (e) examples of the USG suddenly becoming a lot more interested in a national security domain (e.g., 9/11--> Patriot Act, recent Tik Tok ban), (f) Trump being generally harder to predict than most Presidents (e.g., more likely to form opinions for himself, less likely to trust the establishment views in some cases).
(And just to be clear, this isn't really a critique of Marius's post. I think it's great for people to be thinking about what the "plan" should be if the USG doesn't react in time. Separately, I'd be excited for people to write more about what the short timelines "plan" should look like under different assumptions about USG involvement.)
Replies from: marius-hobbhahn, otto-barten, milosal↑ comment by Marius Hobbhahn (marius-hobbhahn) · 2025-01-02T21:59:33.287Z · LW(p) · GW(p)
I would love to see a post laying this out in more detail. I found writing my post a good exercise for prioritization. Maybe writing a similar piece where governance is the main lever brings out good insights into what to prioritize in governance efforts.
↑ comment by otto.barten (otto-barten) · 2025-01-03T01:42:58.612Z · LW(p) · GW(p)
I think that if government involvement suddenly increases, there will also be a window of opportunity to get an AI safety treaty passed. I feel a government-focused plan should include pushing for this.
(I think heightened public xrisk awareness is also likely in such a scenario, making the treaty more achievable. I also think heightened awareness in both govt and public will make short treaty timelines (a year to weeks), at least between the US and China, realistic.)
Our treaty proposal (a few other good ones exist): https://time.com/7171432/conditional-ai-safety-treaty-trump/
Also, I think end games should be made explicit: what are we going to do once we have aligned ASI? I think that's both true for Marius' plan, and for a government-focused plan with a Manhattan or CERN included in it.
↑ comment by MiloSal (milosal) · 2025-01-03T22:56:22.476Z · LW(p) · GW(p)
Akash, your comment raises the good point that a short-timelines plan that doesn't realize governments are a really important lever here is missing a lot of opportunities for safety. Another piece of the puzzle that comes out when you consider what governance measures we'd want to include in the short timelines plan is the "off-ramps problem" that's sort of touched on in this post [LW · GW].
Basically, our short timelines plan needs to also include measures (mostly governance/policy, though also technical) that get us to a desirable off-ramp from geopolitical tensions brought about by the economic and military transformation resulting from AGI/ASI.
I don't think there are good off-ramps that do not route through governments. This is one reason to include more government-focused outreach/measures in our plans.
comment by Seth Herd · 2025-01-02T18:58:13.842Z · LW(p) · GW(p)
Thanks for writing this. I very much agree that we should have more work on actual plans for short timelines - even if it's only guessing what vague plans orgs and governments are likely to create or follow by default.
Arguments for short timelines are speculative. So are arguments for longer timelines. It seems like the reasonable conclusion is that both are possible. It would seem urgent to have a plan for short timelines.
I also think that plan should be collaborative. So here's some of my contribution, framed as a suggestion for improvement, per your invitation:
This plan only briefly addresses actual alignment techniques. Plans should probably address it more carefully. Control only works until it fails, and CoT transparency and eval measures only work if we stop deploying capable, scheming models. As Buck argues [LW(p) · GW(p)], and you imply, we might well see powerful models that are obviously scheming, but see orgs (or government(s)) press on because of race dynamics.
So it seems like alignment is a key piece of the puzzle even by the time we reach human-researcher level. Human researchers have agency and can solve novel problems. Researcher systems may solve the novel problem "so what do my goals really imply"? Having less agency and flexible problem-solving seems like it would severely hamper capabilities, creating a large alignment tax that decision-makers will reject under race dynamics.[1]
Then, a more detailed but still sketchy proposal for an alignment plan for short timelines:
I agree that RLHF and RLAIF (and now Deliberative Alignment) are working better than expected. Perhaps they'll be adequate to get us to the point where your predictions stop. But I doubt they get us that far. At the least, this probably deserves a good bit more analysis. That's what I'm trying to do in my work: anticipate how alignment techniques will work (or not work) as LLMs are made smarter and more agentic. I summarize some likely alignment approaches for language model agents here [AF · GW], and am working on an updated list of techniques that are obvious and low-tax enough that they're likely to actually be used.
The alignment target those techniques aim at is also crucially important.
As you say:
For example, making models jailbreak-resistant could decrease their corrigibility. [...] I’m currently unsure which training techniques you would want to use for agentic long-horizon internal deployments, e.g. because jailbreaks might be less relevant but taking non-power-seeking actions is much more relevant.
I'm also unsure.
I think this issue is why agent foundations type thinkers usually don't have a plan for short timelines; they assume that in short timelines, we die. They think alignment will fail because we don't have a way to specify alignment targets precisely in the face of strong optimization. I agree with this but think adequate corrigibility is attainable and can avoid that problem.
We are building increasingly powerful AI with its own goals (like resisting jailbreaks and being "helpful, harmless, and honest" by some definition and training procedure). When those models become fully agentic and strongly optimize against those goals, we may find we didn't define them quite carefully enough. (I'd say strong optimization on the goals of current LLMs would probably doom us, but at the least it seems we should have a strong argument that it wouldn't).
I think those agent foundations thinkers incorrectly assume that corrigibility is anti-natural. That's only true when you try to employ it as an add-on to some other primary goal instead of as a singular alignment target [LW · GW] Thus I think the obvious decision is to preserve corrigibility by dropping the alignment goal of internal ethics (like refusing harmful information requests) and emphasizing strict instruction-following, to preserve corrigibility. By this argument, Instruction-following AGI is easier and more likely than value aligned AGI [LW · GW].
Current alignment techniques directed at that corrigibility/IF might well get us safely to AGI levels that can invent and improve alignment techniques, and thereby get us to aligned ASI.
But this leaves aside the pressure to actually deploy your AGI, and thereby make a ton of money. If you're selling its services to external users, you're going to want it to refuse harmful instructions from that lower tier of authorization. That reduces corrigibility if you do it through training, as you note. I don't know how to balance those goals; having the authorized superuser give instructions to refuse harmful requests is the obvious route, but it's hard to say if that's adequate in practice.
This also risks misuse when those systems are stolen or replicated. If we solve alignment, do we die anyway? [LW · GW] After that discussion, I think the answer is no, but only if some government or coalition of them wisely decide to limit AGI proliferation - and offset the resultant global outrage by sharing the massive benefits from AGI with everyone.
That addresses the other limitation of this plan - it only gets us to the first aligned AGI, without addressing what happens when other actors steal or reproduce it. I think a complete plan needs to include some way we don't die as a result of AGI proliferation to bad actors.
- ^
The claim that highly useful AI systems will be capable of solving novel problems, and that narrowing their scope or keeping them myopic will be a high alignment tax is load-bearing. I haven't produced a really satisfactory compact argument for this. The brief argument is that humans achieve our competence by using our general intelligence (which LLMs copy) to solve novel problems and learn new skills. The fastest route to AGI is to expand LLMs to do those things very similarly to how humans do them. Creating specialist systems requires more specialized work for each domain, and human debugging/engineering when systems can't solve the new problems that domain entails. More in my argument for short timelines [LW(p) · GW(p)] and in Capabilities and alignment of LLM cognitive architectures [LW · GW]
comment by Zach Stein-Perlman · 2025-01-02T22:15:51.703Z · LW(p) · GW(p)
Some people have posted ideas on what a reasonable plan to reduce AI risk for such timelines might look like (e.g. Sam Bowman’s checklist, or Holden Karnofsky’s list in his 2022 nearcast), but I find them insufficient for the magnitude of the stakes (to be clear, I don’t think these example lists were intended to be an extensive plan).
See also A Plan for Technical AI Safety with Current Science (Greenblatt 2023) for a detailed (but rough, out-of-date, and very high-context) plan.
comment by Rohin Shah (rohinmshah) · 2025-01-04T09:25:56.121Z · LW(p) · GW(p)
I broadly like the actual plan itself (obviously I would have some differences, but it is overall reasonably close to what I would imagine). However, it feels like there is an unwarranted amount of doom mentality here. To give one example:
What we need to achieve [...] The first AI that significantly speeds up alignment research isn’t successfully scheming [...]
The plan is divided into two layers, where the first layer seems absolutely required to me, i.e. any plan that doesn’t include these would very likely yield catastrophically bad results. [...]
Layer 1 [...] Everything in this section seems very important to me [...]
1. We should try hard to keep a paradigm with faithful and human-legible CoT
[...]
4. In both worlds, we should use the other, i.e. control/monitoring, as a second line of defense.
Suppose that for the first AI that speeds up alignment research, you kept a paradigm with faithful and human-legible CoT, and you monitored the reasoning for bad reasoning / actions, but you didn't do other kinds of control as a second line of defense. Taken literally, your words imply that this would very likely yield catastrophically bad results. I find it hard to see a consistent view that endorses this position, without also believing that your full plan would very likely yield catastrophically bad results.
(My view is that faithful + human-legible CoT along with monitoring for the first AI that speeds up alignment research, would very likely ensure that AI system isn't successfully scheming, achieving the goal you set out. Whether there are later catastrophically bad results is still uncertain and depends on what happens afterwards.)
This is the clearest example, but I feel this way about a lot of the rhetoric in this post. E.g. I don't think it's crazy to imagine that without SL4 you still get good outcomes even if just by luck, I don't think a minimal stable solution involves most of the world's compute going towards alignment research.
To be clear, it's quite plausible that we want to do the actions you suggest, because even if they aren't literally necessary, they can still reduce risk and that is valuable. I'm just objecting to the claim that if we didn't have any one of them then we very likely get catastrophically bad results.
Replies from: marius-hobbhahn↑ comment by Marius Hobbhahn (marius-hobbhahn) · 2025-01-04T10:10:57.295Z · LW(p) · GW(p)
That's fair. I think the more accurate way of phrasing this is not "we will get catastrophe" and more "it clearly exceeds the risk threshold I'm willing to take / I think humanity should clearly not take" which is significantly lower than 100% of catastrophe.
comment by JenniferRM · 2025-01-03T05:52:51.785Z · LW(p) · GW(p)
I got to this point and something in my head make a "thonk!" sound, and threw an error.
The default scenario I have in mind here is broadly the following: There is one or a small number of AGI endeavors, almost certainly in the US. This project is meaningfully protected by the US government and military both for physical and cyber security (perhaps not at the maximal level of protection, but it’s a clear priority for the US government). Their most advanced models are not accessible to the public.
The basic issue I have with this mental model is that Google, as an institution, is already better at digital security than the US Government, as an institution.
Long ago, the NSA was hacked and all its cool toys were stolen and recently it became clear that the Chinese Communist Party hacked US phones through backdoors that the US government put there.
By contrast, Google published The Interview (a parody of the monster at the top of the Un Dynasty) on its Google Play Store after the North Koreans hacked Sony for making it and threatened anyone who published it with reprisal. Everyone else wimped out. It wasn't going to be published at all, but Google said "bring it!"... and then North Korea presumably threw their script kiddies at Gog Ma Gog's datacenters and made no meaningful dent whatsoever (because there were no additional stories about it after that (which NK would have yelled about if they had anything to show for their attacks)).
Basically, Google is already outperforming state actors here.
Also, Google is already training big models and keeping them under wraps very securely.
Google has Chinese nationals on their internal security teams, inside of their "N-factor keyholders" as a flex.
They have already worked out how much it would cost the CCP to install someone among their employees and then threaten that installed person's families back in China with being tortured to death, to make the installed person help with a hack, and... it doesn't matter. That's what the N-factor setup fixes. The family members back in China are safe from the CCP psychopaths precisely because the threat is pointless, precisely because they planned for the threat and made it meaningless, because such planning is part of the inherent generalized adequacy and competence of Google's security engineering.
Also, Sergey Brin and Larry Page are both wildly more more morally virtuous than either Donald Trump or Kamala Harris or whichever new randomly evil liar becomes President in 2028 due to the dumpster fire of our First-Past-The-Post voting system. They might not be super popular, but they don't live on a daily diet of constantly lying to everyone they talk to, as all politicians inherently do.
The TRAGEDIES in my mind are this:
- The US Government, as a social system, is a moral dumpster fire of incompetence and lies and wasted money and never doing "what it says on the tin", but at least it gives lip service to moral ideals like "the consent of the governed as a formula for morally legitimately wielding power".
- The obviously superior systems that already exist do not give lip service to democratic liberal ideals in any meaningful sense. There's no way for me, or any other poor person on Earth, to "vote for a representative" with Google, or get "promises of respect of my human rights" from them (other than through court systems that are, themselves, dumpster fires (see again Tragedy #1)).
↑ comment by ryan_greenblatt · 2025-01-04T00:50:47.280Z · LW(p) · GW(p)
Also, Google is already training big models and keeping them under wraps very securely.
This seems straightforwardly false. The current status quo at Google as of earlier this year is <SL3 level security (not robust to normal insiders or to terrorists groups) based on what is said in their frontier model framework.
By contrast, Google published The Interview (a parody of the monster at the top of the Un Dynasty) on its Google Play Store after the North Koreans hacked Sony for making it and threatened anyone who published it with reprisal. Everyone else wimped out. It wasn't going to be published at all, but Google said "bring it!"... and then North Korea presumably threw their script kiddies at Gog Ma Gog's datacenters and made no meaningful dent whatsoever (because there were no additional stories about it after that (which NK would have yelled about if they had anything to show for their attacks)).
I don't think this is much evidence for the claim you say. Many parts of the US government are robust to North Korea and being robust to some North Korean operation is much easier than being robust to all chinese operations attacking a diverse range of targets.
Basically, Google is already outperforming state actors here.
I'm quite skeptical that Google is uniformly better than the goverment, though I agree parts of google are better than parts of government. It's hard to assign a net security competence rating.
comment by Nathan Helm-Burger (nathan-helm-burger) · 2025-01-02T18:38:02.181Z · LW(p) · GW(p)
Ok, I think you might be lumping this under "international governance", but I feel like you are severely underestimating the potential of open source AI. (Or off-brand secret projects based on open source AI). Right now, the big labs have a lead, and that lead is expected to grow temporarily in the short term since they have been scaling their compute to ranges out of reach to B-class players.
But... What happens when algorithmic innovation gets accelerated? It's already somewhat accelerated now. How secret are you going to manage to keep the findings about potential algorithmic improvements? It is a LOT easier for one untrustworthy employee to leak a concept about how to approach model architectures more efficiently than it is for them to steal the existing model weights.
Also, what do you do if existing open weights models at the time of international lockdowns on training runs are already sufficiently strong to be dangerous with only tweaks to the fine-tuning and RL protocols?
Is this so far outside your Overton window that you don't feel it's worth discussing?
Deepseek-V3 has 37B active params and 671 total params. Thats a ratio of about 18.14:1. What if someone made an MoE model using Llama 405B, with 405B active params and 7.4 trillion total params? What if a new pre-training run wasn't needed, just RL fine-tuning?
What about Llama 4, that is in progress? Will your international governance prevent that from being released?
Replies from: sharmake-farah↑ comment by Noosphere89 (sharmake-farah) · 2025-01-02T18:53:18.294Z · LW(p) · GW(p)
The realistic answer for decision making purposes (not epistemic purposes) is mostly hope that offense-defense balances are good enough to prevent the end of the world, combined with accepting somewhat larger risk of misalignment to prevent open source from being bad.
To be kind of blunt, any scenario where AI progress is algorithm dominated and a world where basically everyone can train superintelligence without nations being able to control it and a world where timelines are short is a world where governance of AI is more or less useless, and alignment becomes a little more dicey, so we should mostly ignore such worlds for utility purposes (they are valid for prediction/epistemic purposes).
Replies from: nathan-helm-burger↑ comment by Nathan Helm-Burger (nathan-helm-burger) · 2025-01-02T18:57:21.601Z · LW(p) · GW(p)
But I'm approx 80% confident that that's the world we're in! I don't want to just give up in 80% of cases!
I think we need to try to figure out some empirical measures for how much like this the world is, and at least come up with plans for what to do about it. I don't think it's hopeless, just that the plans would be rather differently shaped.
Replies from: sharmake-farah↑ comment by Noosphere89 (sharmake-farah) · 2025-01-02T19:15:05.811Z · LW(p) · GW(p)
I think we need to try to figure out some empirical measures for how much like this the world is, and at least come up with plans for what to do about it.
I basically agree with this.
A crux here is that I believe that AI governance is basically worthless if you cannot control the ability to create powerful AIs, because they will soon be able to create states within a state, and thus resist any governance of AI plan you might wish to implement.
In other words, you need something close to a monopoly on violence in order for a government to continue existing, and under scenarios where AIs get very powerful very quickly because of algorithms, there is no good way to control the distribution of AI power, and it's too easy for people to defect from governance on AI.
I'm not hopeless that such a world can survive. However, I do think AI governance breaks hard if we are in a future where AI power primarily is determined by algorithms, since I basically consider algorithmic advances mostly impossible to control.
I don't think it's hopeless, just that the plans would be rather differently shaped.
IMO, a plan for short timelines where algorithmic secrets are very powerful I would pivot hard towards technical safety/alignment, and would more or less abandon all but the most minimal governance/control plans, and the other big thing I'd focus on is on making the world's offense-defense balance much better than it is now, which means we'd have to invest more in making bio-tech less risky, with the caveat that we cannot assume the government will successfully suppress bio-threats from civilian actors.
comment by aysja · 2025-01-03T13:13:46.260Z · LW(p) · GW(p)
By faithful and human-legible, I mean that the model’s reasoning is done in a way that is directly understandable by humans and accurately reflects the reasons for the model’s actions.
Curious why you say this, in particular the bit about “accurately reflects the reasons for the model’s actions.” How do you know that? (My impression is that this sort of judgment is often based on ~common sense/folk reasoning, ie, that we judge this in a similar way to how we might judge whether another person took sensical actions—was the reasoning internally consistent, does it seem predictive of the outcome, etc.? Which does seem like some evidence to me, although not enough to say that it accurately reflects what's happening. But perhaps I'm missing something here).
Replies from: marius-hobbhahn↑ comment by Marius Hobbhahn (marius-hobbhahn) · 2025-01-03T15:05:16.891Z · LW(p) · GW(p)
I think this is a very important question and the answer should NOT be based on common-sense reasoning. My guess is that we could get evidence about the hidden reasoning capabilities of LLMs in a variety of ways both from theoretical considerations, e.g. a refined version of the two-hop curse or extensive black box experiments, e.g. comparing performance on evals with and without CoT, or with modified CoT that changes the logic (and thus tests whether the models internal reasoning aligns with the revealed reasoning).
These are all pretty basic thoughts and IMO we should invest significantly more effort into clarifying this as part of the "let's make sure CoT is faithful" part. A lot of safety strategies rest on CoT faithfulness, so we should not leave this to shallow investigations and vibes.
comment by Ryan Kidd (ryankidd44) · 2025-01-02T21:35:48.401Z · LW(p) · GW(p)
How would you operationalize a contest for short-timeline plans?
Replies from: marius-hobbhahn↑ comment by Marius Hobbhahn (marius-hobbhahn) · 2025-01-02T22:11:17.162Z · LW(p) · GW(p)
Something like the OpenPhil AI worldview contest: https://www.openphilanthropy.org/research/announcing-the-winners-of-the-2023-open-philanthropy-ai-worldviews-contest/
Or the ARC ELK prize: https://www.alignment.org/blog/prizes-for-elk-proposals/
In general, I wouldn't make it too complicated and accept some arbitrariness. There is a predetermined panel of e.g. 5 experts and e.g. 3 categories (feasibility, effectiveness, everything else). All submissions first get scored by 2 experts with a shallow judgment (e.g., 5-10 minutes). Maybe there is some "saving" mechanism if an overeager expert wants to read plans that weren't assigned to them. Everything in the top N% then gets scored by all experts with a more detailed review. Then, there is a final ranking.
I'd hope that the time spent per expert is only 5-10 hours in total. I'd be fine with missing a bunch of posts that contain good ideas that are badly communicated or otherwise easy to miss on the shallow review.
My main goal with the contest would be that writing a good plan and communicating it clearly is incentivized.
↑ comment by Ryan Kidd (ryankidd44) · 2025-01-02T22:46:03.664Z · LW(p) · GW(p)
I'm tempted to set this up with Manifund money. Could be a weekend project.
Replies from: marius-hobbhahn↑ comment by Marius Hobbhahn (marius-hobbhahn) · 2025-01-03T08:44:26.802Z · LW(p) · GW(p)
Go for it. I have some names in mind for potential experts. DM if you're interested.
Replies from: domdomegg↑ comment by Adam Jones (domdomegg) · 2025-01-03T11:32:43.217Z · LW(p) · GW(p)
At BlueDot we've been thinking about this a fair bit recently, and might be able to help here too. We have also thought a bit about criteria for good plans and the hurdles a plan needs to overcome, as well as have reviewed a lot of the existing literature on plans.
I've messaged you on Slack.
comment by Adam Jones (domdomegg) · 2025-01-03T12:52:19.004Z · LW(p) · GW(p)
Thank you for writing this. I've tried to summarize this article (missing good points made above, but might be useful to people deciding whether to read the full article):
Summary
AGI might be developed by 2027, but we lack clear plans for tackling misalignment risks. This post:
- calls for better short-timeline AI alignment plans
- lists promising interventions that could be stacked to reduce risks
This plan focuses on two minimum requirements:
- Secure model weights and algorithmic secrets
- Ensure the first AI capable of alignment research isn't scheming
Layer 1 interventions (essential):
- AI systems should maintain human-legible and faithful reasoning.
- If achieved, we should monitor this reasoning, particularly for scheming, power-seeking, and broad goal-directedness (using other models or simple probes).
- If not, we should fall back on control techniques that assume the model might be scheming.
- Evaluations support other strategies, and give us better awareness of model alignment and capabilities.
- Information and physical security protects model weights and algorithmic secrets.
Layer 2 interventions (important):
- Continue improving 'current' alignment methods like RLHF and RLAIF.
- Maintain research on interpretability, oversight, and "superalignment", and preparing to accelerate this work once we have human-level AI R&D.
- Increase transparency in AI companies' safety planning (internally, with experts, and publicly).
- Develop a safety-first culture in AI organizations.
This plan is meant as a starting point, and Marius encourages others to come up with better plans.
comment by Raphael Roche (raphael-roche) · 2025-01-02T17:49:34.113Z · LW(p) · GW(p)
Thank you for this publication. Just an idea (maybe stupid): GPT-type models are stochastic by design. This characteristic might be used as a way to control them. Your study with Apollo Research illustrates this well—there is a certain probability that a given model will deceive users under specific circumstances and to varying degrees. However, this remains fundamentally random, non-deterministic, and that property is quite interesting. There is always a chance that deception does not occur, is clumsy, or stops at an early stage.
For a single response, it is impossible to determine whether deception is present or not. But across a large set of responses, it seems almost impossible for a model, due to its very design, to maintain a solid and coherent deceptive behavior across multiple parallel outputs. Thus, a study like yours serves as a very effective tool to detect such behaviors and could remain effective for increasingly advanced models. It would simply require enlarging (perhaps significantly) the response sample.
Moreover, this could be implemented as an automated testing framework inspired by your paper. This would make it possible to continue detecting deceptive behaviors with arbitrary effectiveness, as long as the model remains stochastic. Once such behaviors are detected and measured using tests of this kind, an automated RL phase could follow, aiming to reduce the tendency toward deception to very low levels.
Replies from: dan-braun-1↑ comment by Dan Braun (dan-braun-1) · 2025-01-02T18:26:33.343Z · LW(p) · GW(p)
I think the concern here is twofold:
- Once a model is deceptive at one point, even if this happens stochastically, it may continue in its deception deterministically.
- We can't rely on future models being as stochastic w.r.t the things we care about, e.g. scheming behaviour.
Regarding 2, consider the trend towards determinicity we see for the probability that GPT-N will output a grammatically correct sentence. For GPT-1 this was low, and it has trended upwards towards determinicity with newer releases. We're seeing a similar trend for scheming behaviour (though hopefully we can buck this trend with alignment techniques).
Replies from: caleb-biddulph↑ comment by CBiddulph (caleb-biddulph) · 2025-01-02T19:41:48.190Z · LW(p) · GW(p)
Yep, LLMs are stochastic in the sense that there isn't literally a 100% probability of their outputs having any given property. But they could very well be effectively deterministic (e.g. there's plausibly a >99.999% probability that GPT-4's response to "What's the capital of France" includes the string "Paris")
Replies from: raphael-roche↑ comment by Raphael Roche (raphael-roche) · 2025-01-02T20:15:21.203Z · LW(p) · GW(p)
Yes, of course. Despite its stochastic nature, it is extraordinarily unlikely for an advanced LLM to respond with anything other than 2 + 2 = 4 or Paris for the capital of France. A stochastic phenomenon can, in practice, tend toward deterministic behavior. However, deception in a context such as the one discussed in Apollo Research's article is not really comparable to answering 2 + 2 = ?. What the article demonstrates is that we are dealing with tendencies, accompanied by considerable randomness, including in the intensity of the deception.
Assuming a more sophisticated model has roughly double the deception capability of model o1, it would be enough to increase the sample size of responses for the anomaly to become glaringly obvious. One could also imagine a more rigorous test involving even more complex situations. It does not seem inconceivable that such a procedure could, for years to come—and perhaps even at the stage of the first generations of AGI—identify deceptive behaviors and establish an RL procedure based on this test.
comment by Daniel Kokotajlo (daniel-kokotajlo) · 2025-01-05T00:10:27.908Z · LW(p) · GW(p)
be bad.
This might be a good spot to swap out "bad" for "catastrophic."
comment by otto.barten (otto-barten) · 2025-01-02T21:39:49.984Z · LW(p) · GW(p)
Thanks for writing the post, it was insightful to me.
"This model is largely used for alignment and other safety research, e.g. it would compress 100 years of human AI safety research into less than a year"
In your mind, what would be the best case outcome of such "alignment and other safety research"? What would it achieve?
I'm expecting something like "solve the alignment problem". I'm also expecting you to think this might mean that advanced AI would be intent-aligned, that is, it would try to do what a user wants it to do, while not taking over the world. Is that broadly correct?
If so, the biggest missing piece for me is to understand how this would help to avoid that someone else builds an unaligned AI somewhere else with sufficient capabilities to take over. DeepSeek released a model with roughly comparable capabilities nine weeks after OpenAI's o1, probably without stealing weights. It seems to me that you have about nine weeks to make sure others don't build an unsafe AI. What's your plan to achieve that and how would the alignment and other safety research help?
comment by Asher Brass (asher-brass) · 2025-01-02T20:13:35.825Z · LW(p) · GW(p)
Great post! I agree with a lot of the reasoning and am also quite worried about insufficient preparedness for short timelines.
On security, you say:
Model weights (and IP) are secure: By this, I mean SL4 or higher...
I think it's worth explicitly stating that if AI companies only manage to achieve SL4, we should expect OC5 actors to successfully steal model weights, conditional on them deciding it's a top-level priority for them to do so.
However, this implication doesn't really jive with the rest of your comments regarding security and the amount of frontier actors. It seems to me that a lot of pretty reasonable plans or plan-shaped-objects, yours included, rely to an extent on the US maintaining a lead and doing responsible things, in which case SL4 would not be good enough.
It also includes securing algorithmic secrets to prevent malicious actors from catching up on their own (this likely requires a somewhat different set of interventions than securing model weights).
Yes! And training data would of course need to be secured as well. Otherwise poisoning or malicious modification attacks seem to me a bit scarier than simple model weight theft.
It is likely impossible to implement everything that would be suggested in the SL5 recommendations of the “securing model weights” report within the next 2-3 years. Thus, triaging seems necessary. I could also imagine that a lot of the security strategies that are important right around the point when an AI company automates their AI R&D have a few years of development time and might not feel immediately required right now, so they will only exist if an AI developer is acting now with a lot of foresight.
As mentioned, settling for sub-SL5 seems to me like asking for too little. AI companies almost certainly could not achieve SL5 on their own within 2-3 years, but that seems like it begs the question: "well then how do we get SL5 despite that"? Otherwise, whatever the plan is, it must survive the fact that AI companies are not secure to OC5 actors and will not achieve that security "in time".
One way to do this would be heavy governmental investment and assistance, or maybe convince the competing OC5s to join an international project (though this would violate one of your conservative assumptions for safety progress I think?). Another way to do it might be to rely on, as you gesture at, favorable differential access to frontier models - I'm personally pretty skeptical of the latter though, and I also just don't see AI companies dogfooding nearly hard enough for this to even have a chance.