What AI companies should do: Some rough ideas

post by Zach Stein-Perlman · 2024-10-21T14:00:10.412Z · LW · GW · 3 comments

This post is incomplete. I'm publishing it because it might be helpful for some readers anyway. A good version of this post would be more detailed: for each proposed action, explain motivation and high-level goal, explain problem to solve, respond to obvious questions/objections, and lay out costs involved and costs worth accepting.

These are the actions I think companies that will build very powerful AI systems should prioritize to improve safety. I think something like these actions is cost-effective,[1] but specific details may be suboptimal. Some actions are not fully specified: they are of the form "do good things in area X" or "avoid risk Y," without pinning down what to do. This plan is not certain to remain correct in the future; I just think it's a good plan for now. Some of these actions are costly, but I claim that they are cheap enough and important enough that a responsible frontier AI company should do them unilaterally: the gain in safety is worth the cost in competitiveness. I mostly don't justify these actions in this post.

(This post is about responsible labs — asks for non-responsible labs should be more focused on transparency and becoming responsible. This post is about unilateral action — in coordinated action you can accept more costs.)


Security for model weights and code: extremely strong security will be crucial when AIs can massively accelerate AI R&D. But operating under extreme security (security sufficient to defend against top-priority operations by top state actors) is costly and not yet necessary. So labs should focus on achieving extreme security optionality (the ability to implement extreme security quickly[2]) before it would be very bad for AI model weights to be stolen.[3] It seems such AI could plausibly appear as early as 2027; if so, labs should plan to achieve extreme security optionality before 2027.[4] The lab should develop a roadmap to achieve extreme security optionality, publicly explain and justify its high-level security goals, and maybe share some details and publish material that validates its claims (e.g. a pentester's summary of their pentest results), with sensitive details redacted.[5] It is also reasonable to aim for a weaker level of security, SL5, for now, but that will probably be insufficient when AIs can massively accelerate research.


Control: anticipate and prevent the risk that AIs will scheme (i.e., act safe and subtly subvert safety measures to gain power) and escape during internal deployment.[6] The lab should do evals for scheming-relevant capabilities, and by the time models can plausibly scheme coherently, be ready to ensure control, e.g.:


Misuse: anticipate and prevent harm from misuse during external deployment. Do good model evals for dangerous capabilities: do evals pre-deployment (to inform deployment decisions), have good threat models and categories of evals, use high-quality evals, use good elicitation, publish evals and results, and share good access with third-party evaluators. Once models might have dangerous capabilities, deploy such that users can't access those capabilities: deploy via API or similar rather than releasing weights, monitor queries to block suspicious messages and identify suspicious users, and maybe do harmlessness/refusal training and adversarial robustness work. If a particular deployment plan incurs a substantial risk of catastrophic misuse, deploy more narrowly or cautiously.


Accountability for future deployment decisions: plan that by the time AIs are capable of tripling the rate of R&D, the lab will do something like this:[7]

Do great risk assessment, especially via model evals for dangerous capabilities. Present risk assessment results and deployment plans to external auditors — perhaps make risk cases and safety cases for each planned deployment, based on threat modeling, model eval results, and planned safety measures. Give auditors more information upon request. Have auditors publish summaries of the situation, how good a job you're doing, the main issues, their uncertainties, and an overall risk estimate.


Labs should have a clear, good plan for what to do when they build AGI, outside of just making the AGI safe.[8] For example, one plan is "perfectly solve AI safety, then hand things off to some democratic process." This plan is bad because solving AI safety will likely take years after systems are transformative and superintelligence is attainable, and by default, during that time, someone will build superintelligence without strong reason to believe they can do so safely, or will do something else unacceptable.

The outline of the best plan I've heard is: build human-obsoleting AIs which are sufficiently aligned/trustworthy that we can safely defer[9] to them (before building wildly superintelligent AI). Assume it will take 5-10 years after AGI (i.e., AI that can obsolete top human scientists) to build such systems and give them sufficient time. To buy time (or: avoid being rushed by other AI projects[10]), inform the US government and convince it to enforce that nobody builds wildly superintelligent AI for a while (and likely to limit AGI weights to allied projects with excellent security and control). Use AGI to increase the government's willingness to pay for nonproliferation and to decrease the cost of nonproliferation (build carrots to decrease the cost to others of accepting nonproliferation, and build sticks to decrease the cost to the US of enforcing nonproliferation).[11]

For now, the lab should develop and prepare to implement such a plan. It should also publicly articulate the basic plan, unless doing so would be costly or undermine the plan.


Safety research: labs should do and share safety research as a public good, to help make powerful AI safer even if it's developed by another lab.[12] They should also boost external safety research, especially by giving safety researchers deeper access[13] to externally deployed models.


Policy: inform policymakers about AI capabilities; when there is better empirical evidence on misalignment and scheming risk, inform policymakers about that; don't get in the way of good policy.


Labs should encourage staff to discuss their views on AI progress, risks, and safety, both internally and publicly (except for confidential information), or at least avoid discouraging staff from doing so. (Facilitating external whistleblowing has benefits but some forms of it are costly, and the benefits are smaller for more responsible labs, so I don't confidently recommend unilateral action for responsible labs.)


Public statements: the lab and its leadership should understand and talk about extreme risks. The lab should clearly describe a worst-case plausible outcome from AI and state its credence in such outcomes (perhaps unconditionally, conditional on no major government action, and/or conditional on leading labs being unable or unwilling to pay a large alignment tax).


[meta] Plan safety interventions as a function of dangerous capabilities, warning signs, etc. Rather than acting based on guesses about which risks will appear, set practices in advance as a function of warning signs. (This should satisfy both risk-worried people who expect warning signs and risk-unworried people who don't.)
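
To make "practices as a function of warning signs" concrete, here is a minimal sketch in Python. Everything in it is a hypothetical illustration rather than a recommendation: the trigger names, the thresholds, the measurement scales, and the practices attached to each trigger are all assumptions invented for the example. The point is only the shape: the mapping is written down in advance, and observations (not guesses) determine what is owed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trigger:
    """A predefined warning sign and the practices it commits the lab to."""
    name: str
    threshold: float            # hypothetical measurement at which the trigger fires
    practices: tuple[str, ...]  # practices owed once the trigger fires


# Hypothetical triggers: names, thresholds, and practices are illustrative only.
TRIGGERS = (
    Trigger("ai_rd_speedup", 3.0,
            ("extreme security", "external audit of deployment plans")),
    Trigger("scheming_capability", 0.5,
            ("control measures for internal deployment",)),
    Trigger("misuse_uplift", 0.2,
            ("API-only deployment", "query monitoring")),
)


def required_practices(observations: dict[str, float]) -> list[str]:
    """Return the practices owed given observed warning-sign measurements.

    A missing measurement is treated as "not fired" here; a real policy would
    need to decide explicitly how to handle unmeasured warning signs.
    """
    owed: list[str] = []
    for trigger in TRIGGERS:
        value = observations.get(trigger.name)
        if value is not None and value >= trigger.threshold:
            for practice in trigger.practices:
                if practice not in owed:  # keep order, drop duplicates
                    owed.append(practice)
    return owed
```

With no warning signs observed, `required_practices({})` returns an empty list, so someone who expects no warning signs incurs no extra cost under this scheme; someone who expects them can see in advance exactly which practices each sign would trigger.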


I list some less important or more speculative actions (very nonexhaustive) in this footnote.[14] See The Checklist: What Succeeding at AI Safety Will Involve (Bowman 2024) for a collection of goals for a lab. See A Plan for Technical AI Safety with Current Science (Greenblatt 2023) for a more detailed but rough and out-of-date plan. See Towards best practices in AGI safety and governance: A survey of expert opinion (Schuett et al. 2023) for an old list.


Almost none of this post is original. Thanks to Ryan Greenblatt for inspiring much of it, and thanks to Drake Thomas, Rohin Shah, Alexandra Bates, and Gabriel Mukobi for suggestions. They don't necessarily endorse this post.


Crossposted from AI Lab Watch.

  1. ^

    With two exceptions, where I feel unconfident: misuse prevention (misuse has positive second-order effects) and accountability for risk assessment (its causal benefit may be small if done unilaterally).

  2. ^

    Extreme security optionality is the ability to implement extreme security quickly, e.g. within 3 months. Extreme security should also be implementable at a reasonable maximum cost, e.g. $10B to implement plus $10B per year and a 75% slowdown in research. The property "the lab could implement extreme security, but doing so would require basically stopping research or incurring other prohibitive costs" doesn't suffice.

    I focus on extreme security for model weights and code, not necessarily for secrets. Securing weights will be more important; it seems easier; and extreme security for secrets entails that staff are much less able to talk to people outside the lab, which could hurt their ability to think clearly.

    Labs will likely need substantially better security than "defend against top-priority operations by top state actors" soon after AIs can massively accelerate AI R&D: model weights will be among the most valuable and strategically important assets ever, so labs should be prepared for "unprecedentedly well-resourced attacks," not to mention that AI will boost offense.

  3. ^

    Proliferation of model weights would lead to others quickly getting more powerful AI, since the AI can massively boost AI R&D. This removes the option of coordination that requires nonproliferation, enables misuse and state conflict, makes it hard for US labs to gain a substantial lead, and makes it impossible for US labs to unilaterally slow progress while retaining a lead.

  4. ^

    By the time extreme security is necessary, AIs will be able to substantially boost security. So for now, labs should focus on the things that will be hard for AI to accelerate: things that require nonparallelizable physical work. For example, labs should probably focus on physically constructing or retrofitting datacenters such that they can be extremely secure.

    If it becomes clear that the lab will achieve extreme security optionality before it's necessary, the lab could choose to revise the plan and slow its progress toward extreme security.

  5. ^

    Would an if-then security commitment—don't create a very dangerous model until its weights would be secure—suffice? Sure, but I worry that extreme security takes substantial time to implement (in part because it likely requires physical changes to datacenters) and the labs are not on track to implement it before they need it, so a lab will either proceed unsafely or pause while the less scrupulous race ahead. Regardless of whether a lab has such a commitment, I think it should start sprinting for extreme security optionality now.

  6. ^

    Maybe a large fraction of AI risk "comes from the lab using AIs internally to do AI development (by which I mean both research and engineering). This is because the AIs doing AI development naturally require access to compute and model weights that they can potentially leverage into causing catastrophic outcomes—in particular, those resources can be abused to run AIs unmonitored."

    Escape means "either escaping from the datacenter or getting root access to it." A central story for AI escaping goes like this:

    At some point when the AI is quite powerful and useful but not totally intellectually self-sufficient yet, the AI makes a code change to some code used on the datacenter where it is running that adds a backdoor, allowing it to get root access to all the computers in the datacenter. Once it has root access, it modifies the SSH server so that whenever any humans log into the datacenter, the AI is able to intercept any of their commands or the data that they see. The AI gets increasingly powerful over the next month, as the human ML researchers with AI assistance continue to make progress (unaware that some things that they're seeing are faked) and various training runs/self-improvement loops continue. After a month the AI is powerful enough that it decides it's worth it to take overtly aggressive actions to ensure the physical security of its data centers and allow it to build lots of factories.

  7. ^

    This is partially a substitute for responsible scaling policies. Labs should still make at least high-level RSP-y (or "if-then") commitments too.

  8. ^

    A plan could include how to use AGI, what further AI development to do, and what the exit state is if all goes well. A plan should consider not just what the lab will do but also what will be happening with the US government, other labs, and other states.

    Really the goal is a playbook, not just a plan — it should address what happens if various assumptions aren't met or things are different than expected, should include ideas that are worth exploring/trying but not depending on, etc.

  9. ^

    We don't need these human-obsoleting AIs to be able to implement CEV. We want to be able to defer to them on tricky wisdom-loaded questions like "what should we do about the overall AI situation?" They can ask us questions as needed.

  10. ^

    To avoid being rushed by your own AI project, you also have to ensure that your AI can't be stolen and can't escape, so you have to implement excellent security and control.

  11. ^

    This plan, and desire for this kind of plan, is largely stolen from Ryan Greenblatt. He doesn't necessarily endorse my version.

    Dario Amodei recently said something related:

    it seems very important that democracies have the upper hand on the world stage when powerful AI is created. . . . My current guess at the best way to do this is via an “entente strategy”, in which a coalition of democracies seeks to gain a clear advantage (even just a temporary one) on powerful AI by securing its supply chain, scaling quickly, and blocking or delaying adversaries’ access to key resources like chips and semiconductor equipment. This coalition would on one hand use AI to achieve robust military superiority (the stick) while at the same time offering to distribute the benefits of powerful AI (the carrot) to a wider and wider group of countries in exchange for supporting the coalition’s strategy to promote democracy (this would be a bit analogous to “Atoms for Peace”). The coalition would aim to gain the support of more and more of the world, isolating our worst adversaries and eventually putting them in a position where they are better off taking the same bargain as the rest of the world: give up competing with democracies in order to receive all the benefits and not fight a superior foe.

    (Among other things, this does not address the likely possibility that the democracies should delay building wildly superintelligent AI.)

  12. ^

    This is slightly complicated because (1) some safety research boosts capabilities and (2) some safety practices work better if their existence (in the case of control) or details (in the case of adversarial robustness) are secret. In some cases you can share with other labs to get most of the benefits and little of the cost. Regardless, lots of safety research is good to publish. And publishing model evals is good for some of the same reasons plus transparency and accountability.

  13. ^

    E.g. fine-tuning, helpful-only, filters/moderation off, logprobs, activations.

  14. ^
    • RSP details (but it's not currently possible to do all of this very well)
      • See generally Key Components of an RSP
      • Do evals regularly
      • And have predefined high-level capability thresholds
      • And have predefined low-level operationalizations
      • And publish evals and thresholds, or have an auditor comment
      • And have thresholds trigger specific responses
      • And all these details should be good
      • And prepare for a pause (to make it less costly, in case it's necessary)
      • And accountability/transparency/meta stuff: publish updates or be transparent to an auditor, have a decent update process, elicit external review, perhaps adopt a mechanism for staff to flag false statements
    • Elicitation details (for model evals for dangerous capabilities): see the summary in my blogpost (or my table)
    • Don't publish capabilities research, modulo Thoughts on sharing information about language model capabilities (Christiano 2023): don't publish stuff that boosts black-box capabilities (especially stuff on training) unless it improves safety
    • Internal governance
      • Prepare for difficult decisions ahead
      • Have structure, mandate, and incentives set up such that the lab can prioritize safety in key decisions.
      • Plan what the lab and its staff would do if it needed to pause for safety
      • Have a good process for staff to escalate concerns about safety internally
      • Don't discourage certain kinds of whistleblowing — details unclear
      • Don't use nondisparagement provisions in contracts or severance agreements
    • Object-level security practices
    • Various specific safety techniques (potentially high-priority but I just didn't focus this post on them) (not necessarily currently existing) (very nonexhaustive)

3 comments

comment by Zach Stein-Perlman · 2024-10-21T14:00:44.284Z · LW(p) · GW(p)

I may edit this post in a week based on feedback. You can suggest changes that don't merit their own LW comments here. That said, I'm not sure what feedback I want.

comment by Nathan Helm-Burger (nathan-helm-burger) · 2024-10-21T21:05:55.176Z · LW(p) · GW(p)

"share good access with third-party evaluators"

Glad to see my personal hobbyhorse got a mention!

Overall this is great, and I have just one concern about a possible what-if case that isn't covered.

Model development secrets being disproportionately important.

I know, it's a really tricky thing to deal with if this becomes the case. But I think it's worth having a 'what if' plan in place in case this suddenly happens. Imagine a lab is working on their research and some researcher stumbles across an algorithmic innovation which makes a million-fold improvement in training efficiency. The peak capability level of the model that can now be trained for a few thousand dollars is now on par with their leading multi-billion dollar frontier model. Furthermore, this peak capability level is sufficient for the 3x speedup in AI R&D that you specified as being a threshold of high concern. Under such a circumstance, this secret becomes even more dangerous and valuable than the multi-billion-dollar-training-cost frontier model weights. Seems like having a plan in place for what if this happens would be wise.

Something like: don't tell your co-workers, report to so-and-so specific person whose job it is to handle evaluating and reporting up-the-chain about possible dangerous algorithmic developments. Probably you'd need a witness protection / isolation setup for the researchers who'd been exposed to the secret. You'd need to start planning around how soon others might stumble onto the same discovery, what other similar discoveries might be out there to be found, what government actions should be taken, etc.

I know this sounds like an implausible scenario to most people currently, but I think it's not something ruled out as physically impossible, and is worth having a what-if plan for. 

comment by Zach Stein-Perlman · 2024-10-21T20:15:12.685Z · LW(p) · GW(p)

A related/overlapping thing I should collect: basic low-hanging fruit for safety. E.g.:

  • Have a high-level capability threshold at which securing model weights is very important
  • Evaluate whether your models are capable at agentic tasks; look at the score before releasing or internally deploying them