Posts

DeepSeek beats o1-preview on math, ties on coding; will release weights 2024-11-20T23:50:26.597Z
Anthropic: Three Sketches of ASL-4 Safety Case Components 2024-11-06T16:00:06.940Z
The current state of RSPs 2024-11-04T16:00:42.630Z
Miles Brundage: Finding Ways to Credibly Signal the Benignness of AI Development and Deployment is an Urgent Priority 2024-10-28T17:00:18.660Z
UK AISI: Early lessons from evaluating frontier AI systems 2024-10-25T19:00:21.689Z
Lab governance reading list 2024-10-25T18:00:28.346Z
IAPS: Mapping Technical Safety Research at AI Companies 2024-10-24T20:30:41.159Z
What AI companies should do: Some rough ideas 2024-10-21T14:00:10.412Z
Anthropic rewrote its RSP 2024-10-15T14:25:12.518Z
Model evals for dangerous capabilities 2024-09-23T11:00:00.866Z
OpenAI o1 2024-09-12T17:30:31.958Z
Demis Hassabis — Google DeepMind: The Podcast 2024-08-16T00:00:04.712Z
GPT-4o System Card 2024-08-08T20:30:52.633Z
AI labs can boost external safety research 2024-07-31T19:30:16.207Z
Safety consultations for AI lab employees 2024-07-27T15:00:27.276Z
New page: Integrity 2024-07-10T15:00:41.050Z
Claude 3.5 Sonnet 2024-06-20T18:00:35.443Z
Anthropic's Certificate of Incorporation 2024-06-12T13:00:30.806Z
Companies' safety plans neglect risks from scheming AI 2024-06-03T15:00:20.236Z
AI companies' commitments 2024-05-29T11:00:31.339Z
Maybe Anthropic's Long-Term Benefit Trust is powerless 2024-05-27T13:00:47.991Z
AI companies aren't really using external evaluators 2024-05-24T16:01:21.184Z
New voluntary commitments (AI Seoul Summit) 2024-05-21T11:00:41.794Z
DeepMind's "Frontier Safety Framework" is weak and unambitious 2024-05-18T03:00:13.541Z
DeepMind: Frontier Safety Framework 2024-05-17T17:30:02.504Z
Ilya Sutskever and Jan Leike resign from OpenAI [updated] 2024-05-15T00:45:02.436Z
Questions for labs 2024-04-30T22:15:55.362Z
Introducing AI Lab Watch 2024-04-30T17:00:12.652Z
Staged release 2024-04-17T16:00:19.402Z
DeepMind: Evaluating Frontier Models for Dangerous Capabilities 2024-03-21T03:00:31.599Z
OpenAI: Preparedness framework 2023-12-18T18:30:10.153Z
Anthropic, Google, Microsoft & OpenAI announce Executive Director of the Frontier Model Forum & over $10 million for a new AI Safety Fund 2023-10-25T15:20:52.765Z
OpenAI-Microsoft partnership 2023-10-03T20:01:44.795Z
Current AI safety techniques? 2023-10-03T19:30:54.481Z
ARC Evals: Responsible Scaling Policies 2023-09-28T04:30:37.140Z
How to think about slowing AI 2023-09-17T16:00:42.150Z
Cruxes for overhang 2023-09-14T17:00:56.609Z
Cruxes on US lead for some domestic AI regulation 2023-09-10T18:00:06.959Z
Which paths to powerful AI should be boosted? 2023-08-23T16:00:00.790Z
Which possible AI systems are relatively safe? 2023-08-21T17:00:27.582Z
AI labs' requests for input 2023-08-18T17:00:26.377Z
Boxing 2023-08-02T23:38:36.119Z
Frontier Model Forum 2023-07-26T14:30:02.018Z
My favorite AI governance research this year so far 2023-07-23T16:30:00.558Z
Incident reporting for AI safety 2023-07-19T17:00:57.429Z
Frontier AI Regulation 2023-07-10T14:30:06.366Z
AI labs' statements on governance 2023-07-04T16:30:01.624Z
DeepMind: Model evaluation for extreme risks 2023-05-25T03:00:00.915Z
GovAI: Towards best practices in AGI safety and governance: A survey of expert opinion 2023-05-15T01:42:41.012Z
Stopping dangerous AI: Ideal US behavior 2023-05-09T21:00:55.187Z

Comments

Comment by Zach Stein-Perlman on DeepSeek beats o1-preview on math, ties on coding; will release weights · 2024-11-21T02:40:10.847Z · LW · GW

Claim by SemiAnalysis guy: DeepSeek has over 50,000 H100s. (And presumably they don't spend much compute on inference.)

Comment by Zach Stein-Perlman on DeepSeek beats o1-preview on math, ties on coding; will release weights · 2024-11-21T02:40:02.374Z · LW · GW

Some context: June FT article on DeepSeek; parent hedge fund Wikipedia page.

Comment by Zach Stein-Perlman on Why imperfect adversarial robustness doesn't doom AI control · 2024-11-18T22:35:05.522Z · LW · GW

I want to distinguish two versions of the intervention (in multi-step settings): (1) hide CoT on the suspicious step vs (2) hide all prior CoT (and hide all CoT partway through some fraction, say 5%, of non-suspicious trajectories so the model can't infer from the CoT being hidden that the trajectory was flagged, or do a control eval with results that demonstrate this is unnecessary).

I'm over 75% that a lab would do (1), conditional on doing a "replace suspicious outputs" thing (but under 75% they'd say they're doing so, much less demonstrate it to a skeptical audience; they might just not publish details, especially if this protocol is just in the internal deployment context). I'm under 75% that a lab would do (2).
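
A minimal toy sketch of the distinction, assuming a step-by-step agent loop with a trusted monitor; the functions, threshold, and decoy rate below are hypothetical placeholders, not any lab's actual protocol:

```python
import random

SUSPICION_THRESHOLD = 0.9   # hypothetical cutoff for "suspicious"
DECOY_HIDE_RATE = 0.05      # variant 2: hide CoT in a small fraction of non-suspicious trajectories

def untrusted_step(task, visible_history):
    """Placeholder for the untrusted model producing (chain_of_thought, action)."""
    return f"cot for {task}", f"action for {task}"

def monitor_suspicion(action):
    """Placeholder trusted monitor; returns a suspicion score in [0, 1]."""
    return random.random()

def run_trajectory(tasks, variant):
    history = []  # (cot, action) pairs the untrusted model sees on later steps
    for task in tasks:
        cot, action = untrusted_step(task, history)
        if monitor_suspicion(action) > SUSPICION_THRESHOLD:
            if variant == 2:
                # Variant 2: hide all prior CoT before resampling, so the model
                # can't read earlier reasoning when producing the replacement.
                history = [(None, a) for _, a in history]
            # Both variants: resample without showing the suspicious step's CoT.
            cot, action = untrusted_step(task, history)
        elif variant == 2 and random.random() < DECOY_HIDE_RATE:
            # Variant 2 also hides CoT partway through a small fraction of
            # non-suspicious trajectories, so hidden CoT alone doesn't reveal
            # that a trajectory was flagged.
            history = [(None, a) for _, a in history]
        history.append((cot, action))
    return history

if __name__ == "__main__":
    print(run_trajectory(["plan", "write code", "submit PR"], variant=2))
```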

Comment by Zach Stein-Perlman on Why imperfect adversarial robustness doesn't doom AI control · 2024-11-18T22:02:27.805Z · LW · GW

It would be crazy to show the untrusted model the old CoT when resampling.

Comment by Zach Stein-Perlman on A Qualitative Case for LTFF: Filling Critical Ecosystem Gaps · 2024-11-18T05:02:55.099Z · LW · GW

Open Philanthropy's AI safety work tends toward large grants in the high hundreds of thousands or low millions, meaning individuals and organizations with lower funding needs won't be funded by them

This is directionally true but projects seeking less money should still apply to OP if relevant; a substantial minority of recent OP AI safety and GCR capacity building grants are <$100K.

Comment by Zach Stein-Perlman on Buck's Shortform · 2024-11-14T18:50:12.813Z · LW · GW

What's a bad thing that the agent can do with the jailbreak that it can't just do in the unmonitored scaffold?

Edit: dumb question, oops: obviously the schemer will have affordances that it doesn't have within the unmonitored scaffold, and there are actions where it's helpful to jailbreak the monitor. E.g. submitting a pull request, where the code will be read by the monitor. So finding a jailbreak is helpful to the schemer.

Comment by Zach Stein-Perlman on Buck's Shortform · 2024-11-14T18:24:46.534Z · LW · GW

Why can't the schemer just do the bad stuff in the unmonitored scaffold from (1)?

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-11-09T04:37:19.378Z · LW · GW

We can make scheming legible at least to sophisticated-scheming-skeptics like Sam Altman and Dario.

Especially if the AI is even sandbagging on simple coding when it thinks it's for safety research. And if it's not doing that, we can get some useful safety work out of it.

@Adam Kaufman @Tyler Tracy @David Matolcsi see Ryan's comments.

Comment by Zach Stein-Perlman on evhub's Shortform · 2024-11-09T04:13:46.853Z · LW · GW

I mostly agree. But I think some kinds of autonomous weapons would make loss-of-control and coups easier. On the other hand, boosting US security is good, so the net effect is unclear. And that's very far from the recent news (and Anthropic has a Usage Policy, with exceptions, which disallows various uses — my guess is this is too strong on weapons).

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-11-09T03:00:38.529Z · LW · GW

I just claim that if so it's easy to catch. (And then (1) catching it is nice for various reasons and (2) maybe countermeasures work.)

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-11-08T20:30:10.031Z · LW · GW

I brainstormed sabotage-related threats with friends:

  • Sandbag on a few safety research tasks, especially techniques to catch scheming and stuff that's high-level or has poor feedback loops
    • Plausible; maybe not so bad
  • Sandbag on ~all safety research tasks
    • Easy to catch; catching obvious sandbagging has safety benefits + maybe we can posttrain it away
  • Sandbag/lie when validating that a model/deployment/whatever is safe
    • idk
  • When automating AI R&D, backdoor the model
    • (this is takeover-attempt-y, not sandbagging-y)
    • idk
  • When deployed to automate/improve security (especially for the model's own weights): both sandbagging and code-backdooring/hacking/etc.
    • idk
  • Sandbagging on AI R&D to help another lab win the race (!?)
    • Implausible
  • ??

Are people quite worried about safety research sabotage? Why? Do they think a (say) 20% hit to automated safety research is a huge deal, or do they expect a larger hit?

Comment by Zach Stein-Perlman on Anthropic: Three Sketches of ASL-4 Safety Case Components · 2024-11-07T17:10:19.422Z · LW · GW

See [V1.1]

Comment by Zach Stein-Perlman on Anthropic: Three Sketches of ASL-4 Safety Case Components · 2024-11-06T16:45:07.175Z · LW · GW

When someone describes a safety case where one of the steps is infeasible, here are two possible reactions:

  1. Step X will be ~impossible to justify; hopefully the lab will understand that (if nothing else then when they try to justify step X) and look for a different kind of safety case
  2. Step X will be ~impossible to justify, but the authors seem to think the plan requiring it is decent; this suggests that they don't understand the issue, and when they make a safety case, they'll likely think they're justifying step X when they're not

Regardless, if you think a step is very hard or the authors seem confused, ideally you should explain why. And then hopefully the authors say "yeah oops that's too hard" or "yeah we're imagining doing the really hard thing" or "we don't think it will be so hard: we agree observation O would suffice and we expect it's easy and if it turns out to be hard we'll have to make a different safety case" or "we disagree, here's why." (I'm mostly observing this independently of this post.)

[Edited]

Comment by Zach Stein-Perlman on Anthropic: Three Sketches of ASL-4 Safety Case Components · 2024-11-06T16:00:25.325Z · LW · GW

My initial reactions on a quick read:

  1. Interp: no inside view; my impression is that my friends are pessimistic about such safety cases in the next few years
  2. Control: hooray for Anthropic finally writing something about control; some details seem good; some details seem odd ("contexts that were specifically whitelisted"); I think the naive pre-deployment honeypots path doesn't work because a scheming model wins with the defect-rarely strategy (related: https://www.lesswrong.com/posts/vYWDEtJ2b8tkbG7Rv/behavioral-red-teaming-is-unlikely-to-produce-clear-strong; see the sketch after this list) [very sophisticated or expensive honeypotting setups could work, but the post doesn't seem to be imagining that or noticing the issue]; that's important, and the current language in the post is concerning
  3. Incentives: I don't really understand, or I'm dubious, or I expect that A3.1 is very hard to demonstrate
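
To spell out the defect-rarely point from item 2: a rough illustration, with made-up numbers rather than estimates, of why a small per-opportunity defection probability slips past a limited honeypot budget while still showing up over a long deployment:

```python
# Rough illustration of why naive pre-deployment honeypots fail against a
# model that defects rarely. Numbers below are illustrative, not estimates.
p_defect = 1e-4          # per-opportunity defection probability chosen by the schemer
n_honeypots = 1_000      # honeypot episodes run before deployment
n_deployment = 1_000_000 # real opportunities during deployment

p_caught_pre_deployment = 1 - (1 - p_defect) ** n_honeypots
p_defects_in_deployment = 1 - (1 - p_defect) ** n_deployment

print(f"P(caught by honeypots) ~ {p_caught_pre_deployment:.2f}")               # ~0.10
print(f"P(defects at least once in deployment) ~ {p_defects_in_deployment:.2f}")  # ~1.00
```

By picking a small enough defection rate, the schemer keeps the first probability low while the second stays near 1 whenever deployment offers far more opportunities than the honeypot phase.
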
Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-10-31T21:00:53.706Z · LW · GW

Anthropic: The case for targeted regulation.

I like the "Urgency" section.

Anthropic says the US government should "require companies to have and publish RSP-like policies" and "incentivize companies to develop RSPs that are effective" and do related things. I worry that RSPs will be ineffective for most companies, even given some transparency/accountability and vague minimum standard.

Edit: some strong versions of "incentivize companies to develop RSPs that are effective" would be great; I wish Anthropic advocated for them specifically rather than hoping the US government figures out a great answer.

Comment by Zach Stein-Perlman on AI #88: Thanks for the Memos · 2024-10-31T18:53:14.628Z · LW · GW

After Miles Brundage left to do non-profit work, OpenAI disbanded the “AGI Readiness” team he had been leading, after previously disbanding the Superalignment team and reassigning the head of the preparedness team. I do worry both about what this implies, and that Miles Brundage may have made a mistake leaving given his position.

Do we know that this is the causal story? I think "OpenAI decided to disempower/disband/ignore Readiness, and so Miles left" is also likely.

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-10-30T19:00:59.774Z · LW · GW

Some not-super-ambitious asks for labs (in progress):

  • Do evals on dangerous-capabilities-y and agency-y tasks; look at the score before releasing or internally deploying the model
  • Have a high-level capability threshold at which securing model weights is very important
  • Do safety research at all
  • Have a safety plan at all; talk publicly about safety strategy at all
    • Have a post like Core Views on AI Safety
    • Have a post like The Checklist
    • On the website, clearly describe a worst-case plausible outcome from AI and state credence in such outcomes (perhaps unconditionally, conditional on no major government action, and/or conditional on leading labs being unable or unwilling to pay a large alignment tax).
  • Whistleblower protections?
    • [Not sure what the ask is. Right to Warn is a starting point. In particular, an anonymous internal reporting pipeline for failures to comply with safety policies is clearly good (but likely inadequate).]
  • Publicly explain the processes and governance structures that determine deployment decisions
    • (And ideally make those processes and structures good)

Comment by Zach Stein-Perlman on Miles Brundage: Finding Ways to Credibly Signal the Benignness of AI Development and Deployment is an Urgent Priority · 2024-10-28T23:10:09.066Z · LW · GW

demonstrating its benigness is not the issue, it's actually making them benign

This also stood out to me.

"Demonstrate benignness" is roughly equivalent to "be transparent enough that the other actor can determine whether you're benign" + "actually be benign."

For now, model evals for dangerous capabilities (along with assurance that you're evaluating the lab's most powerful model + there are no secret frontier AI projects) could suffice to demonstrate benignness. After they no longer work, even just getting good evidence that your AI project [model + risk assessment policy + safety techniques + deployment plans + weights/code security + etc.] is benign is an open problem, never mind the problem of proving that to other actors (never mind doing so securely and at reasonable cost).

Two specific places I think are misleading:

Unfortunately, it is inherently difficult to credibly signal the benignness of AI development and deployment (at a system level or an organization level) due to AI’s status as a general purpose technology, and because the information required to demonstrate benignness may compromise security, privacy, or intellectual property. This makes the research, development, and piloting of new ways to credibly signal the benignness of AI development and deployment, without causing other major problems in the process, an urgent priority.

. . .

We need to define what information is needed in order to convey benignness without unduly compromising intellectual property, privacy, or security.

The bigger problem is (building powerful AI such that your project is benign and) getting strong evidence for yourself that your project is benign. But once you're there, demonstrating benignness to other actors is another problem, lesser but potentially important and under-considered.

Or: this post mostly sets aside the alignment/control problem, and it should have signposted more clearly that it was doing so, but it's still a good post on another problem.

Comment by Zach Stein-Perlman on IAPS: Mapping Technical Safety Research at AI Companies · 2024-10-26T21:31:32.594Z · LW · GW

@Zoe Williams 

Comment by Zach Stein-Perlman on Anthropic rewrote its RSP · 2024-10-25T18:55:16.425Z · LW · GW

Weaker:

  • AI R&D threshold: yes, the threshold is much higher
  • CBRN threshold: not really, except maybe the new threshold excludes risk from moderately sophisticated nonstate actors
  • ASL-3 deployment standard: yes; the changes don't feel huge but the new standard doesn't feel very meaningful
  • ASL-3 security standard: no (but both old and new are quite vague)

Vaguer: yes. But the old RSP didn't really have "precisely defined capability thresholds and mitigation measures." (The ARA threshold did have that 50% definition, but another part of the RSP suggested those tasks were merely illustrative.)

(New RSP, old RSP.)

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-10-24T22:31:17.620Z · LW · GW

Internet comments sections where the comments are frequently insightful/articulate and not too full of obvious-garbage [or at least the top comments, given the comment-sorting algorithm]:

  • LessWrong and EA Forum
  • Daily Nous
  • Stack Exchange [top comments only] (not asserting that it's reliable)
  • Some subreddits?

Where else?

Are there blogposts on adjacent stuff, like why some internet [comments sections / internet-communities-qua-places-for-truthseeking] are better than others? Besides Well-Kept Gardens Die By Pacifism.

Comment by Zach Stein-Perlman on IAPS: Mapping Technical Safety Research at AI Companies · 2024-10-24T21:07:46.141Z · LW · GW

How does this paper suggest "that most safety research doesn't come from the top labs"?

Comment by Zach Stein-Perlman on What AI companies should do: Some rough ideas · 2024-10-22T04:50:49.019Z · LW · GW

Nice. I mostly agree on current margins — the more I mistrust a lab, the more I like transparency.

I observe that unilateral transparency is unnecessary on the view-from-the-inside if you know you're a reliably responsible lab. And some forms of transparency are costly. So the more responsible a lab seems, the more sympathetic we should be to it saying "we thought really hard and decided more unilateral transparency wouldn't be optimific." (For some forms of transparency.)

Comment by Zach Stein-Perlman on What AI companies should do: Some rough ideas · 2024-10-21T20:15:12.685Z · LW · GW

A related/overlapping thing I should collect: basic low-hanging fruit for safety. E.g.:

  • Have a high-level capability threshold at which securing model weights is very important
  • Evaluate whether your models are capable at agentic tasks; look at the score before releasing or internally deploying them

Comment by Zach Stein-Perlman on What AI companies should do: Some rough ideas · 2024-10-21T14:00:44.284Z · LW · GW

I may edit this post in a week based on feedback. You can suggest changes that don't merit their own LW comments here. That said, I'm not sure what feedback I want.

Comment by Zach Stein-Perlman on LLMs can learn about themselves by introspection · 2024-10-19T18:15:52.822Z · LW · GW

I'm confused/skeptical about this being relevant; I thought honesty is orthogonal to whether the model has access to its mental states.

Comment by Zach Stein-Perlman on LLMs can learn about themselves by introspection · 2024-10-18T23:30:57.560Z · LW · GW

I haven't read the paper. I think doing a little work on introspection is worthwhile. But I naively expect that it's quite intractable to do introspection science when we don't have access to the ground truth, and such cases are the only important ones. Relatedly, these tasks are trivialized by letting the model call itself, while doing so gives no help on welfare or "true preferences" introspection questions, if I understand correctly. [Edit: like, the inner states here aren’t the important hidden ones.]

Comment by Zach Stein-Perlman on LLMs can learn about themselves by introspection · 2024-10-18T19:00:15.081Z · LW · GW

Why work on introspection?

Comment by Zach Stein-Perlman on If I have some money, whom should I donate it to in order to reduce expected P(doom) the most? · 2024-10-18T16:49:31.565Z · LW · GW

I don't know. I'm not directly familiar with CLTR's work — my excitement about them is deference-based. (Same for Horizon and TFS, mostly. I inside-view endorse the others I mention.)

Comment by Zach Stein-Perlman on Anthropic rewrote its RSP · 2024-10-16T04:25:36.949Z · LW · GW

Mea culpa. Sorry. Thanks.

Update: I think I've corrected this everywhere I've said it publicly.

Comment by Zach Stein-Perlman on Anthropic rewrote its RSP · 2024-10-16T01:45:53.529Z · LW · GW

No, in that post METR says it's excited about trying auditing, but "it was all under NDA" and "We also didn’t have the access necessary to perform a proper evaluation." Anthropic could commit to share with METR pre-deployment, give them better access, and let them publish stuff about their findings. I don't know if that would turn out well, but Anthropic could be trying much harder.

And auditing doesn't just mean model evals for dangerous capabilities — it could also be for security. (Or procedural stuff, but that doesn't solve the object-level problem.)

Sidenote: credit to Sam Bowman for saying

I think the most urgent safety-related issue that Anthropic can’t directly address is the need for one or, ideally, several widely respected third-party organizations that can play this adjudication role competently.

Comment by Zach Stein-Perlman on Anthropic rewrote its RSP · 2024-10-15T22:38:36.976Z · LW · GW

Oops, I forgot https://www.anthropic.com/rsp-updates. This is great. I really like that Anthropic shares "non-binding descriptions of our future ASL-3 safeguard plans."

Comment by Zach Stein-Perlman on Anthropic rewrote its RSP · 2024-10-15T20:20:04.277Z · LW · GW

(I agree that the intention is surely no more than 6 months; I'm mostly annoyed for legibility—things like this make it harder for me to say "Anthropic has clearly committed to X" for lab-comparison purposes—and LeCun-test reasons)

Comment by Zach Stein-Perlman on Anthropic rewrote its RSP · 2024-10-15T19:41:56.344Z · LW · GW

Thanks.

  1. I disagree; e.g., if "routinely" means at least once per two months, then maybe you do a preliminary assessment at T=5.5 months and then don't do the next until T=7.5 months, and so don't do a comprehensive assessment for over 6 months.
    1. Edit: I invite you to directly say "we will do a comprehensive assessment at least every 6 months (until updating the RSP)." But mostly I'm annoyed for reasons more like legibility and LeCun-test than worrying that Anthropic will do comprehensive assessments too rarely.
  2. [no longer endorsed; mea culpa] I know this is what's going on in y'all's heads but I don't buy that this is a reasonable reading of the original RSP. The original RSP says that 50% on ARA makes it an ASL-3 model. I don't see anything in the original RSP about letting you use your judgment to determine whether a model has the high-level ASL-3 ARA capabilities.
Comment by Zach Stein-Perlman on Model evals for dangerous capabilities · 2024-10-15T18:30:48.300Z · LW · GW

Correction:

On whether Anthropic uses chain-of-thought, I said "Yes, but not explicit, but implied by discussion of elicitation in the RSP and RSP evals report." This impression was also based on conversations with relevant Anthropic staff members. Today, Anthropic says "Some of our evaluations lacked some basic elicitation techniques such as best-of-N or chain-of-thought prompting." Anthropic also says it believes the elicitation gap is small, in tension with previous statements.
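
(For reference, a minimal sketch of what best-of-N plus chain-of-thought elicitation looks like; `generate` and `score` are hypothetical placeholders, not Anthropic's actual eval harness.)

```python
import random

def generate(model, prompt, temperature=1.0):
    """Placeholder for sampling one completion from the model under evaluation."""
    return f"completion of: {prompt[:30]}..."

def score(completion):
    """Placeholder for a task-specific grader returning a numeric score."""
    return random.random()

def best_of_n(model, task_prompt, n=16, use_cot=True):
    # Chain-of-thought prompting: ask the model to reason step by step first.
    prompt = (task_prompt + "\n\nThink step by step before answering.") if use_cot else task_prompt
    # Best-of-N: sample n completions and keep the highest-scoring one as the
    # elicited-capability estimate.
    completions = [generate(model, prompt) for _ in range(n)]
    return max(completions, key=score)
```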

Comment by Zach Stein-Perlman on Anthropic rewrote its RSP · 2024-10-15T17:46:34.490Z · LW · GW

Old ASL-3 standard: see original RSP pp. 7-8 and 21-22. New ASL-3 standard. Quotes below.

Also note the original RSP said

We will publish a more comprehensive list of our implemented ASL-3 security measures below (with additional components not listed here) following the [RAND] report’s publication.

The RAND report was published in late May but this list never appeared.

Also the old RSP was about "containment" rather than just "security": containment is supposed to address risk of model self-exfiltration in addition to risk of weights being stolen. (But not really at ASL-3.) The new RSP is just about security.


Old ASL-3 security:

At ASL-3, labs should harden security against non-state attackers and provide some defense against state-level attackers. We commit to the following security themes. Similarly to ASL-2, this summary previews the key security measures at a high level and is based on the forthcoming RAND report. We will publish a more comprehensive list of our implemented ASL-3 security measures below (with additional components not listed here) following the report’s publication.

These requirements are cumulative above the ASL-2 requirements.

  • At the software level, there should be strict inventory management tracking all software components used in development and deployment. Adhering to specifications like SSDF and SLSA, which includes a secure build pipeline and cryptographic signature enforcement at deployment time, must provide tamper-proof infrastructure. Frequent software updates and compliance monitoring must maintain security over time.
  • On the hardware side, sourcing should focus on security-minded manufacturers and supply chains. Storage of sensitive weights must be centralized and restricted. Cloud network infrastructure must follow secure design patterns.
  • Physical security should involve sweeping premises for intrusions. Hardware should be hardened to prevent external attacks on servers and devices.
  • Segmentation should be implemented throughout the organization to a high threshold limiting blast radius from attacks. Access to weights should be indirect, via managed interfaces rather than direct downloads. Software should place limitations like restricting third-party services from accessing weights directly. Employees must be made aware that weight interactions are monitored. These controls should scale as an organization scales.
  • Ongoing monitoring such as compromise assessments and blocking of malicious queries should be both automated and manual. Limits must be placed on the number of inferences for each set of credentials. Model interactions that could bypass monitoring must be avoided.
  • Organizational policies must aim to enforce security through code, limiting reliance on manual compliance.
  • To scale to meet the risk from people-vectors, insider threat programs should be hardened to require multi-party controls and incentivize reporting risks. Endpoints should be hardened to run only allowed software.
  • Pen-testing, diverse security experience, concrete incident experience, and funding for substantial capacity all should contribute. A dedicated, resourced security red team with ongoing access to design and code must support testing for insider threats. Effective honeypots should be set up to detect attacks.

New ASL-3 security:

When a model must meet the ASL-3 Security Standard, we will evaluate whether the measures we have implemented make us highly protected against most attackers’ attempts at stealing model weights.

We consider the following groups in scope: hacktivists, criminal hacker groups, organized cybercrime groups, terrorist organizations, corporate espionage teams, internal employees,[1] and state-sponsored programs that use broad-based and non-targeted techniques (i.e., not novel attack chains).

The following groups are out of scope for the ASL-3 Security Standard because further testing (as discussed below) should confirm that the model would not meaningfully increase their ability to do harm: state-sponsored programs that specifically target us (e.g., through novel attack chains or insider compromise) and a small number (~10) of non-state actors with state-level resourcing or backing that are capable of developing novel attack chains that utilize 0-day attacks.

To make the required showing, we will need to satisfy the following criteria:

  1. Threat modeling: Follow risk governance best practices, such as use of the MITRE ATT&CK Framework to establish the relationship between the identified threats, sensitive assets, attack vectors and, in doing so, sufficiently capture the resulting risks that must be addressed to protect model weights from theft attempts. As part of this requirement, we should specify our plans for revising the resulting threat model over time.
  2. Security frameworks: Align to and, as needed, extend industry-standard security frameworks for addressing identified risks, such as disclosure of sensitive information, tampering with accounts and assets, and unauthorized elevation of privileges with the appropriate controls. This includes:
    1. Perimeters and access controls: Building strong perimeters and access controls around sensitive assets to ensure AI models and critical systems are protected from unauthorized access. We expect this will include a combination of physical security, encryption, cloud security, infrastructure policy, access management, and weight access minimization and monitoring.
    2. Lifecycle security: Securing links in the chain of systems and software used to develop models, to prevent compromised components from being introduced and to ensure only trusted code and hardware is used. We expect this will include a combination of software inventory, supply chain security, artifact integrity, binary authorization, hardware procurement, and secure research development lifecycle.
    3. Monitoring: Proactively identifying and mitigating threats through ongoing and effective monitoring, testing for vulnerabilities, and laying traps for potential attackers. We expect this will include a combination of endpoint patching, product security testing, log management, asset monitoring, and intruder deception techniques.
    4. Resourcing: Investing sufficient resources in security. We expect meeting this standard of security to require roughly 5-10% of employees being dedicated to security and security-adjacent work.
    5. Existing guidance: Aligning where appropriate with existing guidance on securing model weights, including Securing AI Model Weights, Preventing Theft and Misuse of Frontier Models (2024); security recommendations like Deploying AI Systems Securely (CISA/NSA/FBI/ASD/CCCS/GCSB/GCHQ), ISO 42001, CSA’s AI Safety Initiative, and CoSAI; and standards frameworks like SSDF, SOC 2, NIST 800-53.
  3. Audits: Develop plans to (1) audit and assess the design and implementation of the security program and (2) share these findings (and updates on any remediation efforts) with management on an appropriate cadence. We expect this to include independent validation of threat modeling and risk assessment results; a sampling-based audit of the operating effectiveness of the defined controls; periodic, broadly scoped, and independent testing with expert red-teamers who are industry-renowned and have been recognized in competitive challenges.
  4. Third-party environments: Document how all relevant models will meet the criteria above, even if they are deployed in a third-party partner’s environment that may have a different set of safeguards.

The new standard is more vague and meta, and Anthropic indeed abandoned various specifics from the original RSP. I think it is very far from passing the LeCun test; it's more like talking about security themes than making an object-level if-then commitment. I don't think Anthropic's security at ASL-3 is a huge deal, and I expect Anthropic to be quite non-LeCun-y, but I think this is just too vague for me to feel good about.

In May Anthropic said "around 8% of all Anthropic employees are now working on security-adjacent areas and we expect that proportion to grow further as models become more economically valuable to attackers." Apparently they don't expect to increase that for ASL-3.


Edit: I changed my mind. I agree with Ryan's comment below. I wish we lived in a world where it's optimal to make strong object-level if-then commitments, but we don't, since we don't know which mitigations will be best to focus on. Tying hands to implement specific mitigations would waste resources. Better to make more meta commitments. (Strong versions require external auditing.)

  1. ^

    We will implement robust insider risk controls to mitigate most insider risk, but consider mitigating risks from highly sophisticated state-compromised insiders to be out of scope for ASL-3. We are committed to further enhancing these protections as a part of our ASL-4 preparations.

Comment by Zach Stein-Perlman on Anthropic's Certificate of Incorporation · 2024-10-13T03:00:07.994Z · LW · GW

Quick updates:

  • I worried this was a loophole: "the Trust Agreement also authorizes the Trust to be enforced by the company and by groups of the company’s stockholders who have held a sufficient percentage of the company’s equity for a sufficient period of time." An independent person told me it's a normal Delaware law thing and it's only relevant if the Trust breaks the rules. Yay! This is good news, but I haven't verified it and I'm still somewhat suspicious (but this is mostly on me, not Anthropic, to figure out).
  • I started trying to deduce stuff about who holds how many shares. I think this isn't really doable from public information, e.g. some Series C shares have voting power and some don't and I don't know how to tell which investors have which. But I got a better understanding of this stuff and I now tentatively think the Transfer Approval Threshold is super reasonable. Yay!
    • Also even if I understood the VC investments perfectly I think I'd be confused by the (much larger?) investments by Amazon and Google
    • If you're savvy in this and want to help, let me know and I'll share my rough notes

(On the other hand, it's pretty concerning that the Trust hasn't replaced Jason, who left in December, nor Paul, who left in April. Also note that it hasn't elected a board member to fill Luke's seat.)

I guess my LTBT-related asks for Anthropic are now:

  • Commit that you'll promptly announce publicly if certain big LTBT-related changes happen
  • Publish the Trust Agreement (and any other relevant documents), or at least let a non-Anthropic person I trust read it (under an NDA if you want) and publish whether they have big concerns

(This isn't high-priority for me; I just get annoyed when Anthropic brags about its LTBT without having justified that it's great.)

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-10-09T01:00:13.295Z · LW · GW

I think this post was underrated; I look back at it frequently: AI labs can boost external safety research. (It got some downvotes but no comments — let me know if it's wrong/bad.)

[Edit: it was at 15 karma.]

Comment by Zach Stein-Perlman on Dan Braun's Shortform · 2024-10-05T20:09:51.107Z · LW · GW

I agree safety-by-control kinda requires good security. But safety-by-alignment kinda requires good security too.

Comment by Zach Stein-Perlman on If I have some money, whom should I donate it to in order to reduce expected P(doom) the most? · 2024-10-03T19:10:40.508Z · LW · GW

I am excited about donations to all of the following, in no particular order:

  • AI governance
    • GovAI (mostly research) [actually I haven't checked whether they're funding-constrained]
    • IAPS (mostly research)
    • Horizon (field-building)
    • CLTR (policy engagement)
    • Edit: also probably The Future Society (policy engagement, I think) and others but I'm less confident
  • LTFF/ARM
  • Lightcone

Comment by Zach Stein-Perlman on Base LLMs refuse too · 2024-09-29T23:16:32.251Z · LW · GW

Interesting. Thanks. (If there's a citation for this, I'd probably include it in my discussions of evals best practices.)

Hopefully evals almost never trigger "spontaneous sandbagging"? Hacking and bio capabilities and so forth are generally more like carjacking than torture.

Comment by Zach Stein-Perlman on Leon Lang's Shortform · 2024-09-29T22:02:52.842Z · LW · GW

(Just the justification, of course; fixed.)

Comment by Zach Stein-Perlman on Base LLMs refuse too · 2024-09-29T22:00:40.053Z · LW · GW

I agree noticing whether the model is refusing and, if so, bypassing refusals in some way is necessary for good evals (unless the refusal is super robust—such that you can depend on it for safety during deployment—and you're not worried about rogue deployments). But that doesn't require fine-tuning — possible alternatives include jailbreaking or few-shot prompting. Right?

(Fine-tuning is nice for eliciting stronger capabilities, but that's different.)

Comment by Zach Stein-Perlman on [Completed] The 2024 Petrov Day Scenario · 2024-09-28T18:22:35.496Z · LW · GW

I think I was thinking:

  1. The war room transcripts will leak publicly
  2. Generals can secretly DM each other, while keeping up appearances in the shared channels
  3. If a general believes that all of their communication with their team will leak, we're back to a unilateralist's curse situation: if a general thinks they should nuke, obviously they shouldn't say that to their team, so maybe they nuke unilaterally
    1. (Not obvious whether this is an infohazard)
  4. [Probably some true arguments about the payoff matrix and game theory increase P(mutual destruction). Also some false arguments about game theory — but maybe an infohazard warning makes those less likely to be posted too.]

(Also after I became a general I observed that I didn't know what my "launch code" was; I was hoping the LW team forgot to give everyone launch codes and this decreased P(nukes); saying this would cause everyone to know their launch codes and maybe scare the other team.)

I don't think this is very relevant to real-world infohazards, because this is a game with explicit rules and because in the real world the low-hanging infohazards have been shared, but it seems relevant to mechanism design.

Comment by Zach Stein-Perlman on [Completed] The 2024 Petrov Day Scenario · 2024-09-27T18:30:59.210Z · LW · GW

Update with two new responses:

I think this is 10 generals, 1 petrov, and one other person (either the other petrov or a citizen, not sure, wasn't super rigorous)

Comment by Zach Stein-Perlman on [Completed] The 2024 Petrov Day Scenario · 2024-09-26T18:51:49.845Z · LW · GW

The post says generals' names will be published tomorrow.

Comment by Zach Stein-Perlman on [Completed] The 2024 Petrov Day Scenario · 2024-09-26T18:50:51.202Z · LW · GW

No. I noticed ~2 more subtle infohazards and I was wishing for nobody to post them and I realized I can decrease that probability by making an infohazard warning.

I ask that you refrain from being the reason that security-by-obscurity fails, if you notice subtle infohazards.

Comment by Zach Stein-Perlman on [Completed] The 2024 Petrov Day Scenario · 2024-09-26T18:29:37.419Z · LW · GW

Some true observations are infohazards, making destruction more likely. Please think carefully before posting observations. Even if you feel clever. You can post hashes here instead to later reveal how clever you were, if you need.

LOOSE LIPS SINK SHIPS

Comment by Zach Stein-Perlman on [Completed] The 2024 Petrov Day Scenario · 2024-09-26T18:06:27.615Z · LW · GW

I think it’s better to be angry at the team that launched the nukes?

Comment by Zach Stein-Perlman on Mira Murati leaves OpenAI/ OpenAI to remove non-profit control · 2024-09-26T02:45:00.888Z · LW · GW

Two other leaders are also leaving