New voluntary commitments (AI Seoul Summit) 2024-05-21T11:00:41.794Z
DeepMind's "Frontier Safety Framework" is weak and unambitious 2024-05-18T03:00:13.541Z
DeepMind: Frontier Safety Framework 2024-05-17T17:30:02.504Z
Ilya Sutskever and Jan Leike resign from OpenAI [updated] 2024-05-15T00:45:02.436Z
Questions for labs 2024-04-30T22:15:55.362Z
Introducing AI Lab Watch 2024-04-30T17:00:12.652Z
Staged release 2024-04-17T16:00:19.402Z
DeepMind: Evaluating Frontier Models for Dangerous Capabilities 2024-03-21T03:00:31.599Z
OpenAI: Preparedness framework 2023-12-18T18:30:10.153Z
Anthropic, Google, Microsoft & OpenAI announce Executive Director of the Frontier Model Forum & over $10 million for a new AI Safety Fund 2023-10-25T15:20:52.765Z
OpenAI-Microsoft partnership 2023-10-03T20:01:44.795Z
Current AI safety techniques? 2023-10-03T19:30:54.481Z
ARC Evals: Responsible Scaling Policies 2023-09-28T04:30:37.140Z
How to think about slowing AI 2023-09-17T16:00:42.150Z
Cruxes for overhang 2023-09-14T17:00:56.609Z
Cruxes on US lead for some domestic AI regulation 2023-09-10T18:00:06.959Z
Which paths to powerful AI should be boosted? 2023-08-23T16:00:00.790Z
Which possible AI systems are relatively safe? 2023-08-21T17:00:27.582Z
AI labs' requests for input 2023-08-18T17:00:26.377Z
Boxing 2023-08-02T23:38:36.119Z
Frontier Model Forum 2023-07-26T14:30:02.018Z
My favorite AI governance research this year so far 2023-07-23T16:30:00.558Z
Incident reporting for AI safety 2023-07-19T17:00:57.429Z
Frontier AI Regulation 2023-07-10T14:30:06.366Z
AI labs' statements on governance 2023-07-04T16:30:01.624Z
DeepMind: Model evaluation for extreme risks 2023-05-25T03:00:00.915Z
GovAI: Towards best practices in AGI safety and governance: A survey of expert opinion 2023-05-15T01:42:41.012Z
Stopping dangerous AI: Ideal US behavior 2023-05-09T21:00:55.187Z
Stopping dangerous AI: Ideal lab behavior 2023-05-09T21:00:19.505Z
Slowing AI: Crunch time 2023-05-03T15:00:12.495Z
Ideas for AI labs: Reading list 2023-04-24T19:00:00.832Z
Slowing AI: Interventions 2023-04-18T14:30:35.746Z
AI policy ideas: Reading list 2023-04-17T19:00:00.604Z
Slowing AI: Foundations 2023-04-17T14:30:09.427Z
Slowing AI: Reading list 2023-04-17T14:30:02.467Z
FLI report: Policymaking in the Pause 2023-04-15T17:01:06.727Z
FLI open letter: Pause giant AI experiments 2023-03-29T04:04:23.333Z
Operationalizing timelines 2023-03-10T16:30:01.654Z
Taboo "compute overhang" 2023-03-01T19:15:02.515Z
The public supports regulating AI for safety 2023-02-17T04:10:03.307Z
Framing AI strategy 2023-02-07T19:20:04.535Z
AI safety milestones? 2023-01-23T21:00:24.441Z
Sealed predictions thread 2022-05-07T18:00:04.705Z
Rationalism for New EAs 2021-10-18T16:00:18.692Z
Great Power Conflict 2021-09-17T15:00:17.039Z
Zach Stein-Perlman's Shortform 2021-08-29T18:00:56.148Z
The Governance Problem and the "Pretty Good" X-Risk 2021-08-29T18:00:28.190Z


Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-05-23T04:51:28.112Z · LW · GW

Now OpenAI publicly said "we're releasing former employees from existing nondisparagement obligations unless the nondisparagement provision was mutual." This seems to be self-executing; by saying it, OpenAI made it true.


Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-05-22T23:34:02.833Z · LW · GW

We know various people who've left OpenAI and might criticize it if they could. Either most of them will soon say they're free or we can infer that OpenAI was lying/misleading.

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-05-22T23:00:07.719Z · LW · GW

New Kelsey Piper article and twitter thread on OpenAI equity & non-disparagement.

It has lots of little things that make OpenAI look bad. It further confirms that OpenAI threatened to revoke equity unless employees signed the non-disparagement agreements. Plus it shows Altman's signature on documents giving the company broad power over employees' equity — perhaps he doesn't read every document he signs, but this one seems quite important. This is all in tension with Altman's recent tweet that "vested equity is vested equity, full stop" and "i did not know this was happening." Plus "we have never clawed back anyone's vested equity, nor will we do that if people do not sign a separation agreement (or don't agree to a non-disparagement agreement)" is misleading given that they apparently regularly threatened to do so (or something equivalent — letting the employee nominally keep their PPUs but disallowing them from selling them) whenever an employee left.

Great news:

OpenAI told me that “we are identifying and reaching out to former employees who signed a standard exit agreement to make it clear that OpenAI has not and will not cancel their vested equity and releases them from nondisparagement obligations”

(Unless "employees who signed a standard exit agreement" is doing a lot of work — maybe a substantial number of employees technically signed nonstandard agreements.)

I hope to soon hear from various people that they have been freed from their nondisparagement obligations.

Update: OpenAI says:

As we shared with employees today, we are making important updates to our departure process. We have not and never will take away vested equity, even when people didn't sign the departure documents. We're removing nondisparagement clauses from our standard departure paperwork, and we're releasing former employees from existing nondisparagement obligations unless the nondisparagement provision was mutual. We'll communicate this message to former employees. We're incredibly sorry that we're only changing this language now; it doesn't reflect our values or the company we want to be.

[Low-effort post; might have missed something important.]

[Substantively edited after posting.]

Comment by Zach Stein-Perlman on jacquesthibs's Shortform · 2024-05-21T20:54:27.742Z · LW · GW

50% I'll do this in the next two months if nobody else does. But not right now, and someone else should do it too.

Off the top of my head (this is not the list you asked for, just an outline):

  • Loopt stuff
  • YC stuff
  • YC removal
  • NDAs
    • And deceptive communication recently
    • And maybe OpenAI's general culture of don't publicly criticize OpenAI
  • Profit cap non-transparency
  • Superalignment compute
  • Two exoduses of safety people; negative stuff people-who-quit-OpenAI sometimes say
  • Telling board members not to talk to employees
  • Board crisis stuff
    • OpenAI executives telling the board Altman lies
    • The board saying Altman lies
    • Lying about why he wanted to remove Toner
    • Lying to try to remove Toner
    • Returning
    • Inadequate investigation + spinning results

Stuff not worth including:

  • Reddit stuff - unconfirmed
  • Financial conflict-of-interest stuff - murky and not super important
  • Misc instances of saying-what's-convenient (e.g. OpenAI should scale because of the prospect of compute overhang and the $7T chip investment thing) - idk, maybe, also interested in more examples
  • Johansson & Sky - not obvious that OpenAI did something bad, but it would be nice for OpenAI to say "we had plans for a Johansson voice and we dropped that when Johansson said no," but if that was true they'd have said it...

What am I missing? Am I including anything misleading or not-worth-it?

Comment by Zach Stein-Perlman on New voluntary commitments (AI Seoul Summit) · 2024-05-21T15:29:07.712Z · LW · GW

Quoting me from last time you said this:

The label "RSP" isn't perfect but it's kinda established now. My friends all call things like this "RSPs." . . . I predict change in terminology will happen ~iff it's attempted by METR or multiple frontier labs together. For now, I claim we should debate terminology occasionally but follow standard usage when trying to actually communicate.

Comment by Zach Stein-Perlman on Questions for labs · 2024-05-21T06:58:10.824Z · LW · GW

Maybe setting up custom fine-tuning is hard and labs often only set it up during deployment...

(Separately, it would be nice if OpenAI and Anthropic let some safety researchers do fine-tuning now.)

Comment by Zach Stein-Perlman on Anthropic: Reflections on our Responsible Scaling Policy · 2024-05-21T05:37:34.750Z · LW · GW

I think Frontier Red Team is about eliciting model capabilities and Alignment Stress Testing is about "red-team[ing] Anthropic’s alignment techniques and evaluations, empirically demonstrating ways in which Anthropic’s alignment strategies could fail."

Comment by Zach Stein-Perlman on Introducing AI Lab Watch · 2024-05-21T03:44:46.838Z · LW · GW


Any takes on what info a company could publish to demonstrate "the adequacy of its safety culture and governance"? (Or recommended reading?)

Ideally criteria are objectively evaluable / minimize illegible judgment calls.

Comment by Zach Stein-Perlman on DeepMind's "Frontier Safety Framework" is weak and unambitious · 2024-05-20T17:47:02.162Z · LW · GW


Deployment mitigations level 2 discusses the need for mitigations on internal deployments.

Good point; this makes it clearer that "deployment" means external deployment by default. But level 2 only mentions "internal access of the critical capability," which sounds like it's about misuse — I'm more worried about AI scheming and escaping when the lab uses AIs internally to do AI development.

ML R&D will require thinking about internal deployments (and so will many of the other CCLs).

OK. I hope DeepMind does that thinking and makes appropriate commitments.

two-party control

Thanks. I'm pretty ignorant on this topic.

"every 3 months of fine-tuning progress" was meant to capture [during deployment] as well


Comment by Zach Stein-Perlman on Anthropic: Reflections on our Responsible Scaling Policy · 2024-05-20T06:18:26.876Z · LW · GW


I'm glad to see that the non-compliance reporting policy has been implemented and includes anonymous reporting. I'm still hoping to see more details. (And I'm generally confused about why Anthropic doesn't share more details on policies like this — I fail to imagine a story about how sharing details could be bad, except that the details would be seen as weak and this would make Anthropic look bad.)

What details are you imagining would be helpful for you? Sharing the PDF of the formal policy document doesn't mean much compared to whether it's actually implemented and upheld and treated as a live option that we expect staff to consider (fwiw: it is, and I don't have a non-disparage agreement). On the other hand, sharing internal docs eats a bunch of time in reviewing it before release, chance that someone seizes on a misinterpretation and leaps to conclusions, and other costs.

Not sure. I can generally imagine a company publishing what Anthropic has published but having a weak/fake system in reality. Policy details do seem less important for non-compliance reporting than some other policies — Anthropic says it has an infohazard review policy, and I expect it's good, but I'm not confident, and for other companies I wouldn't necessarily expect that their policy is good (even if they say a formal policy exists), and seeing details (with sensitive bits redacted) would help.

I mostly take back my "secret policy is strong evidence of bad policy" insinuation — that's ~true on my home planet, but on Earth you don't get sufficient credit for sharing good policies and there's substantial negative EV from misunderstandings and adversarial interpretations, so I guess it's often correct not to share :(

As an 80/20 of publishing, maybe you could share a policy with an external auditor who would then publish whether they think it's good or have concerns. I would feel better if that happened all the time.

Comment by Zach Stein-Perlman on Anthropic: Reflections on our Responsible Scaling Policy · 2024-05-20T05:34:32.920Z · LW · GW

I think this is implicit — the RSP discusses deployment mitigations, which can't be enforced if the weights are shared.

Comment by Zach Stein-Perlman on Anthropic: Reflections on our Responsible Scaling Policy · 2024-05-20T05:32:01.972Z · LW · GW

No major news here, but some minor good news, and independent of news/commitments/achievements I'm always glad when labs share thoughts like this. Misc reactions below.

Probably the biggest news is the Claude 3 evals report. I haven't read it yet. But at a glance I'm confused: it sounds like "red line" means ASL-3 but they also operationalize "yellow line" evals and those sound like the previously-discussed ASL-3 evals. Maybe red is actual ASL-3 and yellow is supposed to be at least 6x effective compute lower, as a safety buffer.

"Assurance Mechanisms . . . . should ensure that . . . our safety and security mitigations are validated publicly or by disinterested experts." This sounds great. I'm not sure what it looks like in practice. I wish it was clearer what assurance mechanisms Anthropic expects or commits to implement and when, and especially whether they're currently doing anything along the lines of "validated publicly or by disinterested experts." (Also whether "validated" means "determined to be sufficient if implemented well" or "determined to be implemented well.")

Something that was ambiguous in the RSP and is still ambiguous here: during training, if Anthropic reaches "3 months since last eval" before "4x since last eval," do they do evals? Or does the "3 months" condition only apply after training?

I'm glad to see that the non-compliance reporting policy has been implemented and includes anonymous reporting. I'm still hoping to see more details. (And I'm generally confused about why Anthropic doesn't share more details on policies like this — I fail to imagine a story about how sharing details could be bad, except that the details would be seen as weak and this would make Anthropic look bad.)

Some other hopes for the RSP, off the top of my head:

  • ASL-4 definition + operationalization + mitigations, including generally how Anthropic will think about safety cases after the "no dangerous capabilities" safety case doesn't work anymore
  • Clarifying security commitments (when the RAND report on securing model weights comes out)
  • Dangerous capability evals by external auditors, e.g. METR

Comment by Zach Stein-Perlman on Ilya Sutskever and Jan Leike resign from OpenAI [updated] · 2024-05-18T21:30:38.525Z · LW · GW

Edit: nevermind; maybe this tweet is misleading and narrow and just about restoring people's vested equity; I'm not sure what that means in the context of OpenAI's pseudo-equity but possibly this tweet isn't a big commitment.

@gwern I'm interested in your take on this new Altman tweet:

we have never clawed back anyone's vested equity, nor will we do that if people do not sign a separation agreement (or don't agree to a non-disparagement agreement). vested equity is vested equity, full stop.

there was a provision about potential equity cancellation in our previous exit docs; although we never clawed anything back, it should never have been something we had in any documents or communication. this is on me and one of the few times i've been genuinely embarrassed running openai; i did not know this was happening and i should have.

the team was already in the process of fixing the standard exit paperwork over the past month or so. if any former employee who signed one of those old agreements is worried about it, they can contact me and we'll fix that too. very sorry about this.

In particular "i did not know this was happening"

Comment by Zach Stein-Perlman on DeepMind's "Frontier Safety Framework" is weak and unambitious · 2024-05-18T17:14:32.790Z · LW · GW

Sorry for brevity.

We just disagree. E.g. you "walked away with a much better understanding of how OpenAI plans to evaluate & handle risks than how Anthropic plans to handle & evaluate risks"; I felt like Anthropic was thinking about most stuff better.

I think Anthropic's ASL-3 is reasonable and OpenAI's thresholds and corresponding commitments are unreasonable. If the ASL-4 threshold were high, or the commitments poor enough that ASL-4 was meaningless, I agree Anthropic's RSP would be at least as bad as OpenAI's.

One thing I think is a big deal: Anthropic's RSP treats internal deployment like external deployment; OpenAI's has almost no protections for internal deployment.

I agree "an initial RSP that mostly spells out high-level reasoning, makes few hard commitments, and focuses on misuse while missing the all-important evals and safety practices for ASL-4" is also a fine characterization of Anthropic's current RSP.

Quick edit: PF thresholds are too high; PF seems doomed / not on track. But RSPv1 is consistent with RSPv1.1 being great. At least Anthropic knows and says there’s a big hole. That's not super relevant to evaluating labs' current commitments but is very relevant to predicting.

Comment by Zach Stein-Perlman on Akash's Shortform · 2024-05-18T17:00:08.975Z · LW · GW

Sorry for brevity, I'm busy right now.

  1. Noticing good stuff labs do, not just criticizing them, is often helpful. I wish you thought of this work more as "evaluation" than "criticism."
  2. It's often important for evaluation to be quite truth-tracking. Criticism isn't obviously good by default.
  3. I'm pretty sure OP likes good criticism of the labs; no comment on how OP is perceived. And I think I don't understand your "good judgment" point. Feedback I've gotten on AI Lab Watch from senior AI safety people has been overwhelmingly positive, and of course there's a selection effect in what I hear, but I'm quite sure most of them support such efforts.
  4. Conjecture (not exclusively) has done things that frustrated me, including in dimensions like being "'unilateralist,' 'not serious,' and 'untrustworthy.'" I think most criticism of Conjecture-related advocacy is legitimate and not just because people are opposed to criticizing labs.
  5. I do agree on "soft power" and some of "jobs." People often don't criticize the labs publicly because they're worried about negative effects on them, their org, or people associated with them.

Comment by Zach Stein-Perlman on DeepMind's "Frontier Safety Framework" is weak and unambitious · 2024-05-18T05:35:59.693Z · LW · GW


Two weeks ago I sent a senior DeepMind staff member some "Advice on RSPs, especially for avoiding ambiguities"; #1 on my list was "Clarify how your deployment commitments relate to internal deployment, not just external deployment" (since it's easy and the OpenAI PF also did a bad job of this)


Edit: Rohin says "deployment" means external deployment by default and notes that the doc mentions "internal access" as distinct from deployment.

Comment by Zach Stein-Perlman on Ilya Sutskever and Jan Leike resign from OpenAI [updated] · 2024-05-17T17:19:58.177Z · LW · GW

Added updates to the post. Updating it as stuff happens. Not paying much attention; feel free to DM me or comment with more stuff.

Comment by Zach Stein-Perlman on Ilya Sutskever and Jan Leike resign from OpenAI [updated] · 2024-05-15T07:00:45.984Z · LW · GW

The commitment—"20% of the compute we've secured to date" (in July 2023), to be used "over the next four years"—may be quite little in 2027, with compute use increasing exponentially. I'm confused about why people think it's a big commitment.
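A rough back-of-the-envelope sketch of why the commitment shrinks in relative terms (all numbers here are made up for illustration — the growth rate and normalization are assumptions, not OpenAI figures):

```python
# Hypothetical illustration: if secured compute grows ~3x per year, then
# "20% of compute secured as of mid-2023, spent over four years" is a small
# share of the compute secured by 2027.
GROWTH = 3.0           # assumed annual growth factor in secured compute
YEARS = 4              # the commitment window (mid-2023 to mid-2027)
c0 = 1.0               # compute secured as of July 2023 (normalized to 1)
committed = 0.20 * c0  # the superalignment commitment

secured_2027 = c0 * GROWTH ** YEARS       # stock secured by 2027 (3^4 = 81x)
share = committed / secured_2027
print(f"{share:.1%}")  # ~0.2% of 2027-era secured compute
```

Under these assumed numbers the commitment is well under 1% of 2027-era compute, which is the point of the complaint above; a slower growth assumption softens but doesn't reverse the conclusion.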

Comment by Zach Stein-Perlman on Ilya Sutskever and Jan Leike resign from OpenAI [updated] · 2024-05-15T05:40:18.108Z · LW · GW

Two other executives left two weeks ago, but that's not obviously safety-related.

Comment by Zach Stein-Perlman on OpenAI releases GPT-4o, natively interfacing with text, voice and vision · 2024-05-14T05:50:49.955Z · LW · GW

Full quote:

We’ve evaluated GPT-4o according to our Preparedness Framework and in line with our voluntary commitments. Our evaluations of cybersecurity, CBRN, persuasion, and model autonomy show that GPT-4o does not score above Medium risk in any of these categories. This assessment involved running a suite of automated and human evaluations throughout the model training process. We tested both pre-safety-mitigation and post-safety-mitigation versions of the model, using custom fine-tuning and prompts, to better elicit model capabilities.

GPT-4o has also undergone extensive external red teaming with 70+ external experts in domains such as social psychology, bias and fairness, and misinformation to identify risks that are introduced or amplified by the newly added modalities. We used these learnings to build out our safety interventions in order to improve the safety of interacting with GPT-4o. We will continue to mitigate new risks as they’re discovered.

[Edit after Simeon replied: I disagree with your interpretation that they're being intentionally very deceptive. But I am annoyed by (1) them saying "We’ve evaluated GPT-4o according to our Preparedness Framework" when the PF doesn't contain specific evals and (2) them taking credit for implementing their PF when they're not meeting its commitments.]

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-05-13T23:10:01.719Z · LW · GW

How can you make the case that a model is safe to deploy? For now, you can do risk assessment and notice that it doesn't have dangerous capabilities. What about in the future, when models do have dangerous capabilities? Here are four options:

  1. Implement safety measures as a function of risk assessment results, such that the measures feel like they should be sufficient to abate the risks
    1. This is mostly what Anthropic's RSP does (at least so far — maybe it'll change when they define ASL-4)
  2. Use risk assessment techniques that evaluate safety given deployment safety practices
    1. This is mostly what OpenAI's PF is supposed to do (measure "post-mitigation risk"), but the details of their evaluations and mitigations are very unclear
  3. Do control evaluations
  4. Achieve alignment (and get strong evidence of that)

Related: RSPs, safety cases.

Maybe lots of risk comes from the lab using AIs internally to do AI development. The first two options are fine for preventing catastrophic misuse from external deployment but I worry they struggle to measure risks related to scheming and internal deployment.

Comment by Zach Stein-Perlman on OpenAI releases GPT-4o, natively interfacing with text, voice and vision · 2024-05-13T19:41:47.404Z · LW · GW

Safety-wise, they claim to have run it through their Preparedness framework and the red-team of external experts.

I'm disappointed and I think they shouldn't get much credit PF-wise: they haven't published their evals, published a report on results, or even published a high-level "scorecard." They are not yet meeting the commitments in their beta Preparedness Framework — some stuff is unclear but at the least publishing the scorecard is an explicit commitment.

(It's now been six months since they published the beta PF!)

[Edit: not to say that we should feel much better if OpenAI was successfully implementing its PF -- the thresholds are way too high and it says nothing about internal deployment.]

Comment by Zach Stein-Perlman on Introducing AI Lab Watch · 2024-05-12T07:47:10.474Z · LW · GW

There should be points for how the organizations act wrt legislation. In the SB 1047 bill that CAIS co-sponsored, we've noticed some AI companies to be much more antagonistic than others. I think [this] is probably a larger differentiator for an organization's goodness or badness.

If there's a good writeup on labs' policy advocacy I'll link to and maybe defer to it.

Comment by Zach Stein-Perlman on RobertM's Shortform · 2024-05-12T02:34:09.147Z · LW · GW

Adding to the confusion: I've nonpublicly heard from people at UK AISI and [OpenAI or Anthropic] that the Politico piece is very wrong and DeepMind isn't the only lab doing pre-deployment sharing (and that it's hard to say more because info about not-yet-deployed models is secret). But no clarification on commitments.

Comment by Zach Stein-Perlman on simeon_c's Shortform · 2024-05-11T05:41:40.170Z · LW · GW

But everyone has lots of duties to keep secrets or preserve privacy and the ones put in writing often aren't the most important. (E.g. in your case.)

I've signed ~3 NDAs. Most of them are irrelevant now and useless for people to know about, like yours.

I agree in special cases it would be good to flag such things — like agreements to not share your opinions on a person/org/topic, rather than just keeping trade secrets private.

Comment by Zach Stein-Perlman on Introducing AI Lab Watch · 2024-05-11T05:35:08.409Z · LW · GW

Related: maybe a lab should get full points for a risky release if the lab says it's releasing because the benefits of [informing / scaring / waking-up] people outweigh the direct risk of existential catastrophe and other downsides. It's conceivable that a perfectly responsible lab would do such a thing.

Capturing all nuances can trade off against simplicity and legibility. (But my criteria are not yet on the efficient frontier or whatever.)

Comment by Zach Stein-Perlman on Introducing AI Lab Watch · 2024-05-10T04:52:22.383Z · LW · GW

Thanks. I agree you're pointing at something flawed in the current version and generally thorny. Strong-upvoted and strong-agreevoted.

Generally, the deployment criteria should be gated behind "has a plan to do this when models are actually powerful and their implementation of the plan is credible".

I didn't put much effort into clarifying this kind of thing because it's currently moot—I don't think it would change any lab's score—but I agree.[1] I think e.g. a criterion "use KYC" should technically be replaced with "use KYC OR say/demonstrate that you're prepared to implement KYC and have some capability/risk threshold to implement it and [that threshold isn't too high]."

Don't pass cost benefit for current models which pose low risk. (And it seems the criteria is "do you have them implemented right now?") . . . .

(A general problem with this project is somewhat arbitrarily requiring specific countermeasures. I think this is probably intrinsic to the approach I'm afraid.)

Yeah. The criteria can be like "implement them or demonstrate that you could implement them and have a good plan to do so," but it would sometimes be reasonable for the lab to not have done this yet. (Especially for non-frontier labs; the deployment criteria mostly don't work well for evaluating non-frontier labs. Also if demonstrating that you could implement something is difficult, even if you could implement it.)

I get the sense that this criteria doesn't quite handle the necessary edge cases to handle reasonable choices orgs might make.

I'm interested in suggestions :shrug:

  1. ^

    And I think my site says some things that contradict this principle, like 'these criteria require keeping weights private.' Oops.

Comment by Zach Stein-Perlman on Introducing AI Lab Watch · 2024-05-09T02:30:02.532Z · LW · GW

Two noncentral pages I like on the site:

Comment by Zach Stein-Perlman on Questions for labs · 2024-05-08T23:00:29.898Z · LW · GW

Yay @Zac Hatfield-Dodds of Anthropic for feedback and corrections including clarifying a couple of Anthropic's policies. Two pieces of not-previously-public information:

  • I was disappointed that Anthropic's Responsible Scaling Policy only mentions evaluation "During model training and fine-tuning." Zac told me "this was a simple drafting error - our every-three months evaluation commitment is intended to continue during deployment. This has been clarified for the next version, and we've been acting accordingly all along." Yay.
  • I said labs should have a "process for staff to escalate concerns about safety" and "have a process for staff and external stakeholders to share concerns about risk assessment policies or their implementation with the board and some other staff, including anonymously." I noted that Anthropic's RSP includes a commitment to "Implement a non-compliance reporting policy." Zac told me "Beyond standard internal communications channels, our recently formalized non-compliance reporting policy meets these criteria [including independence], and will be described in the forthcoming RSP v1.1." Yay.

I think it's cool that Zac replied (but most of my questions for Anthropic remain).

I have not yet received substantive corrections/clarifications from any other labs.

(I have made some updates based on Zac's feedback—and revised Anthropic's score from 45 to 48—but have not resolved all of it.)

Comment by Zach Stein-Perlman on How do open AI models affect incentive to race? · 2024-05-07T02:54:45.061Z · LW · GW

I mostly agree. And I think when people say race dynamics they often actually mean speed of progress and especially "Effects of open models on ease of training closed models [and open models]," which you mention.

But here is a race-dynamics story:

Alice has the best open model. She prefers for AI progress to slow down but also prefers to have the best open model (for reasons of prestige or, if different companies' models are not interchangeable, future market share). Bob releases a great open model. This incentivizes Alice to release a new state-of-the-art model sooner.

Comment by Zach Stein-Perlman on Introducing AI Lab Watch · 2024-05-05T21:14:24.133Z · LW · GW

Yep, lots of people independently complain about "lab." Some of those people want me to use scary words in other places too, like replacing "diffusion" with "proliferation." I wouldn't do that, and don't replace "lab" with "mega-corp" or "juggernaut," because it seems [incorrect / misleading / low-integrity].

I'm sympathetic to the complaint that "lab" is misleading. (And I do use "company" rather than "lab" occasionally, e.g. in the header.) My friends usually talk about "the labs," not "the companies," but to most audiences "company" is more accurate.

I currently think "company" is about as good as "lab." I may change the term throughout the site at some point.

Comment by Zach Stein-Perlman on Introducing AI Lab Watch · 2024-05-05T18:20:04.136Z · LW · GW

This kind of feedback is very helpful to me; thank you! Strong-upvoted and weak-agreevoted.

(I have some factual disagreements. I may edit them into this comment later.)

(If you think Dan's comment makes me suspect this project is full of issues/mistakes, react 💬 and I'll consider writing a detailed soldier-ish reply.)

Comment by Zach Stein-Perlman on Questions for labs · 2024-05-02T18:58:23.395Z · LW · GW

Thanks. Briefly:

I'm not sure what the theory of change for listing such questions is.

In the context of policy advocacy, I think it's sometimes fine/good for labs to say somewhat different things publicly vs privately. Like, if I was in charge of a lab and believed (1) the EU AI Act will almost certainly pass and (2) it has some major bugs that make my life harder without safety benefits, I might publicly say "I support (the goals of) the EU AI Act" and privately put some effort into removing those bugs, which is technically lobbying to weaken the Act.

(^I'm not claiming that particular labs did ~this rather than actually lobby against the Act. I just think it's messy and regulation isn't a one-dimensional thing that you're for or against.)

Edit: this comment was misleading and partially replied to a strawman. I agree it would be good for the labs and their leaders to publicly say some things about recommended regulation (beyond what they already do) and their lobbying. I'm nervous about trying to litigate rumors for reasons I haven't explained.

Edit 2: based on [sources] and background information, I believe that OpenAI, Microsoft, Google, and Meta privately lobbied to make the EU AI Act worse—especially by lobbying against rules for foundation models—and that this is inconsistent with OpenAI's and Altman's public statements.

Comment by Zach Stein-Perlman on Questions for labs · 2024-05-01T17:59:50.013Z · LW · GW

This post is not trying to shame labs for failing to answer before; I didn't try hard to get them to answer. (The period was one week, but I wasn't expecting answers to my email and wouldn't have expected replies even if I'd waited longer.)

(Separately, I kinda hope the answers to basic questions like this are already written down somewhere...)

Comment by Zach Stein-Perlman on Introducing AI Lab Watch · 2024-04-30T21:42:06.091Z · LW · GW

Google sheet.

Some overall scores are one point higher. Probably because my site rounds down. Probably my site should round to the nearest integer...
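A minimal sketch of the discrepancy, using a hypothetical weighted score (the specific value is illustrative, not taken from the actual spreadsheet):

```python
import math

# Hypothetical overall score after averaging weighted subscores.
raw_score = 6.7

floored = math.floor(raw_score)  # rounding down, as the site currently does: 6
nearest = round(raw_score)       # rounding to the nearest integer, as the sheet does: 7

print(floored, nearest)  # → 6 7
```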

Comment by Zach Stein-Perlman on Introducing AI Lab Watch · 2024-04-30T21:32:22.827Z · LW · GW

Thanks for the feedback. I'll add "let people download all the data" to my todo list but likely won't get to it. I'll make a simple google sheet now.

Comment by Zach Stein-Perlman on Tamsin Leake's Shortform · 2024-04-22T01:59:56.429Z · LW · GW

This is too strong. For example, releasing the product would be correct if someone else would do something similar soon anyway and you're safer than them and releasing first lets you capture more of the free energy. (That's not the case here, but it's not as straightforward as you suggest, especially with your "Regardless of how good their alignment plans are" and your claim "There's just no good reason to do that, except short-term greed".)

Comment by Zach Stein-Perlman on Express interest in an "FHI of the West" · 2024-04-19T19:18:03.487Z · LW · GW

Constellation (which I think has some important FHI-like virtues, although makes different tradeoffs and misses on others)

What is Constellation missing or what should it do? (Especially if you haven't already told the Constellation team this.)

Comment by Zach Stein-Perlman on FHI (Future of Humanity Institute) has shut down (2005–2024) · 2024-04-18T06:00:01.117Z · LW · GW

Harry let himself be pulled, but as Hermione dragged him away, he said, raising his voice even louder, "It is entirely possible that in a thousand years, the fact that FHI was at Oxford will be the only reason anyone remembers Oxford!"

Comment by Zach Stein-Perlman on Staged release · 2024-04-17T16:32:17.891Z · LW · GW

Yes, but possibly the lab has its own private scaffolding that works better with its model than any other existing scaffolding (perhaps because it trained the model to use that specific scaffolding), and it could initially withhold that scaffolding from users.

(Maybe it’s impossible to give API access to scaffolding and keep the scaffolding private? Idk.)

Edit: Plus what David says.

Comment by Zach Stein-Perlman on RTFB: On the New Proposed CAIP AI Bill · 2024-04-17T00:14:45.064Z · LW · GW

Suppose you can take an action that decreases net P(everyone dying) but increases P(you yourself kill everyone), and leaves all else equal. I claim you should take it; everyone is better off if you take it.

I deny "deontological injunctions." I want you and everyone to take the actions that lead to the best outcomes, not that keep your own hands clean. I'm puzzled by your expectation that I'd endorse "deontological injunctions."

This situation seems identical to the trolley problem in the relevant ways. I think you should avoid letting people die, not just avoid killing people.

[Note: I roughly endorse heuristics like if you're contemplating crazy-sounding actions for strange-sounding reasons, you should suspect that you're confused about your situation or the effects of your actions, and you should be more cautious than your naive calculations suggest. But that's very different from deontology.]

Comment by Zach Stein-Perlman on Anthropic AI made the right call · 2024-04-16T20:50:00.293Z · LW · GW

I guess I'm more willing to treat Anthropic's marketing as not-representing-Anthropic.

Like, when OpenAI marketing says "GPT-4 is our most aligned model yet!" you could say this shows that OpenAI deeply misunderstands alignment, but I tend to ignore it. Even mostly when Sam Altman says it himself.

[Edit after habryka's reply: my weak independent impression is that often the marketing people say stuff that the leadership and most technical staff disagree with, and if you use marketing-speak to substantially predict what-leadership-and-staff-believe you'll make worse predictions.]

Comment by Zach Stein-Perlman on Anthropic AI made the right call · 2024-04-15T05:02:04.459Z · LW · GW

I guess I'm more willing to treat Anthropic's marketing as not-representing-Anthropic. Shrug. [Edit: like, maybe it's consistent-with-being-a-good-guy-and-basically-honest to exaggerate your product in a similar way to everyone else. (You risk the downsides of creating hype but that's a different issue than the integrity thing.)]

It is disappointing that Anthropic hasn't clarified its commitments after the post-launch confusion, one way or the other.

Comment by Zach Stein-Perlman on Anthropic AI made the right call · 2024-04-15T04:44:50.152Z · LW · GW

Sorry for using the poll to support a different proposition. Edited.

To make sure I understand your position (and Ben's):

  1. Dario committed to Dustin that Anthropic wouldn't "meaningfully advance the frontier" (according to Dustin)
  2. Anthropic senior staff privately gave AI safety people the impression that Anthropic would stay behind/at the frontier (although nobody has quotes)
  3. Claude 3 Opus meaningfully advanced the frontier? Or slightly advanced it but Anthropic markets it like it was a substantial advance so they're being similarly low-integrity?

...I don't think Anthropic violated its deployment commitments. I mostly believe y'all about 2—I didn't know 2 until people asserted it right after the Claude 3 release, but I haven't been around the community, much less well-connected in it, for long—but that feels like an honest miscommunication to me. If I'm missing "past Anthropic commitments" please point to them.

Comment by Zach Stein-Perlman on Anthropic AI made the right call · 2024-04-15T00:59:29.463Z · LW · GW

Most of us agree with you that deploying Claude 3 was reasonable, although I for one disagree with your reasoning. The criticism was mostly about the release being (allegedly) inconsistent with Anthropic's past statements/commitments on releases.

[Edit: the link shows that most of us don't think deploying Claude 3 increased AI risk, not think deploying Claude 3 was reasonable.]

Comment by Zach Stein-Perlman on RTFB: On the New Proposed CAIP AI Bill · 2024-04-11T02:36:25.487Z · LW · GW

"turning out to be right" is CAIS's strength

This is CAIP, not CAIS; CAIP doesn't really have a track record yet.

Comment by Zach Stein-Perlman on RTFB: On the New Proposed CAIP AI Bill · 2024-04-10T19:06:31.250Z · LW · GW

In addition to the bill, CAIP has a short summary and a long summary.

Comment by Zach Stein-Perlman on metachirality's Shortform · 2024-04-01T03:43:01.140Z · LW · GW

Try, turn on anti-kibbitzer, sort comments by New

Comment by Zach Stein-Perlman on My simple AGI investment & insurance strategy · 2024-04-01T00:01:42.664Z · LW · GW

If bid-ask spreads are large, consider doing so less often + holding calls that expire at different times so that every time you roll you're only rolling half of your calls.
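A minimal sketch of the staggered-expiration idea (dates and tranche sizes are hypothetical; this assumes you roll only the expiring half of the position each period, so each roll pays the bid-ask spread on half the contracts):

```python
from datetime import date

# Two tranches of calls with offset expirations.
ladder = [
    {"expiry": date(2025, 1, 17), "contracts": 5},
    {"expiry": date(2026, 1, 16), "contracts": 5},
]

def roll_expiring(ladder, today, new_expiry):
    """Roll only the tranche nearest to expiry, leaving the other
    tranche untouched, so each roll crosses the bid-ask spread on
    just half the position."""
    ladder = sorted(ladder, key=lambda t: t["expiry"])
    nearest = ladder[0]
    if nearest["expiry"] <= today:
        nearest["expiry"] = new_expiry
    return ladder

ladder = roll_expiring(ladder, date(2025, 1, 17), date(2027, 1, 15))
print([t["expiry"].year for t in ladder])  # → [2027, 2026]
```

The near tranche is replaced with a new far-dated tranche while the other tranche keeps expiring on its own schedule, so you never roll the whole position at once.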

Comment by Zach Stein-Perlman on OpenAI: Facts from a Weekend · 2024-03-27T02:54:14.864Z · LW · GW

@gwern I've failed to find a source saying that Hydrazine invested in OpenAI. If it did, that would be a big deal; it would make this a lie.