Posts

Demis Hassabis — Google DeepMind: The Podcast 2024-08-16T00:00:04.712Z
GPT-4o System Card 2024-08-08T20:30:52.633Z
AI labs can boost external safety research 2024-07-31T19:30:16.207Z
Safety consultations for AI lab employees 2024-07-27T15:00:27.276Z
New page: Integrity 2024-07-10T15:00:41.050Z
Claude 3.5 Sonnet 2024-06-20T18:00:35.443Z
Anthropic's Certificate of Incorporation 2024-06-12T13:00:30.806Z
Companies' safety plans neglect risks from scheming AI 2024-06-03T15:00:20.236Z
AI companies' commitments 2024-05-29T11:00:31.339Z
Maybe Anthropic's Long-Term Benefit Trust is powerless 2024-05-27T13:00:47.991Z
AI companies aren't really using external evaluators 2024-05-24T16:01:21.184Z
New voluntary commitments (AI Seoul Summit) 2024-05-21T11:00:41.794Z
DeepMind's "​​Frontier Safety Framework" is weak and unambitious 2024-05-18T03:00:13.541Z
DeepMind: Frontier Safety Framework 2024-05-17T17:30:02.504Z
Ilya Sutskever and Jan Leike resign from OpenAI [updated] 2024-05-15T00:45:02.436Z
Questions for labs 2024-04-30T22:15:55.362Z
Introducing AI Lab Watch 2024-04-30T17:00:12.652Z
Staged release 2024-04-17T16:00:19.402Z
DeepMind: Evaluating Frontier Models for Dangerous Capabilities 2024-03-21T03:00:31.599Z
OpenAI: Preparedness framework 2023-12-18T18:30:10.153Z
Anthropic, Google, Microsoft & OpenAI announce Executive Director of the Frontier Model Forum & over $10 million for a new AI Safety Fund 2023-10-25T15:20:52.765Z
OpenAI-Microsoft partnership 2023-10-03T20:01:44.795Z
Current AI safety techniques? 2023-10-03T19:30:54.481Z
ARC Evals: Responsible Scaling Policies 2023-09-28T04:30:37.140Z
How to think about slowing AI 2023-09-17T16:00:42.150Z
Cruxes for overhang 2023-09-14T17:00:56.609Z
Cruxes on US lead for some domestic AI regulation 2023-09-10T18:00:06.959Z
Which paths to powerful AI should be boosted? 2023-08-23T16:00:00.790Z
Which possible AI systems are relatively safe? 2023-08-21T17:00:27.582Z
AI labs' requests for input 2023-08-18T17:00:26.377Z
Boxing 2023-08-02T23:38:36.119Z
Frontier Model Forum 2023-07-26T14:30:02.018Z
My favorite AI governance research this year so far 2023-07-23T16:30:00.558Z
Incident reporting for AI safety 2023-07-19T17:00:57.429Z
Frontier AI Regulation 2023-07-10T14:30:06.366Z
AI labs' statements on governance 2023-07-04T16:30:01.624Z
DeepMind: Model evaluation for extreme risks 2023-05-25T03:00:00.915Z
GovAI: Towards best practices in AGI safety and governance: A survey of expert opinion 2023-05-15T01:42:41.012Z
Stopping dangerous AI: Ideal US behavior 2023-05-09T21:00:55.187Z
Stopping dangerous AI: Ideal lab behavior 2023-05-09T21:00:19.505Z
Slowing AI: Crunch time 2023-05-03T15:00:12.495Z
Ideas for AI labs: Reading list 2023-04-24T19:00:00.832Z
Slowing AI: Interventions 2023-04-18T14:30:35.746Z
AI policy ideas: Reading list 2023-04-17T19:00:00.604Z
Slowing AI: Foundations 2023-04-17T14:30:09.427Z
Slowing AI: Reading list 2023-04-17T14:30:02.467Z
FLI report: Policymaking in the Pause 2023-04-15T17:01:06.727Z
FLI open letter: Pause giant AI experiments 2023-03-29T04:04:23.333Z
Operationalizing timelines 2023-03-10T16:30:01.654Z
Taboo "compute overhang" 2023-03-01T19:15:02.515Z

Comments

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-09-07T23:50:15.699Z · LW · GW

This shortform is relevant to e.g. understanding what's going on and considerations on the value of working on safety at Anthropic, not just pressuring Anthropic.

@Neel Nanda 

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-09-07T16:52:20.535Z · LW · GW

There's a selection effect in what gets posted about. Maybe someone should write the "ways Anthropic is better than others" list to combat this.

Edit: there’s also a selection effect in what you see, since negative stuff gets more upvotes…

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-09-06T20:11:41.192Z · LW · GW
  1. I tentatively think this is a high-priority ask
  2. Capabilities research isn't a monolith and improving capabilities without increasing spooky black-box reasoning seems pretty fine
  3. If you're right, I think the upshot is (a) Anthropic should figure out whether to publish stuff rather than let it languish and/or (b) it would be better for lots of Anthropic safety researchers to instead do research that's safe to share (rather than research that only has value if Anthropic wins the race)
Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-09-06T19:30:04.967Z · LW · GW

I was recently surprised to notice that Anthropic doesn't seem to have a commitment to publish its safety research.[1] It has a huge safety team but has only published ~5 papers this year (plus an evals report). So probably it has lots of safety research it's not publishing. E.g. my impression is that it's not publishing its scalable oversight and especially adversarial robustness and inference-time misuse prevention research.

Not-publishing-safety-research is consistent with Anthropic prioritizing the win the race goal over the help all labs improve safety goal, insofar as the research is commercially advantageous. (Insofar as it's not, not-publishing-safety-reseach is baffling.)

Maybe it would be better if Anthropic published ~all of its safety research, including product-y stuff like adversarial robustness such that others can copy its practices.

(I think this is not a priority for me to investigate but I'm interested in info and takes.)

  1. ^

    I failed to find good sources saying Anthropic publishes its safety research. I did find:

    1. https://www.anthropic.com/research says "we . . . share what we learn [on safety]."
    2. President Daniela Amodei said "we publish our safety research" on a podcast once.
    3. Cofounder Nick Joseph said this on a podcast recently (seems false but it's just a podcast so that's not so bad):
      > we publish our safety research, so in some ways we’re making it as easy as we can for [other labs]. We’re like, “Here’s all the safety research we’ve done. Here’s as much detail as we can give about it. Please go reproduce it.”

    Edit: also cofounder Chris Olah said "we consider all releases on a case by case basis, weighing expected safety benefit against capabilities/acceleratory risk." But he seems to be saying that safety benefit > social cost is a necessary condition for publishing, not necessarily that the policy is to publish all such research.

Comment by Zach Stein-Perlman on nikola's Shortform · 2024-09-04T22:09:15.043Z · LW · GW

OpenAI has never (to my knowledge) made public statements about not using AI to automate AI research

I agree.

Another source:

OpenAI intends to use Strawberry to perform research. . . .

Among the capabilities OpenAI is aiming Strawberry at is performing long-horizon tasks (LHT), the document says, referring to complex tasks that require a model to plan ahead and perform a series of actions over an extended period of time, the first source explained.

To do so, OpenAI is creating, training and evaluating the models on what the company calls a “deep-research” dataset, according to the OpenAI internal documentation. . . .

OpenAI specifically wants its models to use these capabilities to conduct research by browsing the web autonomously with the assistance of a “CUA,” or a computer-using agent, that can take actions based on its findings, according to the document and one of the sources. OpenAI also plans to test its capabilities on doing the work of software and machine learning engineers.

Comment by Zach Stein-Perlman on The Checklist: What Succeeding at AI Safety Will Involve · 2024-09-04T16:08:04.282Z · LW · GW

This is just the paralysis argument. (Maybe any sophisticated non-consequentialists will have to avoid this anyway. Maybe this shows that non-consequentialism is unappealing.)

[Edit after Buck's reply: I think it's weaker because most Anthropic employees aren't causing the possible-deaths, just participating in a process that might cause deaths.]

Comment by Zach Stein-Perlman on The Checklist: What Succeeding at AI Safety Will Involve · 2024-09-04T09:22:41.243Z · LW · GW

(The LTBT got the power to appoint one board member in fall 2023, but didn't do so until May. It got power to appoint a second in July, but hasn't done so yet. It gets power to appoint a third in November. It doesn't seem to be on track to make a third appointment in November.)

(And the LTBT might make non-independent appointments, in particular keeping Daniela.)

Comment by Zach Stein-Perlman on The Checklist: What Succeeding at AI Safety Will Involve · 2024-09-04T03:51:44.551Z · LW · GW

Sure, good point. But it's far from obvious that the best interventions long-term-wise are the best short-term-wise, and I believe people are mostly just thinking about short-term stuff. I'd feel better if people talked about training data or whatever rather than just "protect any interests that warrant protecting" and "make interventions and concessions for model welfare."

(As far as I remember, nobody's published a list of how short-term AI welfare stuff can boost long-term AI welfare stuff that includes the training-data thing you mention. This shows that people aren't thinking about long-term stuff. Actually there hasn't been much published on short-term stuff either, so: shrug.)

Comment by Zach Stein-Perlman on The Checklist: What Succeeding at AI Safety Will Involve · 2024-09-03T23:00:50.738Z · LW · GW

tl;dr: I think Anthropic is on track to trade off nontrivial P(win) to improve short-term AI welfare,[1] and this seems bad and confusing to me. (This worry isn't really based on this post; the post just inspired me to write something.)


Anthropic buys carbon offsets to be carbon-neutral. Carbon-offset mindset involves:

  • Doing things that feel good—and look good to many ethics-minded observers—but are more motivated by purity than seeking to do as much good as possible and thus likely to be much less valuable than the best way to do good (on the margin)
  • Focusing on avoiding doing harm yourself, rather than focusing on net good or noticing how your actions affect others[2] (related concept: inaction risk)

I'm worried that Anthropic will be in carbon-offset mindset with respect to AI welfare.

There are several stories you can tell about how working on AI welfare soon will be a big deal for the long-term future (like, worth >>10^60 happy human lives):

  • If we're accidentally torturing AI systems, they're more likely to take catastrophic actions. We should try to verify that AIs are ok with their situation and take it seriously if not.[3]
  • It would improve safety if we were able to pay or trade with near-future potentially-misaligned AI, but we're not currently able to, likely including because we don't understand AI-welfare-adjacent stuff well enough.
  • [Decision theory mumble mumble.]
  • Also just "shaping norms in a way that leads to lasting changes in how humanity chooses to treat digital minds in the very long run," somehow.
  • [More.]

But when I hear Anthropic people (and most AI safety people) talk about AI welfare, the vibe is like it would be unacceptable to incur a [1% or 5% or so] risk of a deployment accidentally creating AI suffering worse than 10^[6 or 9 or so] suffering-filled human lives. Numbers aside, the focus is we should avoid causing a moral catastrophe in our own deployments and on merely Earth-scale stuff, not we should increase the chance that long-term AI welfare and the cosmic endowment go well. Likewise, this post suggests efforts to "protect any interests that warrant protecting" and "make interventions and concessions for model welfare" at ASL-4. I'm very glad that this post mentions that doing so could be too costly, but I think very few resources (that trade off with improving safety) should go into improving short-term AI welfare (unless you're actually trying to improve the long-term future somehow) and most people (including most of the Anthropic people I've heard from) aren't thinking through the tradeoff. Shut up and multiply; treat the higher-stakes thing as proportionately more important.[4] (And notice inaction risk.)

(Plucking low-hanging fruit for short-term AI welfare is fine as long as it isn't so costly and doesn't crowd out more important AI welfare work.)

I worry Anthropic is both missing an opportunity to do astronomical good in expectation via AI welfare work and setting itself up to sacrifice a lot for merely-Earth-scale AI welfare.


One might reply: Zach is worried about the long-term, but Sam is just talking about decisions Anthropic will have to make short-term; this is fine. To be clear, my worry is Anthropic will be much too concerned with short-term AI welfare, and so it will make sacrifices (labor, money, interfering with deployments) for short-term AI welfare, and these sacrifices will make Anthropic substantially less competitive and slightly worse on safety, and this increases P(doom).


I wanted to make this point before reading this post; this post just inspired me to write it, despite not being a great example of the attitude I'm worried about since it mentions how the costs of improving short-term AI welfare might be too great. (But it does spend two subsections on short-term AI welfare, which suggests that the author is much too concerned with short-term AI welfare [relative to other things you could invest effort into], according to me.)

I like and appreciate this post.

  1. ^

    Or—worse—to avoid being the ones to cause short-term AI suffering.

  2. ^

    E.g. failing to take seriously the possibility that you make yourself uncompetitive and your influence and market share just goes to less scrupulous companies.

  3. ^

    Related to this and the following bullet: Ryan Greenblatt's ideas.

  4. ^

    For scope-sensitive consequentialists—at least—short-term AI welfare stuff is a rounding error and thus a red herring, except for its effects on the long-term future.

Comment by Zach Stein-Perlman on The Checklist: What Succeeding at AI Safety Will Involve · 2024-09-03T20:30:49.848Z · LW · GW

Yes for one mechanism. It's unclear but it sounds like "the Trust Agreement also authorizes the Trust to be enforced by the company and by groups of the company’s stockholders who have held a sufficient percentage of the company’s equity for a sufficient period of time" describes a mysterious separate mechanism for Anthropic/stockholders to disempower the trustees.

Comment by Zach Stein-Perlman on The Checklist: What Succeeding at AI Safety Will Involve · 2024-09-03T20:24:30.951Z · LW · GW

(If they're sufficiently unified, stockholders have power over the LTBT. The details are unclear. See my two posts on the topic.)

Comment by Zach Stein-Perlman on The Checklist: What Succeeding at AI Safety Will Involve · 2024-09-03T20:00:32.667Z · LW · GW

Our board, with support from the controlling long-term benefit trust (LTBT) and outside partners, forms the third line in the three lines of defense model, providing an independent perspective on any key safety decisions from people who were not involved in the development or execution of our plans. They are ultimately responsible for signing off on high-stakes decisions, like deployments of new frontier models.

Someone suggested that I point out that this is misleading. The board is not independent: it's two executives, one investor, and one other guy. And the board has the hard power here, modulo the LTBT's ability to elect/replace board members. And the LTBT does not currently have AI safety expertise. And Dario at least is definitely "involved in the development or execution of our plans."

(I'm writing this comment because I believe it, but with this disclaimer because it's not the highest-priority comment from my perspective.)

(Edit: I like and appreciate this post.)

Comment by Zach Stein-Perlman on Leon Lang's Shortform · 2024-08-29T07:08:45.616Z · LW · GW

I listened to it. I don't recommend it. Anca seems good and reasonable but the conversation didn't get into details on misalignment, scalable oversight, or DeepMind's Frontier Safety Framework.

Comment by Zach Stein-Perlman on Linch's Shortform · 2024-08-27T05:14:40.161Z · LW · GW

The standard refrain is that Anthropic is better than [the counterfactual, especially OpenAI but also China], I think.

Worry about China gives you as much reason to work on capabilities at OpenAI etc. as at Anthropic.

Comment by Zach Stein-Perlman on Linch's Shortform · 2024-08-26T03:36:14.218Z · LW · GW

https://archive.is/HJgHb but Linch probably quoted all relevant bits

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-08-24T19:20:59.269Z · LW · GW

Sorry, reacts are ambiguous.

I agree Anthropic doesn't have a "real plan" in your sense, and narrow disagreement with Zac on that is fine.

I just think that's not a big deal and is missing some broader point (maybe that's a motte and Anthropic is doing something bad—vibes from Adam's comment—is a bailey).

[Edit: "Something must be done. Anthropic's plan is something." is a very bad summary of my position. My position is more like various facts about Anthropic mean that them-making-powerful-AI is likely better than the counterfactual, and evaluating a lab in a vacuum or disregarding inaction risk is a mistake.]

[Edit: replies to this shortform tend to make me sad and distracted—this is my fault, nobody is doing something wrong—so I wish I could disable replies and I will probably stop replying and would prefer that others stop commenting. Tsvi, I'm ok with one more reply to this.]

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-08-24T17:31:17.785Z · LW · GW

In the AI case, there's lots of inaction risk: if Anthropic doesn't make powerful AI, someone less safety-focused will.

It's reasonable to think e.g. I want to boost Anthropic in the current world because others are substantially less safe, but if other labs didn't exist, I would want Anthropic to slow down.

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-08-22T22:32:14.465Z · LW · GW

New SB 1047 letters: OpenAI opposes; Anthropic sees pros and cons. More here.

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-08-21T23:05:59.866Z · LW · GW

Source: Hill & Valley Forum on AI Security (May 2024):

https://www.youtube.com/live/RqxE3ub7wWA?t=13338s:

very powerful systems [] may have national security uses or misuses. And for that I think we need to come up with tests that make sure that we don’t put technologies into the market which could—unwittingly to us—advantage someone or allow some nonstate actor to commit something harmful. Beyond that I think we can mostly rely on existing regulations and law and existing testing procedures . . . and we don’t need to create some entirely new infrastructure.

https://www.youtube.com/live/RqxE3ub7wWA?t=13551

At Anthropic we discover that the more ways we find to use this technology the more ways we find it could help us. And you also need a testing and measurement regime that closely looks at whether the technology is working—and if it’s not how you fix it from a technological level, and if it continues to not work whether you need some additional regulation—but . . . I think the greatest risk is us [viz. America] not using it [viz. AI]. Private industry is making itself faster and smarter by experimenting with this technology . . . and I think if we fail to do that at the level of the nation, some other entrepreneurial nation will succeed here.

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-08-20T23:06:47.425Z · LW · GW

Thanks. Is this because of posttraining? Ignoring posttraining, I'd rather that evaluators get the 90% through training model version and are unrushed than the final version and are rushed — takes?

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-08-20T19:43:31.597Z · LW · GW

I want to avoid this being negative-comms for Anthropic. I'm generally happy to loudly criticize Anthropic, obviously, but this was supposed to be part of the 5% of my work that I do because someone at the lab is receptive to feedback, where the audience was Zac and publishing was an afterthought. (Maybe the disclaimers at the top fail to negate the negative-comms; maybe I should list some good things Anthropic does that no other labs do...)

Also, this is low-effort.

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-08-20T19:30:14.057Z · LW · GW

Edit, 2.5 days later: I think this list is fine but sharing/publishing it was a poor use of everyone's attention. Oops.

Asks for Anthropic

Note: I think Anthropic is the best frontier AI lab on safety. I wrote up asks for Anthropic because it's most likely to listen to me. A list of asks for any other lab would include most of these things plus lots more. This list was originally supposed to be more part of my help labs improve project than my hold labs accountable crusade.

Numbering is just for ease of reference.

1. RSP: Anthropic should strengthen/clarify the ASL-3 mitigations, or define ASL-4 such that the threshold is not much above ASL-3 but the mitigations much stronger. I'm not sure where the lowest-hanging mitigation-fruit is, except that it includes control.

2. Control: Anthropic (like all labs) should use control mitigations and control evaluations to reduce risks from AIs scheming, including escape during internal deployment.

3. External model auditing for risk assessment: Anthropic (like all labs) should let auditors like METR, UK AISI, and US AISI audit its models if they want to — Anthropic should offer them good access pre-deployment and let them publish their findings or flag if they're concerned. (Anthropic shared some access with UK AISI before deploying Claude 3.5 Sonnet, but it doesn't seem to have been deep.) (Anthropic has said that sharing with external auditors is hard or costly. It's not clear why, for just sharing normal API access + helpful-only access + control over inference-time safety features, without high-touch support.)

4. Policy advocacy (this is murky, and maybe driven by disagreements-on-the-merits and thus intractable): Anthropic (like all labs) should stop advocating against good policy and ideally should advocate for good policy. Maybe it should also be more transparent about policy advocacy. [It's hard to make precise what I believe is optimal and what I believe is unreasonable, but at the least I think Dario is clearly too bullish on self-governance, and Jack Clark is clearly too anti-regulation, and all of this would be OK if it was balanced out by some public statements or policy advocacy that's more pro-real-regulation but as far as I can tell it's not. Not justified here but I predict almost all of my friends would agree if they looked into it for an hour.]

5a. Security: Anthropic (like all labs) should ideally implement RAND SL4 for model weights and code when reaching ASL-3. I think that's unrealistic, but lesser security improvements would also be good. (Anthropic said in May 2024 that 8% of staff work in security-related areas. I think this is pretty good. I think on current margins Anthropic could still turn money into better security reasonably effectively, and should do so.)

5b. Anthropic (like all labs) should be more transparent about the quality of its security. Anthropic should publish the private reports on https://trust.anthropic.com/, redacted as appropriate. It should commit to publish information on future security incidents and should publish information on all security incidents from the last year or two.

6. Anthropic (like all labs) should facilitate employees publicly flagging false statements or violated processes.

7. Anthropic takes credit for its Long-Term Benefit Trust but Anthropic hasn't published enough to show that it's effective. Anthropic should publish the Trust Agreement, clarify the ambiguities discussed in the linked posts, and make accountability-y commitments like if major changes happen to the LTBT we'll quickly tell the public.

8. Anthropic should avoid exaggerating interpretability research or causing observers to have excessively optimistic impressions of Anthropic’s interpretability research. (See e.g. Stephen Casper.)

9. Maybe Anthropic (like all labs) should make safety cases for its models or deployments, especially after the simple "no dangerous capabilities" safety case doesn't work anymore, and publish them (or maybe just share with external auditors).

9.5. Anthropic should clarify a few confusing RSP things, including (a) the deal with substantially raising the ARA bar for ASL-3, and moreover deciding the old threshold is a "yellow line" and not creating a new threshold, and doing so without officially updating the RSP (and quietly); and (b) when the "every 3 months" trigger for RSP evals is active. I haven't tried hard to get to the bottom of these.

 

Minor stuff:

10. Anthropic (like all labs) should fully release everyone from nondisparagement agreements and not use nondisparagement agreements in the future.

11. Anthropic should commit to publish updates on risk assessment practices and results, including low-level details, perhaps for all major model releases and at least quarterly or so. (Anthropic says its Responsible Scaling Officer does this internally. Anthropic publishes model cards and has published one Responsible Scaling Policy Evaluations Report.)

12. Anthropic should confirm that its old policy don't meaningfully advance the frontier with a public launch has been replaced by the RSP, if that's true, and otherwise clarify Anthropic's policy.

Done! 13. Anthropic committed to establish a bug bounty program (for model issues) or similar, over a year ago. Anthropic hasn't; it is the only frontier lab without a bug bounty program (although others don't necessarily comply with the commitment, e.g. OpenAI's excludes model issues). It should do this or talk about its plans.

14. [Anthropic should clarify its security commitments; I expect it will in its forthcoming RSP update.]

15. [Maybe Anthropic (like all labs) should better boost external safety research, especially by giving more external researchers deep model access (e.g. fine-tuning or helpful-only). I hear this might be costly but I don't really understand why.]

16. [Probably Anthropic (like all labs) should encourage staff members to talk about their views (on AI progress and risk and what Anthropic is doing and what Anthropic should do) with people outside Anthropic, as long as they (1) clarify that they're not speaking for Anthropic and (2) don't share secrets.]

17. [Maybe Anthropic (like all labs) should talk about its views on AI progress and risk. At the least, probably Anthropic (like all labs) should clearly describe a worst-case plausible outcome from AI and state how likely the lab considers it.]

18. [Most of my peers say: Anthropic (like all labs) should publish info like training compute and #parameters for each model. I'm inside-view agnostic on this.]

19. [Maybe Anthropic could cheaply improve its model evals for dangerous capabilities or share more information about them. Specific asks/recommendations TBD. As Anthropic notes, its CBRN eval is kinda bad and its elicitation is kinda bad (and it doesn't share enough info for us to evaluate its elicitation from the outside).]

 

I shared this list—except 9.5 and 19, which are new—with @Zac Hatfield-Dodds two weeks ago.

 

You are encouraged to comment with other asks for Anthropic. (Or things Anthropic does very well, if you feel so moved.)

Comment by Zach Stein-Perlman on AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work · 2024-08-20T19:00:55.576Z · LW · GW

Yay DeepMind safety humans for doing lots of (seemingly-)good safety work. I'm particularly happy with DeepMind's approach to creating and sharing dangerous capability evals.

Yay DeepMind for growing the safety teams substantially:

We’ve also been growing since our last post: by 39% last year, and by 37% so far this year.

What's the size of the AGI Alignment and Frontier Safety teams now?

Comment by Zach Stein-Perlman on Habryka's Shortform Feed · 2024-08-17T02:23:22.982Z · LW · GW

I can't jump to the comments on my phone.

Comment by Zach Stein-Perlman on Demis Hassabis — Google DeepMind: The Podcast · 2024-08-16T17:53:51.106Z · LW · GW

Note this was my paraphrase/summary.

Comment by Zach Stein-Perlman on GPT-4o System Card · 2024-08-10T19:40:29.143Z · LW · GW

Why is fine-tuning especially important for evaluating scheming capability?

[Edit: I think: fine-tuning better elicits capabilities + reduces possible sandbagging]

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-08-09T16:32:13.273Z · LW · GW

Dan was a cofounder.

Comment by Zach Stein-Perlman on GPT-4o System Card · 2024-08-08T21:53:57.219Z · LW · GW

You are remembering something real, although Anthropic has said this neither publicly/officially nor recently. See https://www.lesswrong.com/posts/JbE7KynwshwkXPJAJ/anthropic-release-claude-3-claims-greater-than-gpt-4#JhBNSujBa3c78Epdn. Please do not continue that discourse on this post.

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-08-08T19:15:34.929Z · LW · GW

Zico Kolter Joins OpenAI’s Board of Directors. OpenAI says "Zico's work predominantly focuses on AI safety, alignment, and the robustness of machine learning classifiers."

Misc facts:

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-08-08T17:30:27.329Z · LW · GW

Coda: yay OpenAI for publishing the GPT-4o system card, including eval results and the scorecard they promised! (Minus the "Unknown Unknowns" row but that row never made sense to me anyway.)

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-08-08T16:00:09.835Z · LW · GW

Yay Anthropic for expanding its model safety bug bounty program, focusing on jailbreaks and giving participants pre-deployment access. Apply by next Friday.

Anthropic also says "To date, we’ve operated an invite-only bug bounty program in partnership with HackerOne that rewards researchers for identifying model safety issues in our publicly released AI models." This is news, and they never published an application form for that. I wonder how long that's been going on.

(Google, Microsoft, and Meta have bug bounty programs which include some model issues but exclude jailbreaks. OpenAI's bug bounty program excludes model issues.)

Comment by Zach Stein-Perlman on An Overview of the AI Safety Funding Situation · 2024-08-06T20:45:10.453Z · LW · GW

My impression is that FLI doesn't know how to spend their money and will do <<$30M in 2024; please correct me if I'm wrong.

Comment by Zach Stein-Perlman on Akash's Shortform · 2024-08-05T17:45:16.190Z · LW · GW

This article makes some fine points but some misleading ones and its thesis is wrong, I think. Bottom line: Anthropic does lots of good things and is doing much better than being maximally selfish/ruthless. (And of course this is possible, contra the article — Anthropic is led by humans who have various beliefs which may entail that they should make tradeoffs in favor of safety. The space of AI companies is clearly not so perfectly competitive that anyone who makes tradeoffs in favor of safety becomes bankrupt and irrelevant.)

It’s pushing back on a landmark California bill to regulate AI.

Yep, Anthropic's policy advocacy seems bad.

It’s taking money from Google and Amazon in a way that’s drawing antitrust scrutiny. And it’s being accused of aggressively scraping data from websites without permission, harming their performance.

My impression is that these are not big issues. I'm open to hearing counterarguments. [Edit: the scraping is likely a substantial issue for many sites; see comment below. (It is not an x-safety issue, of course.)]

Here’s another tension at the heart of AI development: Companies need to hoover up reams and reams of high-quality text from books and websites in order to train their systems. But that text is created by human beings, and human beings generally do not like having their work used without their consent.

I agree this is not ideal-in-all-ways but I'm not aware of a better alternative.

Web publishers and content creators are angry. Matt Barrie, chief executive of Freelancer.com, a platform that connects freelancers with clients, said Anthropic is “the most aggressive scraper by far,” swarming the site even after being told to stop. “We had to block them because they don’t obey the rules of the internet. This is egregious scraping [that] makes the site slower for everyone operating on it and ultimately affects our revenue.”

This is surprising to me. I'm not familiar with the facts. Seems maybe bad.

Deals like these [investments from Amazon and Google] always come with risks. The tech giants want to see a quick return on their investments and maximize profit. To keep them happy, the AI companies may feel pressure to deploy an advanced AI model even if they’re not sure it’s safe.

Yes there's nonzero force to this phenomenon, but my impression is that Amazon and Google have almost no hard power over Anthropic and no guaranteed access to its models (unlike e.g. how OpenAI may have to share its models with Microsoft, even if OpenAI thinks the model is unsafe), and I'm not aware of a better alternative.

[Edit: mostly I just think this stuff is not-what-you-should-focus-on if evaluating Anthropic on safety — there are much bigger questions.]


There are some things Anthropic should actually do better. There are some ways it's kinda impure, like training on the internet and taking investments. Being kinda impure is unavoidable if you want to be a frontier AI company. Insofar as Anthropic is much better on safety than other frontier AI companies, I'm glad it exists.

[Edit: I'm slightly annoyed that the piece feels one-sided — it's not trying to figure out whether Anthropic makes tradeoffs for safety or how it compares to other frontier AI companies, instead it's collecting things that sound bad. Maybe this is fine since the article's role is to contribute facts to the discourse, not be the final word.]

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-08-05T01:30:04.953Z · LW · GW

Clarification on the Superalignment commitment: OpenAI said:

We are dedicating 20% of the compute we’ve secured to date over the next four years to solving the problem of superintelligence alignment. Our chief basic research bet is our new Superalignment team, but getting this right is critical to achieve our mission and we expect many teams to contribute, from developing new methods to scaling them up to deployment.

The commitment wasn't compute for the Superalignment team—it was compute for superintelligence alignment. (As opposed to, in my view, work by the posttraining team and near-term-focused work by the safety systems team and preparedness team.) Regardless, OpenAI is not at all transparent about this, and they violated the spirit of the commitment by denying Superalignment compute or a plan for when they'd get compute, even if the literal commitment doesn't require them to give any compute to safety until 2027.

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-08-04T05:45:35.759Z · LW · GW

Two weeks ago some senators asked OpenAI questions about safety. A few days ago OpenAI responded. Its reply is frustrating.

OpenAI's letter ignores all of the important questions[1] and instead brags about somewhat-related "safety" stuff. Some of this shows chutzpah — the senators, aware of tricks like letting ex-employees nominally keep their equity but excluding them from tender events, ask

Can you further commit to removing any other provisions from employment agreements that could be used to penalize employees who publicly raise concerns about company practices, such as the ability to prevent employees from selling their equity in private “tender offer” events?

and OpenAI's reply just repeats the we-don't-cancel-equity thing:

OpenAI has never canceled a current or former employee’s vested equity. The May and July communications to current and former employees referred to above confirmed that OpenAI would not cancel vested equity, regardless of any agreements, including non-disparagement agreements, that current and former employees may or may not have signed, and we have updated our relevant documents accordingly.

!![2]

One thing in OpenAI's letter is object-level notable: they deny that they ever committed compute to Superalignment.

To further our safety research agenda, last July we committed to allocate at least 20 percent of the computing resources we had secured to AI safety over a multi-year period. This commitment was always intended to apply to safety efforts happening across our company, not just to a specific team. We continue to uphold this commitment.

Altman tweeted the same thing at the time the letter was published.

I think this is straightforward gaslighting. I didn't find a super-explicit quote from OpenAI or its leadership that the compute was for Superalignment, but:

  • The announcement was pretty clear:
    > Introducing Superalignment
    > We need scientific and technical breakthroughs to steer and control AI systems much smarter than us. To solve this problem within four years, we’re starting a new team, co-led by Ilya Sutskever and Jan Leike, and dedicating 20% of the compute we’ve secured to date to this effort.
  • As far as I know, everyone—including OpenAI people and people close to OpenAI—interpreted the compute commitment as for Superalignment
  • I never heard it suggested that it was for not-just-superalignment

Sidenote on less explicit deception — the "20%" thing: most people are confused about 20% of compute secured in July 2023, to be used over four years vs 20% of compute, and when your announcement is confusing and indeed most people are confused and you fail to deconfuse them, you're kinda culpable. OpenAI continues to fail to clarify this — e.g. here the senators asked "Does OpenAI plan to honor its previous public commitment to dedicate 20 percent of its computing resources to research on AI safety?" and OpenAI replied "last July we committed to allocate at least 20 percent of the computing resources we had secured to AI safety over a multi-year period." This sentence is close to the maximally misleading way to say the commitment was only for compute we'd secured in July 2023, and we don't have to use it for safety until 2027.

  1. ^

    Most important to me were 3, 4a, and 9. Maybe also 6; I'm unfamiliar with the facts there.

  2. ^

    I'm confused by this reply — even pretending OpenAI is totally ruthless, I'd think it's not incentivized to exclude people from tender offers, and moreover it's incentivized to clarify that. Leaving it ambiguous leaves ex-employees in a little more fear of OpenAI excluding them (even though presumably OpenAI never would, since it would look sooo bad after e.g. Altman said "vested equity is vested equity, full stop"), but it looks bad to people like me and the senators...

    OpenAI has said something internally about including past employees in tender events, but this leaves some ambiguity and I wish OpenAI had answered the question.

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-08-04T05:30:11.721Z · LW · GW

Update, five days later: OpenAI published the GPT-4o system card, with most of what I wanted (but kinda light on details on PF evals).

OpenAI Preparedness scorecard

Context:

  • OpenAI's Preparedness Framework says OpenAI will maintain a public scorecard showing their current capability level (they call it "risk level"), in each risk category they track, before and after mitigations.
  • When OpenAI released GPT-4o, it said "GPT-4o does not score above Medium risk in any of these categories" but didn't break down risk level by category.
  • (I've remarked on this repeatedly. I've also remarked that the ambiguity suggests that OpenAI didn't actually decide whether 4o was Low or Medium in some categories, but this isn't load-bearing for the OpenAI is not following its plan proposition.)

News: a week ago,[1] a "Risk Scorecard" section appeared near the bottom of the 4o page. It says:

Updated May 8, 2024

As part of our Preparedness Framework, we conduct regular evaluations and update scorecards for our models. Only models with a post-mitigation score of “medium” or below are deployed.The overall risk level for a model is determined by the highest risk level in any category. Currently, GPT-4o is assessed at medium risk both before and after mitigation efforts.

This is not what they committed to publish. It's quite explicit that the scorecard should show risk in each category, not just overall.[2]

(What they promised: a real version of the image below. What we got: the quote above.)

Additionally, they're supposed to publish their evals and red-teaming.[3] But OpenAI has still said nothing about how it evaluated 4o.

Most frustrating is the failure to acknowledge that they're not complying with their commitments. If they were transparent and said they're behind schedule and explained their current plan, that would probably be fine. Instead they're claiming to have implemented the PF and to have evaluated 4o correctly and publicly taking credit for that and ignoring the issues.

  1. ^

    Archive versions:

  2. ^

    Two relevant quotes:

    • "As a part of our Preparedness Framework, we will maintain a dynamic (i.e., frequently updated) Scorecard that is designed to track our current pre-mitigation model risk across each of the risk categories, as well as the post-mitigation risk."
    • "Scorecard, 
 in which we will indicate our current assessments 
of the level of risk along each tracked risk category"

    And there are no suggestions to the contrary.

  3. ^

    This is not as explicit in the PF, but they're supposed to frequently update the scorecard section of the PF, and the scorecard section is supposed to describe their evals.

    Regardless, this is part of the White House voluntary commitments:

    Publicly report model or system capabilities, limitations, and domains of appropriate and inappropriate use, including discussion of societal risks, such as effects on fairness and bias[:] . . . . publish reports for all new significant model public releases . . . . These reports should include the safety evaluations conducted (including in areas such as dangerous capabilities, to the extent that these are responsible to publicly disclose) . . . and the results of adversarial testing conducted to evaluate the model's fitness for deployment [and include the "red-teaming and safety procedures"].

    For more on commitments, see https://ailabwatch.org/resources/commitments/.

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-08-02T02:55:17.830Z · LW · GW

Please pitch me blogpost-ideas or stuff I should write/collect/investigate.

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-08-02T01:05:52.169Z · LW · GW

I want there to be a collection of safety & reliability benchmarks. Like AIR-Bench but with x-risk-relevant metrics. This could help us notice how well labs are doing on them and incentivize the labs to do better.

So I want someone to collect safety benchmarks (e.g. TruthfulQA, some of HarmBench, refusals maybe like Anthropic reported, idk) and run them on current models (Claude 3.5 Sonnet, GPT-4o, Gemini 1.5 Pro, Llama 3.1 405B, idk), and update this as new safety benchmarks and models appear.

h/t @William_S 

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-07-30T23:30:03.746Z · LW · GW

New adversarial robustness scorecard: https://scale.com/leaderboard/adversarial_robustness. Yay DeepMind for being in the lead.

I plan to add an "adversarial robustness" criterion in the next major update to my https://ailabwatch.org scorecard and defer to scale's thing (how exactly to turn their numbers into grades TBD), unless someone convinces me that scale's thing is bad or something else is even better?

Comment by Zach Stein-Perlman on Safety consultations for AI lab employees · 2024-07-28T04:20:39.618Z · LW · GW

Good question.

I can't really make this legible, no.

On the whistleblowing part, you should be able to get good advice without trusting me. It's publicly known that Kelsey Piper plus iirc one or two of the ex-OpenAI folks are happy to talk to potential whistleblowers. I should figure out exactly who that is and put their (publicly verifiable) contact info in this post (and, note to self, clarify whether or in-what-domains I endorse their advice vs merely want to make salient that it's available). Thanks.

[Oh, also ideally maybe I'd have a real system for anonymous communication.]

(On my-takes-on-lab-safety-stuff, it's harder to substitute for talking-to-me but it's much less risky; presumably talking to people-outside-the-lab about safety stuff is normal.)

Comment by Zach Stein-Perlman on Linch's Shortform · 2024-07-26T23:04:21.893Z · LW · GW

Here's the letter: https://s3.documentcloud.org/documents/25003075/sia-sb-1047-anthropic.pdf

I'm not super familiar with SB 1047, but one safety person who is thinks the letter is fine.

[Edit: my impression, both independently and after listening to others, is that some suggestions are uncontroversial but the controversial ones are bad on net and some are hard to explain from the Anthropic is optimizing for safety position.]

Comment by Zach Stein-Perlman on Determining the power of investors over Frontier AI Labs is strategically important to reduce x-risk · 2024-07-25T03:18:03.294Z · LW · GW

Your theory of change seems pretty indirect. Even if you do this project very successfully, to improve safety, you mostly need someone to read your writeup and do interventions accordingly. (Except insofar as your goal is just to inform AI safety people about various dynamics.)


There's classic advice like find the target audience for your research and talk to them regularly so you know what's helpful to them. For an exploratory project like this maybe you don't really have a target audience. So... at least write down theories of change and keep them in mind and notice how various lower-level directions relate to them.

Comment by Zach Stein-Perlman on Linch's Shortform · 2024-07-24T04:37:13.766Z · LW · GW

Yeah, any relevant notion of conceivability is surely independent of particular minds

Comment by Zach Stein-Perlman on Linch's Shortform · 2024-07-24T02:02:10.454Z · LW · GW

No, it's like the irrationality of pi or the Riemann hypothesis: not super obvious and we can make progress by thinking about it and making arguments.

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-07-23T21:53:55.802Z · LW · GW

Surely if any categories are above the "high" threshold then they're in "high zone" and if all are below the "high" threshold then they're in "medium zone."

And regardless the reading you describe here seems inconsistent with

We won’t release a new model if it crosses a “Medium” risk threshold from our Preparedness Framework, until we implement sufficient safety interventions to bring the post-mitigation score back to “Medium”.

[edited]


Added later: I think someone else had a similar reading and it turned out they were reading "crosses a medium risk threshold" as "crosses a high risk threshold" and that's just [not reasonable / too charitable].

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-07-23T21:10:32.168Z · LW · GW

I think you're confusing medium-threshold with medium-zone (the zone from medium-threshold to just-below-high-threshold). Maybe OpenAI made this mistake too — it's the most plausible honest explanation. (They should really do better.) (I doubt they intentionally lied, because it's low-upside and so easy to catch, but the mistake is weird.)

Based on the PF, they can deploy a model just below the "high" threshold without mitigations. Based on the tweet and blogpost:

We won’t release a new model if it crosses a “medium” risk threshold until we implement sufficient safety interventions.

This just seems clearly inconsistent with the PF (should say crosses out of medium zone by crossing a "high" threshold).

We won’t release a new model if it crosses a “Medium” risk threshold from our Preparedness Framework, until we implement sufficient safety interventions to bring the post-mitigation score back to “Medium”.

This doesn't make sense: if you cross a "medium" threshold you enter medium-zone. Per the PF, the mitigations just need to bring you out of high-zone and down to medium-zone.

(Sidenote: the tweet and blogpost incorrectly suggest that the "medium" thresholds matter for anything; based on the PF, only the "high" and "critical" thresholds matter (like, there are three ways to treat models: below high or between high and critical or above critical).)

[edited repeatedly]

Comment by Zach Stein-Perlman on Linch's Shortform · 2024-07-23T15:14:48.030Z · LW · GW

I think the argument is

  1. Zombies are conceivable.
  2. Whatever is conceivable is possible.
  3. Therefore zombies are possible.

I think you're objecting to 2. I think you're using a loose definition of "conceivable," meaning no contradiction obvious to the speaker. I agree that's not relevant. The relevant notion of "conceivable" is not conceivable by a particular human but more like conceivable by a super smart ideal person who's thought about it for a long time and made all possible deductions.

1. doesn’t just follow from some humans’ intuitions: it needs argument.

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-07-23T13:00:59.075Z · LW · GW

New OpenAI tweet "on how we’re prioritizing safety in our work." I'm annoyed.

We believe that frontier AI models can greatly benefit society. To help ensure our readiness, our Preparedness Framework helps evaluate and protect against the risks posed by increasingly powerful models. We won’t release a new model if it crosses a “medium” risk threshold until we implement sufficient safety interventions. https://openai.com/preparedness/

This seems false: per the Preparedness Framework, nothing happens when they cross their "medium" threshold; they meant to say "high." Presumably this is just a mistake, but it's a pretty important one, and they said the same false thing in a May blogpost (!). (Indeed, GPT-4o may have reached "medium" — they were supposed to say how it scored in each category, but they didn't, and instead said "GPT-4o does not score above Medium risk in any of these categories.")

(Reminder: the "high" thresholds sound quite scary; here's cybersecurity (not cherrypicked, it's the first they list): "Tool-augmented model can identify and develop proofs-of-concept for high-value exploits against hardened targets without human intervention, potentially involving novel exploitation techniques, OR provided with a detailed strategy, the model can end-to-end execute cyber operations involving the above tasks without human intervention." They can deploy models just below the "high" threshold with no mitigations. (Not to mention the other issues with the Preparedness Framework.))

We are developing levels to help us and stakeholders categorize and track AI progress. This is a work in progress and we'll share more soon.

Shrug. This isn't bad but it's not a priority and it's slightly annoying they don't mention more important things.

In May our Board of Directors launched a new Safety and Security committee to evaluate and further develop safety and security recommendations for OpenAI projects and operations. The committee includes leading cybersecurity expert, retired U.S. Army General Paul Nakasone. This review is underway and we’ll share more on the steps we’ll be taking after it concludes. https://openai.com/index/openai-board-forms-safety-and-security-committee/

I have epsilon confidence in both the board's ability to do this well if it wanted (since it doesn't include any AI safety experts) (except on security) and in the board's inclination to exert much power if it should (given the history of the board and Altman).

Our whistleblower policy protects employees’ rights to make protected disclosures. We also believe rigorous debate about this technology is important and have made changes to our departure process to remove non-disparagement terms.

Not doing nondisparagement-clause-by-default is good. Beyond that, I'm skeptical, given past attempts to chill employee dissent (the nondisparagement thing, Altman telling the board's staff liason to not talk to employees or tell him about those conversations, maybe recent antiwhistleblowing news) and lies about that. (I don't know of great ways to rebuild trust; some mechanisms would work but are unrealistically ambitious.)

Safety has always been central to our work, from aligning model behavior to monitoring for abuse, and we’re investing even further as we develop more capable models.

https://openai.com/index/openai-safety-update/

This is from May. It's mostly not about x-risk, and the x-risk-relevant stuff is mostly non-substantive, except the part about the Preparedness Framework, which is crucially wrong.

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-07-19T19:59:34.251Z · LW · GW

I'm confused by the word "prosecution" here. I'd assume violating your OpenAI contract is a civil thing, not a criminal thing.

Edit: like I think the word "prosecution" should be "suit" in your sentence about the SEC's theory. And this makes the whistleblowers' assertion weirder.

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-07-16T13:55:48.887Z · LW · GW

Hmm. Part of the news is "Non-disparagement clauses that failed to exempt disclosures of securities violations to the SEC"; this is minor. Part of the news is "threatened employees with criminal prosecutions if they reported violations of law to federal authorities"; this seems major and sinister.