Posts

Model evals for dangerous capabilities 2024-09-23T11:00:00.866Z
OpenAI o1 2024-09-12T17:30:31.958Z
Demis Hassabis — Google DeepMind: The Podcast 2024-08-16T00:00:04.712Z
GPT-4o System Card 2024-08-08T20:30:52.633Z
AI labs can boost external safety research 2024-07-31T19:30:16.207Z
Safety consultations for AI lab employees 2024-07-27T15:00:27.276Z
New page: Integrity 2024-07-10T15:00:41.050Z
Claude 3.5 Sonnet 2024-06-20T18:00:35.443Z
Anthropic's Certificate of Incorporation 2024-06-12T13:00:30.806Z
Companies' safety plans neglect risks from scheming AI 2024-06-03T15:00:20.236Z
AI companies' commitments 2024-05-29T11:00:31.339Z
Maybe Anthropic's Long-Term Benefit Trust is powerless 2024-05-27T13:00:47.991Z
AI companies aren't really using external evaluators 2024-05-24T16:01:21.184Z
New voluntary commitments (AI Seoul Summit) 2024-05-21T11:00:41.794Z
DeepMind's "​​Frontier Safety Framework" is weak and unambitious 2024-05-18T03:00:13.541Z
DeepMind: Frontier Safety Framework 2024-05-17T17:30:02.504Z
Ilya Sutskever and Jan Leike resign from OpenAI [updated] 2024-05-15T00:45:02.436Z
Questions for labs 2024-04-30T22:15:55.362Z
Introducing AI Lab Watch 2024-04-30T17:00:12.652Z
Staged release 2024-04-17T16:00:19.402Z
DeepMind: Evaluating Frontier Models for Dangerous Capabilities 2024-03-21T03:00:31.599Z
OpenAI: Preparedness framework 2023-12-18T18:30:10.153Z
Anthropic, Google, Microsoft & OpenAI announce Executive Director of the Frontier Model Forum & over $10 million for a new AI Safety Fund 2023-10-25T15:20:52.765Z
OpenAI-Microsoft partnership 2023-10-03T20:01:44.795Z
Current AI safety techniques? 2023-10-03T19:30:54.481Z
ARC Evals: Responsible Scaling Policies 2023-09-28T04:30:37.140Z
How to think about slowing AI 2023-09-17T16:00:42.150Z
Cruxes for overhang 2023-09-14T17:00:56.609Z
Cruxes on US lead for some domestic AI regulation 2023-09-10T18:00:06.959Z
Which paths to powerful AI should be boosted? 2023-08-23T16:00:00.790Z
Which possible AI systems are relatively safe? 2023-08-21T17:00:27.582Z
AI labs' requests for input 2023-08-18T17:00:26.377Z
Boxing 2023-08-02T23:38:36.119Z
Frontier Model Forum 2023-07-26T14:30:02.018Z
My favorite AI governance research this year so far 2023-07-23T16:30:00.558Z
Incident reporting for AI safety 2023-07-19T17:00:57.429Z
Frontier AI Regulation 2023-07-10T14:30:06.366Z
AI labs' statements on governance 2023-07-04T16:30:01.624Z
DeepMind: Model evaluation for extreme risks 2023-05-25T03:00:00.915Z
GovAI: Towards best practices in AGI safety and governance: A survey of expert opinion 2023-05-15T01:42:41.012Z
Stopping dangerous AI: Ideal US behavior 2023-05-09T21:00:55.187Z
Stopping dangerous AI: Ideal lab behavior 2023-05-09T21:00:19.505Z
Slowing AI: Crunch time 2023-05-03T15:00:12.495Z
Ideas for AI labs: Reading list 2023-04-24T19:00:00.832Z
Slowing AI: Interventions 2023-04-18T14:30:35.746Z
AI policy ideas: Reading list 2023-04-17T19:00:00.604Z
Slowing AI: Foundations 2023-04-17T14:30:09.427Z
Slowing AI: Reading list 2023-04-17T14:30:02.467Z
FLI report: Policymaking in the Pause 2023-04-15T17:01:06.727Z
FLI open letter: Pause giant AI experiments 2023-03-29T04:04:23.333Z

Comments

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-10-09T01:00:13.295Z · LW · GW

I think this post was underrated; I look back at it frequently: AI labs can boost external safety research. (It got some downvotes but no comments — let me know if it's wrong/bad.)

[Edit: it was at 15 karma.]

Comment by Zach Stein-Perlman on Dan Braun's Shortform · 2024-10-05T20:09:51.107Z · LW · GW

I agree safety-by-control kinda requires good security. But safety-by-alignment kinda requires good security too.

Comment by Zach Stein-Perlman on If I have some money, whom should I donate it to in order to reduce expected P(doom) the most? · 2024-10-03T19:10:40.508Z · LW · GW

I am excited about donations to all of the following, in no particular order:

  • AI governance
    • GovAI (mostly research) [actually I haven't checked whether they're funding-constrained]
    • IAPS (mostly research)
    • Horizon (field-building)
    • CLTR (policy engagement)
  • LTFF/ARM
  • Lightcone

Comment by Zach Stein-Perlman on Base LLMs refuse too · 2024-09-29T23:16:32.251Z · LW · GW

Interesting. Thanks. (If there's a citation for this, I'd probably include it in my discussions of evals best practices.)

Hopefully evals almost never trigger "spontaneous sandbagging"? Hacking and bio capabilities and so forth are generally more like carjacking than torture.

Comment by Zach Stein-Perlman on Leon Lang's Shortform · 2024-09-29T22:02:52.842Z · LW · GW

(Just the justification, of course; fixed.)

Comment by Zach Stein-Perlman on Base LLMs refuse too · 2024-09-29T22:00:40.053Z · LW · GW

I agree noticing whether the model is refusing and, if so, bypassing refusals in some way is necessary for good evals (unless the refusal is super robust—such that you can depend on it for safety during deployment—and you're not worried about rogue deployments). But that doesn't require fine-tuning — possible alternatives include jailbreaking or few-shot prompting. Right?

(Fine-tuning is nice for eliciting stronger capabilities, but that's different.)

Comment by Zach Stein-Perlman on [Completed] The 2024 Petrov Day Scenario · 2024-09-28T18:22:35.496Z · LW · GW

I think I was thinking:

  1. The war room transcripts will leak publicly
  2. Generals can secretly DM each other, while keeping up appearances in the shared channels
  3. If a general believes that all of their communication with their team will leak, we'd be back to a unilateralist's curse situation: if a general thinks they should nuke, obviously they shouldn't say that to their team, so maybe they nuke unilaterally
    1. (Not obvious whether this is an infohazard)
  4. [Probably some true arguments about the payoff matrix and game theory increase P(mutual destruction). Also some false arguments about game theory — but maybe an infohazard warning makes those less likely to be posted too.]

(Also, after I became a general I observed that I didn't know what my "launch code" was; I was hoping the LW team forgot to give everyone launch codes and this decreased P(nukes); saying this would cause everyone to know their launch codes and maybe scare the other team.)

I don't think this is very relevant to real-world infohazards, because this is a game with explicit rules and because in the real world the low-hanging infohazards have been shared, but it seems relevant to mechanism design.

Comment by Zach Stein-Perlman on [Completed] The 2024 Petrov Day Scenario · 2024-09-27T18:30:59.210Z · LW · GW

Update with two new responses:

I think this is 10 generals, 1 Petrov, and 1 other person (either the other Petrov or a citizen; not sure, wasn't super rigorous).

Comment by Zach Stein-Perlman on [Completed] The 2024 Petrov Day Scenario · 2024-09-26T18:51:49.845Z · LW · GW

The post says generals' names will be published tomorrow.

Comment by Zach Stein-Perlman on [Completed] The 2024 Petrov Day Scenario · 2024-09-26T18:50:51.202Z · LW · GW

No. I noticed ~2 more subtle infohazards, was wishing for nobody to post them, and realized I could decrease that probability by making an infohazard warning.

If you notice subtle infohazards, I ask that you refrain from being the reason that security-by-obscurity fails.

Comment by Zach Stein-Perlman on [Completed] The 2024 Petrov Day Scenario · 2024-09-26T18:29:37.419Z · LW · GW

Some true observations are infohazards, making destruction more likely. Please think carefully before posting observations, even if you feel clever. You can post hashes here instead, to reveal later how clever you were, if you need to.

LOOSE LIPS SINK SHIPS
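(For the hash idea, here's a minimal commit-reveal sketch. It assumes Python's standard hashlib and secrets modules; the function names are just illustrative, and the salt is there so short observations can't be brute-forced from the posted hash.)

```python
import hashlib
import secrets

def commit(observation: str) -> tuple[str, str]:
    """Return (commitment, nonce). Post the commitment now; keep the nonce and observation private."""
    nonce = secrets.token_hex(16)  # random salt so short observations can't be brute-forced
    digest = hashlib.sha256((nonce + observation).encode()).hexdigest()
    return digest, nonce

def reveal_checks_out(observation: str, nonce: str, commitment: str) -> bool:
    """Later, post the observation and nonce; anyone can verify they match the earlier commitment."""
    return hashlib.sha256((nonce + observation).encode()).hexdigest() == commitment

# Usage:
# digest, nonce = commit("my clever observation")
# post `digest` now; after the game, post the observation and `nonce`
# assert reveal_checks_out("my clever observation", nonce, digest)
```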

Comment by Zach Stein-Perlman on [Completed] The 2024 Petrov Day Scenario · 2024-09-26T18:06:27.615Z · LW · GW

I think it’s better to be angry at the team that launched the nukes?

Comment by Zach Stein-Perlman on Mira Murati leaves OpenAI/ OpenAI to remove non-profit control · 2024-09-26T02:45:00.888Z · LW · GW

Two other leaders are also leaving.

Comment by Zach Stein-Perlman on Model evals for dangerous capabilities · 2024-09-24T04:51:47.985Z · LW · GW

No. But I'm skeptical: it seems hard to imagine provable safety at all, much less provable safety that's competitive with the default path to powerful AI, much less how post-hoc evals would be relevant.

Comment by Zach Stein-Perlman on Habryka's Shortform Feed · 2024-09-24T02:09:54.822Z · LW · GW

I ~always want to see the outline when I first open a post and when I'm reading/skimming through it. I wish the outline appeared even when I'm not hovering.

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-09-24T00:30:13.849Z · LW · GW

Figuring out whether an RSP is good is hard.[1] You need to consider what high-level capabilities/risks it applies to, and for each of them determine whether the evals successfully measure what they're supposed to, whether high-level risk thresholds trigger appropriate responses, and whether those thresholds are successfully operationalized in terms of low-level evals. Getting one of these things wrong can make your assessment totally wrong. Except that's just for fully-fleshed-out RSPs — in reality the labs haven't operationalized their high-level thresholds and sometimes don't really have responses planned. And different labs' RSPs have different structures/ontologies.

Quantitatively comparing RSPs in a somewhat-legible manner is even harder.

I am not enthusiastic about a recent paper outlining a rubric for evaluating RSPs. Mostly I worry that it crams all of the crucial things—is the response to reaching high capability-levels adequate? are the capability-levels low enough that we'll reach them before it's too late?—into a single criterion, "Credibility." Most of my concern about labs' RSPs comes from those things just being inadequate; again, if your response is too weak or your thresholds are too high or your evals are bad, it just doesn't matter how good the rest of your RSP is. (Also, minor: the "Difficulty" indicator punishes a lab for making ambitious commitments; this seems kinda backwards.)

(I gave up on making an RSP rubric myself because it seemed impossible to do decently well unless it's quite complex and has some way to evaluate hard-to-evaluate things like eval-quality and planned responses.)

(And maybe it's reasonable for labs to not do so much specific-committing-in-advance.)

  1. ^

    Well, sometimes figuring out that an RSP is bad is easy. So: determining that an RSP is good is hard. (Being good requires lots of factors to all be good; being bad requires just one to be bad.)

Comment by Zach Stein-Perlman on Model evals for dangerous capabilities · 2024-09-23T17:33:53.306Z · LW · GW

Stronger scaffolding could skew evaluation results.

Stronger scaffolding makes evals better.

I think labs should at least demonstrate that their scaffolding is at least as good as some baseline. If there's no established baseline scaffolding for the eval, they can say "we did XYZ and got n%; our secret scaffolding does better." When there is an established baseline scaffolding, they can compare to that (e.g. the best scaffold for SWE-bench Verified is called Agentless; in the o1 system card, OpenAI reported results from running its models in this scaffold).

Comment by Zach Stein-Perlman on Model evals for dangerous capabilities · 2024-09-23T11:00:13.443Z · LW · GW

Footnote to table cells D18 and D19:

My reading of Anthropic's ARA threshold, which nobody has yet contradicted:

  1. The RSP defines/operationalizes 50% of ARA tasks (10% of the time) as a sufficient condition for ASL-3. (Sidenote: I think the literal reading is that this is an ASL-3 definition, but I assume it's supposed to be an operationalization of the safety buffer, 6x below ASL-3.[1])
  2. The May RSP evals report suggests 50% of ARA tasks (10% of the time) is merely a yellow line. (Pages 2 and 6; also, page 4 says "ARA Yellow Lines are clearly defined in the RSP," but the RSP's ARA threshold was not just a yellow line.)
  3. This is not kosher; Anthropic needs to formally amend the RSP to [raise the threshold / make the old threshold no longer binding].

(It's totally plausible that the old threshold was too low, but the solution is to raise it officially, not pretend that it's just a yellow line.)

(The forthcoming RSP update will make old thresholds moot; I'm just concerned that Anthropic ignored the RSP in the past.)

(Anthropic didn't cross the threshold and so didn't violate the RSP — it just ignored the RSP by implying that if it crossed the threshold it wouldn't necessarily respond as required by the RSP.)

  1. ^

    Update: another part of the RSP says this threshold implements the safety buffer.

Comment by Zach Stein-Perlman on GPT-o1 · 2024-09-16T20:34:48.969Z · LW · GW

It's "o1" or "OpenAI o1," not "GPT-o1."

Comment by Zach Stein-Perlman on OpenAI o1 · 2024-09-12T23:51:33.825Z · LW · GW

They try to notice and work around spurious failures. Apparently 10 days was not long enough to resolve o1's spurious failures.

METR could not confidently upper-bound the capabilities of the models during the period they had model access, given the qualitatively strong reasoning and planning capabilities, substantial performance increases from a small amount of iteration on the agent scaffold, and the high rate of potentially fixable failures even after iteration.

(Plus maybe they try to do good elicitation in other ways that require iteration — I don't know.)

Comment by Zach Stein-Perlman on OpenAI o1 · 2024-09-12T19:03:01.745Z · LW · GW

Benchmark scores are here.

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-09-07T23:50:15.699Z · LW · GW

This shortform is relevant to e.g. understanding what's going on and to weighing the value of working on safety at Anthropic, not just to pressuring Anthropic.

@Neel Nanda 

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-09-07T16:52:20.535Z · LW · GW

There's a selection effect in what gets posted about. Maybe someone should write the "ways Anthropic is better than others" list to combat this.

Edit: there’s also a selection effect in what you see, since negative stuff gets more upvotes…

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-09-06T20:11:41.192Z · LW · GW

  1. I tentatively think this is a high-priority ask
  2. Capabilities research isn't a monolith and improving capabilities without increasing spooky black-box reasoning seems pretty fine
  3. If you're right, I think the upshot is (a) Anthropic should figure out whether to publish stuff rather than let it languish and/or (b) it would be better for lots of Anthropic safety researchers to instead do research that's safe to share (rather than research that only has value if Anthropic wins the race)

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-09-06T19:30:04.967Z · LW · GW

I was recently surprised to notice that Anthropic doesn't seem to have a commitment to publish its safety research.[1] It has a huge safety team but has only published ~5 papers this year (plus an evals report). So probably it has lots of safety research it's not publishing. E.g. my impression is that it's not publishing its scalable oversight and especially adversarial robustness and inference-time misuse prevention research.

Not-publishing-safety-research is consistent with Anthropic prioritizing the "win the race" goal over the "help all labs improve safety" goal, insofar as the research is commercially advantageous. (Insofar as it's not, not-publishing-safety-research is baffling.)

Maybe it would be better if Anthropic published ~all of its safety research, including product-y stuff like adversarial robustness, so that others can copy its practices.

(I think this is not a priority for me to investigate but I'm interested in info and takes.)

[Edit: in some cases you can achieve most of the benefit with little downside except losing commercial advantage by sharing your research/techniques with other labs, nonpublicly.]

  1. ^

    I failed to find good sources saying Anthropic publishes its safety research. I did find:

    1. https://www.anthropic.com/research says "we . . . share what we learn [on safety]."
    2. President Daniela Amodei said "we publish our safety research" on a podcast once.
    3. Edit: cofounder Chris Olah said "we plan to share the work that we do on safety with the world, because we ultimately just want to help people build safe models, and don’t want to hoard safety knowledge" on a podcast once.
    4. Cofounder Nick Joseph said this on a podcast recently (seems false but it's just a podcast so that's not so bad):
      > we publish our safety research, so in some ways we’re making it as easy as we can for [other labs]. We’re like, “Here’s all the safety research we’ve done. Here’s as much detail as we can give about it. Please go reproduce it.”

    Edit: also cofounder Chris Olah said "we consider all releases on a case by case basis, weighing expected safety benefit against capabilities/acceleratory risk." But he seems to be saying that safety benefit > social cost is a necessary condition for publishing, not necessarily that the policy is to publish all such research.

Comment by Zach Stein-Perlman on nikola's Shortform · 2024-09-04T22:09:15.043Z · LW · GW

OpenAI has never (to my knowledge) made public statements about not using AI to automate AI research

I agree.

Another source:

OpenAI intends to use Strawberry to perform research. . . .

Among the capabilities OpenAI is aiming Strawberry at is performing long-horizon tasks (LHT), the document says, referring to complex tasks that require a model to plan ahead and perform a series of actions over an extended period of time, the first source explained.

To do so, OpenAI is creating, training and evaluating the models on what the company calls a “deep-research” dataset, according to the OpenAI internal documentation. . . .

OpenAI specifically wants its models to use these capabilities to conduct research by browsing the web autonomously with the assistance of a “CUA,” or a computer-using agent, that can take actions based on its findings, according to the document and one of the sources. OpenAI also plans to test its capabilities on doing the work of software and machine learning engineers.

Comment by Zach Stein-Perlman on The Checklist: What Succeeding at AI Safety Will Involve · 2024-09-04T16:08:04.282Z · LW · GW

This is just the paralysis argument. (Maybe any sophisticated non-consequentialists will have to avoid this anyway. Maybe this shows that non-consequentialism is unappealing.)

[Edit after Buck's reply: I think it's weaker because most Anthropic employees aren't causing the possible-deaths, just participating in a process that might cause deaths.]

Comment by Zach Stein-Perlman on The Checklist: What Succeeding at AI Safety Will Involve · 2024-09-04T09:22:41.243Z · LW · GW

(The LTBT got the power to appoint one board member in fall 2023, but didn't do so until May. It got power to appoint a second in July, but hasn't done so yet. It gets power to appoint a third in November. It doesn't seem to be on track to make a third appointment in November.)

(And the LTBT might make non-independent appointments, in particular keeping Daniela.)

Comment by Zach Stein-Perlman on The Checklist: What Succeeding at AI Safety Will Involve · 2024-09-04T03:51:44.551Z · LW · GW

Sure, good point. But it's far from obvious that the best interventions long-term-wise are the best short-term-wise, and I believe people are mostly just thinking about short-term stuff. I'd feel better if people talked about training data or whatever rather than just "protect any interests that warrant protecting" and "make interventions and concessions for model welfare."

(As far as I remember, nobody's published a list of how short-term AI welfare stuff can boost long-term AI welfare stuff that includes the training-data thing you mention. This shows that people aren't thinking about long-term stuff. Actually there hasn't been much published on short-term stuff either, so: shrug.)

Comment by Zach Stein-Perlman on The Checklist: What Succeeding at AI Safety Will Involve · 2024-09-03T23:00:50.738Z · LW · GW

tl;dr: I think Anthropic is on track to trade off nontrivial P(win) to improve short-term AI welfare,[1] and this seems bad and confusing to me. (This worry isn't really based on this post; the post just inspired me to write something.)


Anthropic buys carbon offsets to be carbon-neutral. Carbon-offset mindset involves:

  • Doing things that feel good—and look good to many ethics-minded observers—but that are motivated more by purity than by seeking to do as much good as possible, and thus are likely to be much less valuable than the best way to do good (on the margin)
  • Focusing on avoiding doing harm yourself, rather than focusing on net good or noticing how your actions affect others[2] (related concept: neglecting inaction risk)

I'm worried that Anthropic will be in carbon-offset mindset with respect to AI welfare.

There are several stories you can tell about how working on AI welfare soon will be a big deal for the long-term future (like, worth >>10^60 happy human lives):

  • If we're accidentally torturing AI systems, they're more likely to take catastrophic actions. We should try to verify that AIs are ok with their situation and take it seriously if not.[3]
  • It would improve safety if we were able to pay or trade with near-future potentially-misaligned AI, but we're not currently able to, likely in part because we don't understand AI-welfare-adjacent stuff well enough.
  • [Decision theory mumble mumble.]
  • Also just "shaping norms in a way that leads to lasting changes in how humanity chooses to treat digital minds in the very long run," somehow.
  • [More.]

But when I hear Anthropic people (and most AI safety people) talk about AI welfare, the vibe is like it would be unacceptable to incur a [1% or 5% or so] risk of a deployment accidentally creating AI suffering worse than 10^[6 or 9 or so] suffering-filled human lives. Numbers aside, the focus is "we should avoid causing a moral catastrophe in our own deployments" and on merely Earth-scale stuff, not "we should increase the chance that long-term AI welfare and the cosmic endowment go well." Likewise, this post suggests efforts to "protect any interests that warrant protecting" and "make interventions and concessions for model welfare" at ASL-4. I'm very glad that this post mentions that doing so could be too costly, but I think very few resources (that trade off with improving safety) should go into improving short-term AI welfare (unless you're actually trying to improve the long-term future somehow), and most people (including most of the Anthropic people I've heard from) aren't thinking through the tradeoff. Shut up and multiply; treat the higher-stakes thing as proportionately more important.[4] (And notice inaction risk.)

(Plucking low-hanging fruit for short-term AI welfare is fine as long as it isn't so costly and doesn't crowd out more important AI welfare work.)

I worry Anthropic is both missing an opportunity to do astronomical good in expectation via AI welfare work and setting itself up to sacrifice a lot for merely-Earth-scale AI welfare.


One might reply: "Zach is worried about the long-term, but Sam is just talking about decisions Anthropic will have to make short-term; this is fine." To be clear, my worry is that Anthropic will be much too concerned with short-term AI welfare, and so it will make sacrifices (labor, money, interfering with deployments) for short-term AI welfare, and these sacrifices will make Anthropic substantially less competitive and slightly worse on safety, which increases P(doom).


I wanted to make this point before reading this post; the post just inspired me to write it, even though it's not a great example of the attitude I'm worried about, since it mentions that the costs of improving short-term AI welfare might be too great. (But it does spend two subsections on short-term AI welfare, which suggests that the author is much too concerned with short-term AI welfare [relative to other things you could invest effort into], according to me.)

I like and appreciate this post.

  1. ^

    Or—worse—to avoid being the ones to cause short-term AI suffering.

  2. ^

    E.g. failing to take seriously the possibility that you make yourself uncompetitive and your influence and market share just goes to less scrupulous companies.

  3. ^

    Related to this and the following bullet: Ryan Greenblatt's ideas.

  4. ^

    For scope-sensitive consequentialists—at least—short-term AI welfare stuff is a rounding error and thus a red herring, except for its effects on the long-term future.

Comment by Zach Stein-Perlman on The Checklist: What Succeeding at AI Safety Will Involve · 2024-09-03T20:30:49.848Z · LW · GW

Yes for one mechanism. It's unclear but it sounds like "the Trust Agreement also authorizes the Trust to be enforced by the company and by groups of the company’s stockholders who have held a sufficient percentage of the company’s equity for a sufficient period of time" describes a mysterious separate mechanism for Anthropic/stockholders to disempower the trustees.

Comment by Zach Stein-Perlman on The Checklist: What Succeeding at AI Safety Will Involve · 2024-09-03T20:24:30.951Z · LW · GW

(If they're sufficiently unified, stockholders have power over the LTBT. The details are unclear. See my two posts on the topic.)

Comment by Zach Stein-Perlman on The Checklist: What Succeeding at AI Safety Will Involve · 2024-09-03T20:00:32.667Z · LW · GW

Our board, with support from the controlling long-term benefit trust (LTBT) and outside partners, forms the third line in the three lines of defense model, providing an independent perspective on any key safety decisions from people who were not involved in the development or execution of our plans. They are ultimately responsible for signing off on high-stakes decisions, like deployments of new frontier models.

Someone suggested that I point out that this is misleading. The board is not independent: it's two executives, one investor, and one other guy. And the board has the hard power here, modulo the LTBT's ability to elect/replace board members. And the LTBT does not currently have AI safety expertise. And Dario at least is definitely "involved in the development or execution of our plans."

(I'm writing this comment because I believe it, but with this disclaimer because it's not the highest-priority comment from my perspective.)

(Edit: I like and appreciate this post.)

Comment by Zach Stein-Perlman on Leon Lang's Shortform · 2024-08-29T07:08:45.616Z · LW · GW

I listened to it. I don't recommend it. Anca seems good and reasonable but the conversation didn't get into details on misalignment, scalable oversight, or DeepMind's Frontier Safety Framework.

Comment by Zach Stein-Perlman on Linch's Shortform · 2024-08-27T05:14:40.161Z · LW · GW

The standard refrain is that Anthropic is better than [the counterfactual, especially OpenAI but also China], I think.

Worry about China gives you as much reason to work on capabilities at OpenAI etc. as at Anthropic.

Comment by Zach Stein-Perlman on Linch's Shortform · 2024-08-26T03:36:14.218Z · LW · GW

https://archive.is/HJgHb but Linch probably quoted all relevant bits

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-08-24T19:20:59.269Z · LW · GW

Sorry, reacts are ambiguous.

I agree Anthropic doesn't have a "real plan" in your sense, and narrow disagreement with Zac on that is fine.

I just think that's not a big deal and is missing some broader point (maybe that's a motte, and "Anthropic is doing something bad"—the vibe from Adam's comment—is a bailey).

[Edit: "Something must be done. Anthropic's plan is something." is a very bad summary of my position. My position is more like various facts about Anthropic mean that them-making-powerful-AI is likely better than the counterfactual, and evaluating a lab in a vacuum or disregarding inaction risk is a mistake.]

[Edit: replies to this shortform tend to make me sad and distracted—this is my fault, nobody is doing something wrong—so I wish I could disable replies and I will probably stop replying and would prefer that others stop commenting. Tsvi, I'm ok with one more reply to this.]

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-08-24T17:31:17.785Z · LW · GW

In the AI case, there's lots of inaction risk: if Anthropic doesn't make powerful AI, someone less safety-focused will.

It's reasonable to think, e.g., "I want to boost Anthropic in the current world because others are substantially less safe, but if other labs didn't exist, I would want Anthropic to slow down."

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-08-22T22:32:14.465Z · LW · GW

New SB 1047 letters: OpenAI opposes; Anthropic sees pros and cons. More here.

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-08-21T23:05:59.866Z · LW · GW

Source: Hill & Valley Forum on AI Security (May 2024):

https://www.youtube.com/live/RqxE3ub7wWA?t=13338s:

very powerful systems [] may have national security uses or misuses. And for that I think we need to come up with tests that make sure that we don’t put technologies into the market which could—unwittingly to us—advantage someone or allow some nonstate actor to commit something harmful. Beyond that I think we can mostly rely on existing regulations and law and existing testing procedures . . . and we don’t need to create some entirely new infrastructure.

https://www.youtube.com/live/RqxE3ub7wWA?t=13551

At Anthropic we discover that the more ways we find to use this technology the more ways we find it could help us. And you also need a testing and measurement regime that closely looks at whether the technology is working—and if it’s not how you fix it from a technological level, and if it continues to not work whether you need some additional regulation—but . . . I think the greatest risk is us [viz. America] not using it [viz. AI]. Private industry is making itself faster and smarter by experimenting with this technology . . . and I think if we fail to do that at the level of the nation, some other entrepreneurial nation will succeed here.

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-08-20T23:06:47.425Z · LW · GW

Thanks. Is this because of posttraining? Ignoring posttraining, I'd rather that evaluators get the 90%-of-the-way-through-training version of the model and be unrushed than get the final version and be rushed — takes?

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-08-20T19:43:31.597Z · LW · GW

I want to avoid this being negative-comms for Anthropic. I'm generally happy to loudly criticize Anthropic, obviously, but this was supposed to be part of the 5% of my work that I do because someone at the lab is receptive to feedback, where the audience was Zac and publishing was an afterthought. (Maybe the disclaimers at the top fail to negate the negative-comms; maybe I should list some good things Anthropic does that no other labs do...)

Also, this is low-effort.

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-08-20T19:30:14.057Z · LW · GW

Edit, 2.5 days later: I think this list is fine but sharing/publishing it was a poor use of everyone's attention. Oops.

Asks for Anthropic

Note: I think Anthropic is the best frontier AI lab on safety. I wrote up asks for Anthropic because it's most likely to listen to me. A list of asks for any other lab would include most of these things plus lots more. This list was originally supposed to be more part of my "help labs improve" project than my "hold labs accountable" crusade.

Numbering is just for ease of reference.

1. RSP: Anthropic should strengthen/clarify the ASL-3 mitigations, or define ASL-4 such that the threshold is not much above ASL-3 but the mitigations much stronger. I'm not sure where the lowest-hanging mitigation-fruit is, except that it includes control.

2. Control: Anthropic (like all labs) should use control mitigations and control evaluations to reduce risks from AIs scheming, including escape during internal deployment.

3. External model auditing for risk assessment: Anthropic (like all labs) should let auditors like METR, UK AISI, and US AISI audit its models if they want to — Anthropic should offer them good access pre-deployment and let them publish their findings or flag if they're concerned. (Anthropic shared some access with UK AISI before deploying Claude 3.5 Sonnet, but it doesn't seem to have been deep.) (Anthropic has said that sharing with external auditors is hard or costly. It's not clear why, for just sharing normal API access + helpful-only access + control over inference-time safety features, without high-touch support.)

4. Policy advocacy (this is murky, and maybe driven by disagreements-on-the-merits and thus intractable): Anthropic (like all labs) should stop advocating against good policy and ideally should advocate for good policy. Maybe it should also be more transparent about policy advocacy. [It's hard to make precise what I believe is optimal and what I believe is unreasonable, but at the least I think Dario is clearly too bullish on self-governance, and Jack Clark is clearly too anti-regulation, and all of this would be OK if it was balanced out by some public statements or policy advocacy that's more pro-real-regulation but as far as I can tell it's not. Not justified here but I predict almost all of my friends would agree if they looked into it for an hour.]

5a. Security: Anthropic (like all labs) should ideally implement RAND SL4 for model weights and code when reaching ASL-3. I think that's unrealistic, but lesser security improvements would also be good. (Anthropic said in May 2024 that 8% of staff work in security-related areas. I think this is pretty good. I think on current margins Anthropic could still turn money into better security reasonably effectively, and should do so.)

5b. Anthropic (like all labs) should be more transparent about the quality of its security. Anthropic should publish the private reports on https://trust.anthropic.com/, redacted as appropriate. It should commit to publish information on future security incidents and should publish information on all security incidents from the last year or two.

6. Anthropic (like all labs) should facilitate employees publicly flagging false statements or violated processes.

7. Anthropic takes credit for its Long-Term Benefit Trust but Anthropic hasn't published enough to show that it's effective. Anthropic should publish the Trust Agreement, clarify the ambiguities discussed in the linked posts, and make accountability-y commitments like "if major changes happen to the LTBT, we'll quickly tell the public."

8. Anthropic should avoid exaggerating interpretability research or causing observers to have excessively optimistic impressions of Anthropic’s interpretability research. (See e.g. Stephen Casper.)

9. Maybe Anthropic (like all labs) should make safety cases for its models or deployments, especially after the simple "no dangerous capabilities" safety case doesn't work anymore, and publish them (or maybe just share with external auditors).

9.5. Anthropic should clarify a few confusing RSP things, including (a) the deal with substantially raising the ARA bar for ASL-3, and moreover deciding the old threshold is a "yellow line" and not creating a new threshold, and doing so without officially updating the RSP (and quietly); and (b) when the "every 3 months" trigger for RSP evals is active. I haven't tried hard to get to the bottom of these.

 

Minor stuff:

10. Anthropic (like all labs) should fully release everyone from nondisparagement agreements and not use nondisparagement agreements in the future.

11. Anthropic should commit to publish updates on risk assessment practices and results, including low-level details, perhaps for all major model releases and at least quarterly or so. (Anthropic says its Responsible Scaling Officer does this internally. Anthropic publishes model cards and has published one Responsible Scaling Policy Evaluations Report.)

12. Anthropic should confirm that its old policy "don't meaningfully advance the frontier with a public launch" has been replaced by the RSP, if that's true, and otherwise clarify Anthropic's policy.

Done! 13. Anthropic committed to establish a bug bounty program (for model issues) or similar, over a year ago. Anthropic hasn't; it is the only frontier lab without a bug bounty program (although others don't necessarily comply with the commitment, e.g. OpenAI's excludes model issues). It should do this or talk about its plans.

14. [Anthropic should clarify its security commitments; I expect it will in its forthcoming RSP update.]

15. [Maybe Anthropic (like all labs) should better boost external safety research, especially by giving more external researchers deep model access (e.g. fine-tuning or helpful-only). I hear this might be costly but I don't really understand why.]

16. [Probably Anthropic (like all labs) should encourage staff members to talk about their views (on AI progress and risk and what Anthropic is doing and what Anthropic should do) with people outside Anthropic, as long as they (1) clarify that they're not speaking for Anthropic and (2) don't share secrets.]

17. [Maybe Anthropic (like all labs) should talk about its views on AI progress and risk. At the least, probably Anthropic (like all labs) should clearly describe a worst-case plausible outcome from AI and state how likely the lab considers it.]

18. [Most of my peers say: Anthropic (like all labs) should publish info like training compute and #parameters for each model. I'm inside-view agnostic on this.]

19. [Maybe Anthropic could cheaply improve its model evals for dangerous capabilities or share more information about them. Specific asks/recommendations TBD. As Anthropic notes, its CBRN eval is kinda bad and its elicitation is kinda bad (and it doesn't share enough info for us to evaluate its elicitation from the outside).]

 

I shared this list—except 9.5 and 19, which are new—with @Zac Hatfield-Dodds two weeks ago.

 

You are encouraged to comment with other asks for Anthropic. (Or things Anthropic does very well, if you feel so moved.)

Comment by Zach Stein-Perlman on AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work · 2024-08-20T19:00:55.576Z · LW · GW

Yay DeepMind safety humans for doing lots of (seemingly-)good safety work. I'm particularly happy with DeepMind's approach to creating and sharing dangerous capability evals.

Yay DeepMind for growing the safety teams substantially:

We’ve also been growing since our last post: by 39% last year, and by 37% so far this year.

What are the sizes of the AGI Alignment and Frontier Safety teams now?

Comment by Zach Stein-Perlman on Habryka's Shortform Feed · 2024-08-17T02:23:22.982Z · LW · GW

I can't jump to the comments on my phone.

Comment by Zach Stein-Perlman on Demis Hassabis — Google DeepMind: The Podcast · 2024-08-16T17:53:51.106Z · LW · GW

Note this was my paraphrase/summary.

Comment by Zach Stein-Perlman on GPT-4o System Card · 2024-08-10T19:40:29.143Z · LW · GW

Why is fine-tuning especially important for evaluating scheming capability?

[Edit: I think: fine-tuning better elicits capabilities + reduces possible sandbagging]

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-08-09T16:32:13.273Z · LW · GW

Dan was a cofounder.

Comment by Zach Stein-Perlman on GPT-4o System Card · 2024-08-08T21:53:57.219Z · LW · GW

You are remembering something real, although Anthropic has said this neither publicly/officially nor recently. See https://www.lesswrong.com/posts/JbE7KynwshwkXPJAJ/anthropic-release-claude-3-claims-greater-than-gpt-4#JhBNSujBa3c78Epdn. Please do not continue that discourse on this post.

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-08-08T19:15:34.929Z · LW · GW

Zico Kolter Joins OpenAI’s Board of Directors. OpenAI says "Zico's work predominantly focuses on AI safety, alignment, and the robustness of machine learning classifiers."

Misc facts: