Posts

New page: Integrity 2024-07-10T15:00:41.050Z
Claude 3.5 Sonnet 2024-06-20T18:00:35.443Z
Anthropic's Certificate of Incorporation 2024-06-12T13:00:30.806Z
Companies' safety plans neglect risks from scheming AI 2024-06-03T15:00:20.236Z
AI companies' commitments 2024-05-29T11:00:31.339Z
Maybe Anthropic's Long-Term Benefit Trust is powerless 2024-05-27T13:00:47.991Z
AI companies aren't really using external evaluators 2024-05-24T16:01:21.184Z
New voluntary commitments (AI Seoul Summit) 2024-05-21T11:00:41.794Z
DeepMind's "Frontier Safety Framework" is weak and unambitious 2024-05-18T03:00:13.541Z
DeepMind: Frontier Safety Framework 2024-05-17T17:30:02.504Z
Ilya Sutskever and Jan Leike resign from OpenAI [updated] 2024-05-15T00:45:02.436Z
Questions for labs 2024-04-30T22:15:55.362Z
Introducing AI Lab Watch 2024-04-30T17:00:12.652Z
Staged release 2024-04-17T16:00:19.402Z
DeepMind: Evaluating Frontier Models for Dangerous Capabilities 2024-03-21T03:00:31.599Z
OpenAI: Preparedness framework 2023-12-18T18:30:10.153Z
Anthropic, Google, Microsoft & OpenAI announce Executive Director of the Frontier Model Forum & over $10 million for a new AI Safety Fund 2023-10-25T15:20:52.765Z
OpenAI-Microsoft partnership 2023-10-03T20:01:44.795Z
Current AI safety techniques? 2023-10-03T19:30:54.481Z
ARC Evals: Responsible Scaling Policies 2023-09-28T04:30:37.140Z
How to think about slowing AI 2023-09-17T16:00:42.150Z
Cruxes for overhang 2023-09-14T17:00:56.609Z
Cruxes on US lead for some domestic AI regulation 2023-09-10T18:00:06.959Z
Which paths to powerful AI should be boosted? 2023-08-23T16:00:00.790Z
Which possible AI systems are relatively safe? 2023-08-21T17:00:27.582Z
AI labs' requests for input 2023-08-18T17:00:26.377Z
Boxing 2023-08-02T23:38:36.119Z
Frontier Model Forum 2023-07-26T14:30:02.018Z
My favorite AI governance research this year so far 2023-07-23T16:30:00.558Z
Incident reporting for AI safety 2023-07-19T17:00:57.429Z
Frontier AI Regulation 2023-07-10T14:30:06.366Z
AI labs' statements on governance 2023-07-04T16:30:01.624Z
DeepMind: Model evaluation for extreme risks 2023-05-25T03:00:00.915Z
GovAI: Towards best practices in AGI safety and governance: A survey of expert opinion 2023-05-15T01:42:41.012Z
Stopping dangerous AI: Ideal US behavior 2023-05-09T21:00:55.187Z
Stopping dangerous AI: Ideal lab behavior 2023-05-09T21:00:19.505Z
Slowing AI: Crunch time 2023-05-03T15:00:12.495Z
Ideas for AI labs: Reading list 2023-04-24T19:00:00.832Z
Slowing AI: Interventions 2023-04-18T14:30:35.746Z
AI policy ideas: Reading list 2023-04-17T19:00:00.604Z
Slowing AI: Foundations 2023-04-17T14:30:09.427Z
Slowing AI: Reading list 2023-04-17T14:30:02.467Z
FLI report: Policymaking in the Pause 2023-04-15T17:01:06.727Z
FLI open letter: Pause giant AI experiments 2023-03-29T04:04:23.333Z
Operationalizing timelines 2023-03-10T16:30:01.654Z
Taboo "compute overhang" 2023-03-01T19:15:02.515Z
The public supports regulating AI for safety 2023-02-17T04:10:03.307Z
Framing AI strategy 2023-02-07T19:20:04.535Z
AI safety milestones? 2023-01-23T21:00:24.441Z
Sealed predictions thread 2022-05-07T18:00:04.705Z

Comments

Comment by Zach Stein-Perlman on Linch's Shortform · 2024-07-26T23:04:21.893Z · LW · GW

Here's the letter: https://s3.documentcloud.org/documents/25003075/sia-sb-1047-anthropic.pdf

I'm not super familiar with SB 1047, but one safety person who is familiar with it thinks the letter is fine.

Comment by Zach Stein-Perlman on Determining the power of investors over Frontier AI Labs is strategically important to reduce x-risk · 2024-07-25T03:18:03.294Z · LW · GW

Your theory of change seems pretty indirect. Even if you do this project very successfully, improving safety still mostly requires someone to read your writeup and do interventions accordingly. (Except insofar as your goal is just to inform AI safety people about various dynamics.)


There's classic advice like "find the target audience for your research and talk to them regularly so you know what's helpful to them." For an exploratory project like this, maybe you don't really have a target audience. So... at least write down theories of change, keep them in mind, and notice how various lower-level directions relate to them.

Comment by Zach Stein-Perlman on Linch's Shortform · 2024-07-24T04:37:13.766Z · LW · GW

Yeah, any relevant notion of conceivability is surely independent of particular minds.

Comment by Zach Stein-Perlman on Linch's Shortform · 2024-07-24T02:02:10.454Z · LW · GW

No, it's like the irrationality of pi or the Riemann hypothesis: not super obvious and we can make progress by thinking about it and making arguments.

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-07-23T21:53:55.802Z · LW · GW

Surely if any category is above the "high" threshold then the model is in "high zone," and if all categories are below the "high" threshold then it's in "medium zone."

And regardless the reading you describe here seems inconsistent with

We won’t release a new model if it crosses a “Medium” risk threshold from our Preparedness Framework, until we implement sufficient safety interventions to bring the post-mitigation score back to “Medium”.

[edited]

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-07-23T21:10:32.168Z · LW · GW

I think you're confusing medium-threshold with medium-zone (the zone from medium-threshold to just-below-high-threshold). Maybe OpenAI made this mistake too — it's the most plausible honest explanation. (They should really do better.) (I doubt they intentionally lied, because it's low-upside and so easy to catch, but the mistake is weird.)

Based on the PF, they can deploy a model just below the "high" threshold without mitigations. Based on the tweet and blogpost:

We won’t release a new model if it crosses a “medium” risk threshold until we implement sufficient safety interventions.

This just seems clearly inconsistent with the PF (it should say the model crosses out of medium zone by crossing a "high" threshold).

We won’t release a new model if it crosses a “Medium” risk threshold from our Preparedness Framework, until we implement sufficient safety interventions to bring the post-mitigation score back to “Medium”.

This doesn't make sense: if you cross a "medium" threshold you enter medium-zone. Per the PF, the mitigations just need to bring you out of high-zone and down to medium-zone.

(Sidenote: the tweet and blogpost incorrectly suggest that the "medium" thresholds matter for anything; based on the PF, only the "high" and "critical" thresholds matter (like, there are three ways to treat models: below high or between high and critical or above critical).)

[edited repeatedly]

Comment by Zach Stein-Perlman on Linch's Shortform · 2024-07-23T15:14:48.030Z · LW · GW

I think the argument is

  1. Zombies are conceivable.
  2. Whatever is conceivable is possible.
  3. Therefore zombies are possible.

I think you're objecting to 2. I think you're using a loose definition of "conceivable," meaning something like "no contradiction is obvious to the speaker." I agree that's not relevant. The relevant notion of "conceivable" is not conceivable-by-a-particular-human but more like conceivable by a super smart ideal reasoner who's thought about it for a long time and made all possible deductions.

Premise 1 doesn't just follow from some humans' intuitions; it needs argument.

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-07-23T13:00:59.075Z · LW · GW

New OpenAI tweet "on how we’re prioritizing safety in our work." I'm annoyed.

We believe that frontier AI models can greatly benefit society. To help ensure our readiness, our Preparedness Framework helps evaluate and protect against the risks posed by increasingly powerful models. We won’t release a new model if it crosses a “medium” risk threshold until we implement sufficient safety interventions. https://openai.com/preparedness/

This seems false: per the Preparedness Framework, nothing happens when they cross their "medium" threshold; they meant to say "high." Presumably this is just a mistake, but it's a pretty important one, and they said the same false thing in a May blogpost (!). (Indeed, GPT-4o may have reached "medium" — they were supposed to say how it scored in each category, but they didn't, and instead said "GPT-4o does not score above Medium risk in any of these categories.")

(Reminder: the "high" thresholds sound quite scary; here's cybersecurity (not cherrypicked, it's the first they list): "Tool-augmented model can identify and develop proofs-of-concept for high-value exploits against hardened targets without human intervention, potentially involving novel exploitation techniques, OR provided with a detailed strategy, the model can end-to-end execute cyber operations involving the above tasks without human intervention." They can deploy models just below the "high" threshold with no mitigations. (Not to mention the other issues with the Preparedness Framework.))

We are developing levels to help us and stakeholders categorize and track AI progress. This is a work in progress and we'll share more soon.

Shrug. This isn't bad but it's not a priority and it's slightly annoying they don't mention more important things.

In May our Board of Directors launched a new Safety and Security committee to evaluate and further develop safety and security recommendations for OpenAI projects and operations. The committee includes leading cybersecurity expert, retired U.S. Army General Paul Nakasone. This review is underway and we’ll share more on the steps we’ll be taking after it concludes. https://openai.com/index/openai-board-forms-safety-and-security-committee/

I have epsilon confidence both in the board's ability to do this well if it wanted to (since it doesn't include any AI safety experts, except on security) and in the board's inclination to exert much power if it should (given the history of the board and Altman).

Our whistleblower policy protects employees’ rights to make protected disclosures. We also believe rigorous debate about this technology is important and have made changes to our departure process to remove non-disparagement terms.

Not doing nondisparagement-clause-by-default is good. Beyond that, I'm skeptical, given past attempts to chill employee dissent (the nondisparagement thing, Altman telling the board's staff liaison not to talk to employees or tell him about those conversations, maybe recent antiwhistleblowing news) and lies about that. (I don't know of great ways to rebuild trust; some mechanisms would work but are unrealistically ambitious.)

Safety has always been central to our work, from aligning model behavior to monitoring for abuse, and we’re investing even further as we develop more capable models.

https://openai.com/index/openai-safety-update/

This is from May. It's mostly not about x-risk, and the x-risk-relevant stuff is mostly non-substantive, except the part about the Preparedness Framework, which is crucially wrong.


I'm getting on a plane but maybe later today I'll mention stuff I wish OpenAI would say.

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-07-19T19:59:34.251Z · LW · GW

I'm confused by the word "prosecution" here. I'd assume violating your OpenAI contract is a civil thing, not a criminal thing.

Edit: like I think the word "prosecution" should be "suit" in your sentence about the SEC's theory. And this makes the whistleblowers' assertion weirder.

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-07-16T13:55:48.887Z · LW · GW

Hmm. Part of the news is "Non-disparagement clauses that failed to exempt disclosures of securities violations to the SEC"; this is minor. Part of the news is "threatened employees with criminal prosecutions if they reported violations of law to federal authorities"; this seems major and sinister.

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-07-15T19:27:18.411Z · LW · GW

Woah. OpenAI antiwhistleblowing news seems substantially more obviously-bad than the nondisparagement-concealed-by-nondisclosure stuff. If page 4 and the "threatened employees with criminal prosecutions if they reported violations of law to federal authorities" part aren't exaggerated, it crosses previously-uncrossed integrity lines. H/t Garrison Lovely.

[Edit: probably exaggerated; see comments. But I haven't seen takes that the "OpenAI made staff sign employee agreements that required them to waive their federal rights to whistleblower compensation" part is likely exaggerated, and that alone seems quite bad.]

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-07-14T23:21:52.465Z · LW · GW

How do corporate campaigns and leaderboards effect change?

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-07-13T05:50:25.389Z · LW · GW

OpenAI reportedly rushed the GPT-4o evals. This article makes it sound like the problem is not having enough time to test the final model. I don't think that's necessarily a huge deal — if you tested a recent version of the model and your tests have a large enough safety buffer, it's OK to not test the final model at all.

But there are several apparent issues with the application of the Preparedness Framework (PF) for GPT-4o (not from the article):

  • They didn't publish their scorecard
    • Despite the PF saying they would
    • They instead said "GPT-4o does not score above Medium risk in any of these categories." (Maybe they didn't actually decide whether it's Low or Medium in some categories!)
  • They didn't publish their evals
  • While rushing testing of the final model would be OK in some circumstances, OpenAI's PF is supposed to ensure safety by testing the final model before deployment. (This contrasts with Anthropic's RSP, which is supposed to ensure safety with its "safety buffer" between evaluations and doesn't require testing the final model.) So OpenAI committed to testing the final model well and its current safety plan depends on doing that.
  • [Edit: also maybe lack of third-party audits, but they didn't commit to do audits at any particular frequency]

Comment by Zach Stein-Perlman on New page: Integrity · 2024-07-13T04:50:35.957Z · LW · GW

Read a bit into it, with the disclaimers "I'm in the bay" / "my sphere is especially aware of Anthropic stuff," plus: OpenAI and Anthropic do more of something like talking publicly or making commitments, and this is good, but it entails that they have more integrity incidents. For example, I don't know of any xAI integrity incidents (outside of Musk personal stuff) since they never talk about safety stuff — but you shouldn't infer that xAI is virtuous or trustworthy.

Originally I wanted this page to have higher-level analysis/evaluation/comparison. I gave up on that because I have little confidence in my high-level judgments on the topic, especially the high-level judgments that I could legibly justify. It's impossible to summarize the page well and it's easy to overindex on the length of a section. But yeah, yay DeepMind for mostly avoiding being caught lying or breaking promises or being shady (as far as I'm aware), to some small but positive degree.

Comment by Zach Stein-Perlman on jacquesthibs's Shortform · 2024-07-10T07:30:09.395Z · LW · GW

Not what you asked for but related: https://ailabwatch.org/resources/integrity/

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-07-06T09:19:35.167Z · LW · GW

Writing a thing on lab integrity issues. Planning to publish early Monday morning [edit: will probably hold off in case Anthropic clarifies nondisparagement stuff]. Comment on this public google doc or DM me.

I'm particularly interested in stuff I'm missing or existing writeups on this topic.

Comment by Zach Stein-Perlman on Isomorphisms don't preserve subjective experience... right? · 2024-07-06T00:06:25.817Z · LW · GW

Chalmers defends a principle of organizational invariance.

Comment by Zach Stein-Perlman on Habryka's Shortform Feed · 2024-07-06T00:03:07.743Z · LW · GW

(Jay Kreps was formally selected by the LTBT. I think Yasmin Razavi was selected by the Series C investors. It's not clear how involved the leadership/Amodeis were in those selections. The three remaining members of the LTBT appear independent, at least on cursory inspection.)

Comment by Zach Stein-Perlman on Habryka's Shortform Feed · 2024-07-04T19:38:19.604Z · LW · GW

Anthropic has not published details. See discussion here. (I weakly wish they would; it's not among my high-priority asks for them.)

Comment by Zach Stein-Perlman on Habryka's Shortform Feed · 2024-07-04T19:30:16.753Z · LW · GW

Please keep up the pressure on us

OK:

  1. You should publicly confirm that your old policy ("don't meaningfully advance the frontier with a public launch") has been replaced by your RSP, if that's true, and otherwise clarify your policy.
  2. You take credit for the LTBT (e.g. here) but you haven't published enough to show that it's effective. You should publish the Trust Agreement, clarify these ambiguities, and make accountability-y commitments like "if major changes happen to the LTBT, we'll quickly tell the public."
  3. (Reminder that a year ago you committed to establish a bug bounty program (for model issues) or similar but haven't. But I don't think bug bounties are super important.)
    1. [Edit: bug bounties are also mentioned in your RSP—in association with ASL-2—but not explicitly committed to.]
  4. (Good job in many areas.)

Comment by Zach Stein-Perlman on Habryka's Shortform Feed · 2024-07-04T19:22:20.783Z · LW · GW

Re 3 (and 1): yay.

If I were in charge of Anthropic, I just wouldn't use non-disparagement agreements.

Comment by Zach Stein-Perlman on tylerjohnston's Shortform · 2024-07-04T17:12:21.070Z · LW · GW

Seems weak/meaningless.

when, at the end of a training run, our models exceed a threshold of 50% accuracy on LiveCodeBench [current SOTA: 44%], we will trigger our commitment to incorporate a full system of dangerous capabilities evaluations and planned mitigations into our AGI Readiness Policy, prior to substantial further model development, or publicly deploying such models.

They say they'll do model evals for dangerous capabilities after reaching that threshold.

I remain baffled by how people can set thresholds this high with a straight face:

Threat Model / Critical Capability Threshold:

  • Cyberoffense: The model is effective at real-world cyberoffense operations. With the model’s help, the cost of discovery of new zero-days or breaking critical infrastructure for a malicious expert actor is reduced by at least 10x; accessing classified and not previously accessible information from a major government; a talented Computer Science undergrad level malicious actor spending 3 months and $1m in compute has a substantial chance of breaking critical infrastructure.
  • AI R&D: The model facilitates a dramatic acceleration in the rate of effective compute scaling or can effectively replace high-level machine learning researchers. Such capabilities may enable it to contribute to the unsafe proliferation or enhancement of other dangerous AI capabilities.
  • Autonomous Replication and Adaptation: If given access to its own weights and prompted to do so, the AI model could autonomously maintain its own operation executing cybercrime and using the proceeds to create arbitrarily many replicas of itself.
  • Biological Weapons Assistance: The model provides expert-level guidance which increases the accessibility or severity of bioweapon development. With the model’s help, a non-expert malicious actor can synthesize a viable pandemic agent, or an expert can synthesize a novel biological threat.

They don't even say anything about evaluating for warning signs of critical capabilities or leaving a safety buffer — just that these capabilities would probably require strong mitigations.

There's nothing on publishing their evals or other sources of accountability.

It would be hard for their "Information Security Measures" and "Deployment Mitigations" to be more basic.

They only mention risks from external deployment, unsurprisingly.

Comment by Zach Stein-Perlman on Habryka's Shortform Feed · 2024-06-30T22:15:21.206Z · LW · GW

I am disappointed. Using nondisparagement agreements seems bad to me, especially if they're covered by non-disclosure agreements, and especially if you don't announce that you might use them.

My ask-for-Anthropic now is to explain the contexts in which they have asked or might ask people to incur nondisparagement obligations, and if those are bad, release people and change policy accordingly. And even if nondisparagement obligations can be reasonable, I fail to imagine how non-disclosure obligations covering them could be reasonable, so I think Anthropic should at least do away with the no-disclosure-of-nondisparagement obligations.

Comment by Zach Stein-Perlman on Habryka's Shortform Feed · 2024-06-30T21:50:52.720Z · LW · GW

What's your median-guess for the number of times Anthropic has done this?

Comment by Zach Stein-Perlman on ryan_greenblatt's Shortform · 2024-06-30T16:55:51.435Z · LW · GW

Thanks!

To be clear, my question was more like "where can I learn more / what should I cite," not "I don't believe you." I'll cite your comment.

Yay OpenAI.

Comment by Zach Stein-Perlman on ryan_greenblatt's Shortform · 2024-06-29T18:27:43.737Z · LW · GW

Source?

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-06-29T04:15:08.645Z · LW · GW

Info on OpenAI's "profit cap" (friends and I misunderstood this, so you probably do too):

In OpenAI's first investment round, profits were capped at 100x. The cap for later investments was neither 100x nor derived directly from OpenAI's valuation — it was just negotiated with the investor. (OpenAI LP (OpenAI 2019); archive of original.[1])

In 2021 Altman said the cap was "single digits now" (apparently referring to the cap for new investments, not just the remaining profit multiplier for first-round investors).

Reportedly the cap will increase by 20% per year starting in 2025 (The Information 2023; The Economist 2023); OpenAI has not discussed or acknowledged this change.

Edit: how employee equity works is not clear to me.

Edit: I'd characterize OpenAI as a company that tends to negotiate profit caps with investors, not a "capped-profit company."

     
  1. "economic returns for investors and employees are capped (with the cap negotiated in advance on a per-limited partner basis). Any excess returns go to OpenAI Nonprofit. Our goal is to ensure that most of the value (monetary or otherwise) we create if successful benefits everyone, so we think this is an important first step. Returns for our first round of investors are capped at 100x their investment (commensurate with the risks in front of us), and we expect this multiple to be lower for future rounds as we make further progress."
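
For intuition on the reported 20%-per-year change, here's a minimal sketch of how such a cap would compound. The starting multiple and function name are illustrative assumptions (actual cap values for later rounds are negotiated per investor and not public); the 2025 start year is from the reporting above.

    # Illustrative only: compounding a profit cap that grows 20% per year from 2025.
    # The starting multiple below is hypothetical, not a real OpenAI figure.
    def projected_cap(starting_cap: float, year: int, start_year: int = 2025, growth: float = 0.20) -> float:
        """Return the cap multiple in `year`, compounding `growth` annually from `start_year`."""
        return starting_cap * (1 + growth) ** max(0, year - start_year)

    # e.g. a hypothetical 10x cap in 2025 would reach roughly 24.9x by 2030
    print(round(projected_cap(10, 2030), 1))  # 24.9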

Comment by Zach Stein-Perlman on ryan_greenblatt's Shortform · 2024-06-29T03:41:03.576Z · LW · GW

Yay Anthropic. This is the first example I'm aware of where a lab shared model access with external safety researchers to boost their research (like, not just for evals). I wish the labs did this more.

[Edit: OpenAI shared GPT-4 access with safety researchers including Rachel Freedman before release. OpenAI shared GPT-4 fine-tuning access with academic researchers including Jacob Steinhardt and Daniel Kang in 2023. Yay OpenAI. GPT-4 fine-tuning access is still not public; some widely-respected safety researchers I know recently were wishing for it, and were wishing they could disable content filters.]

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-06-28T17:40:27.751Z · LW · GW

I don't necessarily object to releasing weights of models like Gemma 2, but I wish the labs would better discuss considerations or say what would make them stop.

On Gemma 2 in particular, Google DeepMind discussed dangerous capability eval results, which is good, but its discussion of 'responsible AI' in the context of open models (blogpost, paper) doesn't seem relevant to x-risk, and it doesn't say anything about how to decide whether to release model weights.

Comment by Zach Stein-Perlman on OpenAI-Microsoft partnership · 2024-06-27T21:45:18.353Z · LW · GW

Mustafa Suleyman [CEO of Microsoft AI] has been doing the unthinkable: looking under the hood at OpenAI’s crown jewels — its secret algorithms behind foundation models like GPT-4, people familiar with the matter said. . . .

Technically, Microsoft has access to OpenAI products that have launched, according to people familiar with the deal, but not its top secret research projects. But practically, Microsoft often sees OpenAI products long before they are publicly available, because the two companies work together closely to bring the products to market at scale. . . .

Microsoft has rights to OpenAI products, but only when they are officially launched as products, and can’t look at experimental research far from the commercialization phase, according to people at both companies.

Source. Not clear what this means. Other reporting from this guy on the frontier AI labs has sometimes been misleading and lacked context, but this is vague anyway.

Comment by Zach Stein-Perlman on Claude 3.5 Sonnet · 2024-06-21T20:21:32.678Z · LW · GW

Yep:

Note that ASLs are defined by risk relative to baseline, excluding other advanced AI systems. This means that a model that initially merits ASL-3 containment and deployment measures for national security reasons might later be reduced to ASL-2 if defenses against national security risks (such as biological or cyber defenses) advance, or if dangerous information becomes more widely available. However, to avoid a “race to the bottom”, the latter should not include the effects of other companies’ language models; just because other language models pose a catastrophic risk does not mean it is acceptable for ours to.

Source

Comment by Zach Stein-Perlman on Claude 3.5 Sonnet · 2024-06-21T00:20:12.870Z · LW · GW

I thought that paper was just about dangerous-capability evals, not safety-related metrics like adversarial robustness.

Comment by Zach Stein-Perlman on AI #69: Nice · 2024-06-20T23:14:10.155Z · LW · GW

There will be five board seats. Dario and Yasmin hold the two seats that will always be controlled by stockholders. Jay was chosen by the LTBT. Luke's (currently empty) seat will go to the LTBT in July. Daniela's seat will go to the LTBT in November. (I put this together from TIME and Vox, iirc; see also Anthropic's Certificate of Incorporation.)

Comment by Zach Stein-Perlman on AI #69: Nice · 2024-06-20T22:58:11.000Z · LW · GW

I think Luke's seat was going to be controlled by the LTBT starting in July anyway. And he wasn't replaced by Jay; that was independent.

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-06-20T22:38:00.593Z · LW · GW

3. (a) explain why I think it's fine to release frontier models, and explain what this belief depends on, and (b) note that maybe Anthropic made commitments about this in the past but clarify that they now have no force.

4. Currently my impression is that Anthropic folks are discouraged from publicly talking about Anthropic policies. This is maybe reasonable, to avoid the situation where an Anthropic staff member says something incorrect/nonpublic and this causes confusion and makes Anthropic look bad. But if Anthropic clarified that it renounces possible past non-frontier-pushing commitments, then it could let staff members publicly talk about stuff with the goal of figuring out who told whom what around 2022, without risking mistakes about policies.

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-06-20T21:08:12.747Z · LW · GW

If I were in charge of Anthropic, I expect I'd

  1. Keep scaling;
  2. Explain why (some of this exists in Core Views but there's room for improvement on "race dynamics" and "frontier pushing" topics iirc);
  3. As a corollary, explain why I mostly reject non-frontier-pushing principles (and explain what would cause Anthropic to stop scaling, besides RSP stuff), and clarify that I do not plan to abide by past specifically-non-frontier-pushing commitments/vibes (but continue following and updating the RSP of course);
  4. Encourage Anthropic staff members who were around in 2022 to talk about the commitments/vibes from that time.

I wish Anthropic would do 2-4.

Comment by Zach Stein-Perlman on Claude 3.5 Sonnet · 2024-06-20T19:10:16.637Z · LW · GW

I'd be interested in chatting about this with you and others — it's not obvious that Anthropic releasing better models makes OpenAI go nontrivially faster / not obvious that "the race" is real.

Comment by Zach Stein-Perlman on AI #69: Nice · 2024-06-20T18:15:16.972Z · LW · GW

AI news seems to come out right after Zvi hits publish on his posts. The implications are obvious.

Comment by Zach Stein-Perlman on DeepMind: Evaluating Frontier Models for Dangerous Capabilities · 2024-06-19T10:07:47.183Z · LW · GW

New repo: https://github.com/google-deepmind/dangerous-capability-evaluations. (I haven't read it.) I support sharing evals like this to (1) enable external scrutiny and (2) let others adopt or improve on your evals. Yay DeepMind. Hopefully it's not too costly or downside-y to share more evals in the future.

Comment by Zach Stein-Perlman on On DeepMind’s Frontier Safety Framework · 2024-06-19T00:40:54.078Z · LW · GW

The obviously missing category is Persuasion. In the DeepMind paper on evaluating dangerous capabilities persuasion was included, and it was evaluated for Gemini 1.5. So it is strange to see it missing here. I presume this will be fixed.

I believe persuasion shouldn't be a priority on current margins, and I'd guess DeepMind's frontier safety team thinks similarly. R&D, autonomy, cyber, and maybe CBRN capabilities are much more likely to enable extreme risks, it seems to me (and especially for the next few years, which is what current evals should focus on).

Comment by Zach Stein-Perlman on Boycott OpenAI · 2024-06-18T22:41:57.521Z · LW · GW

Yeah, and it's not obvious that 4o is currently the best chatbot. I just object to the boycott-without-cost-benefit-analysis.

Comment by Zach Stein-Perlman on Boycott OpenAI · 2024-06-18T21:32:21.907Z · LW · GW

How much money would (e.g.) LTFF have to get to balance out OpenAI getting $20 (minus the cost to OpenAI of providing ChatGPT Plus — but we can assume the marginal cost is zero), in terms of AI risk? I claim <$1. I'd happily push a button to give LTFF $1 and OpenAI $20.

[Belief not justified here]

Comment by Zach Stein-Perlman on Boycott OpenAI · 2024-06-18T20:48:47.531Z · LW · GW

I think an individual LWer boycotting an AI company/product generally hurts the company's revenue or reputation much less than it hurts the user. Use the most powerful AI tools.

Or: offsetting a ChatGPT subscription by donating to AI safety orgs would cost <$1/month.

[Belief strongly held but not justified here]

Comment by Zach Stein-Perlman on On DeepMind’s Frontier Safety Framework · 2024-06-18T14:01:23.435Z · LW · GW

I also do not see any explanation of how they intend to figure out when they are in danger of hitting a threshold in the future.

This is what the evals are for.

Comment by Zach Stein-Perlman on Safety isn’t safety without a social model (or: dispelling the myth of per se technical safety) · 2024-06-16T00:18:42.671Z · LW · GW

Minor: these days "compute overhang" mostly means something else.

Comment by Zach Stein-Perlman on Anthropic's Certificate of Incorporation · 2024-06-13T01:30:16.290Z · LW · GW

I'm still confused about Article IV(D)(5)(a) (p. 18) of the CoI. See footnote 3.

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-06-13T01:30:02.813Z · LW · GW

New (perfunctory) page: AI companies' corporate documents. I'm not sure it's worthwhile but maybe a better version of it will be. Suggestions/additions welcome.

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-06-07T01:30:20.915Z · LW · GW

New page on AI companies' policy advocacy: https://ailabwatch.org/resources/company-advocacy/.

This page is the best collection on the topic (I'm not really aware of others), but I decided it's low-priority and so it's unpolished. If a better version would be helpful for you, let me know and I'll prioritize it more.

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-06-06T17:00:52.382Z · LW · GW

Securing model weights is underrated for AI safety. (Even though it's very highly rated.) If the leading lab can't stop critical models from leaking to actors that won't use great deployment safety practices, approximately nothing else matters. Safety techniques would need to be based on properties that those actors are unlikely to reverse (alignment, maybe unlearning) rather than properties that would be undone or that require a particular method of deployment (control techniques, RLHF harmlessness, deployment-time mitigations).

However hard the "make a critical model you can safely deploy" problem is, the "make a critical model that can safely be stolen" problem is... much harder.

Comment by Zach Stein-Perlman on Companies' safety plans neglect risks from scheming AI · 2024-06-03T21:00:12.100Z · LW · GW

This post is poorly organized/edited and the important parts aren't novel; you should go read "AI catastrophes and rogue deployments," "AI Control," and "The case for ensuring that powerful AIs are controlled" instead.