Posts

Anthropic leadership conversation 2024-12-20T22:00:45.229Z
o3 2024-12-20T18:30:29.448Z
DeepSeek beats o1-preview on math, ties on coding; will release weights 2024-11-20T23:50:26.597Z
Anthropic: Three Sketches of ASL-4 Safety Case Components 2024-11-06T16:00:06.940Z
The current state of RSPs 2024-11-04T16:00:42.630Z
Miles Brundage: Finding Ways to Credibly Signal the Benignness of AI Development and Deployment is an Urgent Priority 2024-10-28T17:00:18.660Z
UK AISI: Early lessons from evaluating frontier AI systems 2024-10-25T19:00:21.689Z
Lab governance reading list 2024-10-25T18:00:28.346Z
IAPS: Mapping Technical Safety Research at AI Companies 2024-10-24T20:30:41.159Z
What AI companies should do: Some rough ideas 2024-10-21T14:00:10.412Z
Anthropic rewrote its RSP 2024-10-15T14:25:12.518Z
Model evals for dangerous capabilities 2024-09-23T11:00:00.866Z
OpenAI o1 2024-09-12T17:30:31.958Z
Demis Hassabis — Google DeepMind: The Podcast 2024-08-16T00:00:04.712Z
GPT-4o System Card 2024-08-08T20:30:52.633Z
AI labs can boost external safety research 2024-07-31T19:30:16.207Z
Safety consultations for AI lab employees 2024-07-27T15:00:27.276Z
New page: Integrity 2024-07-10T15:00:41.050Z
Claude 3.5 Sonnet 2024-06-20T18:00:35.443Z
Anthropic's Certificate of Incorporation 2024-06-12T13:00:30.806Z
Companies' safety plans neglect risks from scheming AI 2024-06-03T15:00:20.236Z
AI companies' commitments 2024-05-29T11:00:31.339Z
Maybe Anthropic's Long-Term Benefit Trust is powerless 2024-05-27T13:00:47.991Z
AI companies aren't really using external evaluators 2024-05-24T16:01:21.184Z
New voluntary commitments (AI Seoul Summit) 2024-05-21T11:00:41.794Z
DeepMind's "Frontier Safety Framework" is weak and unambitious 2024-05-18T03:00:13.541Z
DeepMind: Frontier Safety Framework 2024-05-17T17:30:02.504Z
Ilya Sutskever and Jan Leike resign from OpenAI [updated] 2024-05-15T00:45:02.436Z
Questions for labs 2024-04-30T22:15:55.362Z
Introducing AI Lab Watch 2024-04-30T17:00:12.652Z
Staged release 2024-04-17T16:00:19.402Z
DeepMind: Evaluating Frontier Models for Dangerous Capabilities 2024-03-21T03:00:31.599Z
OpenAI: Preparedness framework 2023-12-18T18:30:10.153Z
Anthropic, Google, Microsoft & OpenAI announce Executive Director of the Frontier Model Forum & over $10 million for a new AI Safety Fund 2023-10-25T15:20:52.765Z
OpenAI-Microsoft partnership 2023-10-03T20:01:44.795Z
Current AI safety techniques? 2023-10-03T19:30:54.481Z
ARC Evals: Responsible Scaling Policies 2023-09-28T04:30:37.140Z
How to think about slowing AI 2023-09-17T16:00:42.150Z
Cruxes for overhang 2023-09-14T17:00:56.609Z
Cruxes on US lead for some domestic AI regulation 2023-09-10T18:00:06.959Z
Which paths to powerful AI should be boosted? 2023-08-23T16:00:00.790Z
Which possible AI systems are relatively safe? 2023-08-21T17:00:27.582Z
AI labs' requests for input 2023-08-18T17:00:26.377Z
Boxing 2023-08-02T23:38:36.119Z
Frontier Model Forum 2023-07-26T14:30:02.018Z
My favorite AI governance research this year so far 2023-07-23T16:30:00.558Z
Incident reporting for AI safety 2023-07-19T17:00:57.429Z
Frontier AI Regulation 2023-07-10T14:30:06.366Z
AI labs' statements on governance 2023-07-04T16:30:01.624Z
DeepMind: Model evaluation for extreme risks 2023-05-25T03:00:00.915Z

Comments

Comment by Zach Stein-Perlman on Anthropic leadership conversation · 2024-12-22T15:30:15.826Z · LW · GW

All of the founders committed to donate 80% of their equity. I heard it's set aside in some way but they haven't donated anything yet. (Source: Anthropic friends.)

This fact wasn't on the internet, or at least wasn't easily findable via Google search. Huh. I can only find Holden mentioning that 80% of Daniela's equity is pledged.

Comment by Zach Stein-Perlman on Mark Xu's Shortform · 2024-12-22T06:08:22.796Z · LW · GW

I disagree with Ben. I think the usage that Mark is referring to is a reference to Death with Dignity. A central example of my usage is

it would be undignified if AI takes over because we didn't really try off-policy probes; maybe they just work; someone should figure that out

It's playful and unserious but "X would be undignified" roughly means "it would be an unfortunate error if we did X or let X happen" and is used in the context of AI doom and our ability to affect P(doom).

Comment by Zach Stein-Perlman on o3 · 2024-12-22T03:16:51.767Z · LW · GW

Edit: wait, likely it's RL; I'm confused

OpenAI didn't fine-tune on ARC-AGI, even though this graph suggests they did.

Sources:

Altman said

we didn't go do specific work [targeting ARC-AGI]; this is just the general effort.

François Chollet (in the blogpost with the graph) said

Note on "tuned": OpenAI shared they trained the o3 we tested on 75% of the Public Training set. They have not shared more details. We have not yet tested the ARC-untrained model to understand how much of the performance is due to ARC-AGI data.

and

The version of the model we tested was domain-adapted to ARC-AGI via the public training set (which is what the public training set is for). As far as I can tell they didn't generate synthetic ARC data to improve their score.

An OpenAI staff member replied

Correct, can confirm "targeting" exclusively means including a (subset of) the public training set.

and further confirmed that "tuned" in the graph is

a strange way of denoting that we included ARC training examples in the O3 training. It isn’t some finetuned version of O3 though. It is just O3.

Another OpenAI staff member said

also: the model we used for all of our o3 evals is fully general; a subset of the arc-agi public training set was a tiny fraction of the broader o3 train distribution, and we didn’t do any additional domain-specific fine-tuning on the final checkpoint

So on ARC-AGI they just pretrained on 300 examples (75% of the 400 in the public training set). Performance is surprisingly good.

[heavily edited after first posting]

Comment by Zach Stein-Perlman on o3 · 2024-12-21T21:11:41.392Z · LW · GW

Welcome!

To me the benchmark scores are interesting mostly because they suggest that o3 is substantially more powerful than previous models. I agree we can't naively translate benchmark scores to real-world capabilities.

Comment by Zach Stein-Perlman on Anthropic leadership conversation · 2024-12-21T01:59:44.055Z · LW · GW

I think he's just referring to DC evals, and I think this is wrong because I don't think Anthropic was really the cause of other companies doing evals (but I could be unaware of relevant facts).

Edit: actually I don't know what he's referring to.

Comment by Zach Stein-Perlman on Anthropic leadership conversation · 2024-12-21T01:58:01.739Z · LW · GW

I use empty brackets similar to ellipses in this context; they denote removed nonsubstantive text. (I use ellipses when removing substantive text.)

Comment by Zach Stein-Perlman on o3 · 2024-12-21T01:03:52.642Z · LW · GW

I think they only have formal high and low versions for o3-mini.

Edit: nevermind idk

Comment by Zach Stein-Perlman on Anthropic leadership conversation · 2024-12-21T01:02:52.640Z · LW · GW

Done, thanks.

Comment by Zach Stein-Perlman on Anthropic leadership conversation · 2024-12-20T22:36:05.299Z · LW · GW

I already edited out most of the "like"s and similar. I intentionally left some in when they seemed like they might be hedging or otherwise communicating this isn't exact. You are free to post your own version but not to edit mine.

Edit: actually I did another pass and edited out several more; thanks for the nudge.

Comment by Zach Stein-Perlman on o3 · 2024-12-20T21:51:56.247Z · LW · GW

pass@n is not cheating if answers are easy to verify. E.g. if you can cheaply/quickly verify that code works, pass@n is fine for coding.
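
For concreteness, a minimal sketch of what verified pass@n looks like for coding; the function names here are hypothetical placeholders for the sampler and whatever cheap verifier (e.g. unit tests) you have:

```python
# Hypothetical sketch: pass@n with a cheap verifier.
# `generate_candidate` and `passes_tests` are placeholder functions, not a real API.

def verified_pass_at_n(problem, generate_candidate, passes_tests, n=64):
    """Sample up to n candidate solutions; return the first that the verifier accepts."""
    for _ in range(n):
        candidate = generate_candidate(problem)
        if passes_tests(candidate, problem):  # cheap/quick verification, e.g. run unit tests
            return candidate
    return None  # no verified solution found within n samples
```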

Comment by Zach Stein-Perlman on o3 · 2024-12-20T20:52:08.414Z · LW · GW

It was one submission, apparently.

Comment by Zach Stein-Perlman on o3 · 2024-12-20T20:51:25.796Z · LW · GW

and then having the model look at all the runs and try to figure out which run had the most compelling-seeming reasoning

The FrontierMath answers are numerical-ish ("problems have large numerical answers or complex mathematical objects as solutions"), so you can just check which answer the model wrote most frequently.
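
(In code, this is just majority voting over the sampled answers; a sketch, nothing OpenAI-specific:)

```python
from collections import Counter

def most_frequent_answer(answers):
    """Pick the answer produced most often across the n runs (self-consistency)."""
    return Counter(answers).most_common(1)[0][0]

# e.g. most_frequent_answer(["842", "842", "117", "842"]) == "842"
```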

Comment by Zach Stein-Perlman on o3 · 2024-12-20T20:37:30.495Z · LW · GW

The obvious boring guess is best-of-n. Maybe you're asserting that using $4,000 implies they're doing more than that.

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-12-16T22:29:14.402Z · LW · GW

My guess is they do kinda choose: in training, it's less like two debaters are assigned opposing human-written positions and more like the debate is between two sampled outputs.

Edit: maybe this is different in procedures different from the one Rohin outlined.

Comment by Zach Stein-Perlman on [deleted post] 2024-12-16T21:32:31.794Z

The more important zoom-level is: debate is a proposed technique to provide a good training signal. See e.g. https://www.lesswrong.com/posts/eq2aJt8ZqMaGhBu3r/zach-stein-perlman-s-shortform?commentId=DLYDeiumQPWv4pdZ4.

Edit: debate is a technique for iterated amplification -- but that tag is terrible too, oh no

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-12-16T20:57:36.674Z · LW · GW

I asked Rohin Shah what the debate agenda is about and he said (off the cuff, not necessarily worded well) (context omitted) (emphasis mine):

Suppose you have any task where given an input x the AI has to produce some output y. (Can be question answering, arbitrary chatbot stuff, being an agent, whatever.)

Debate to me is a family of procedures for training AIs in such settings. One example such procedure is: sample two possible outputs y1 and y2, then have two AIs debate each other about which output is better (one assigned to y1 and the other assigned to y2, typically both are instances of the same underlying weights). The winner of the debate is determined by a judge. You then train the debaters with RL (+1 reward for the winner, -1 reward for the loser), and treat (y_winner, y_loser) as a preference pair for training on the original task (which you could then optimize using DPO or similar).

The hope is that the debaters are incentivized to simply say which output is better, to the best of their ability, and so this allows you to provide a reward signal that is as good as can be provided given the capabilities of the debater AI. (Ideally the debater AI is similarly capable as the AI doing the task; often you’ll want these to be the same AI system.)

More broadly the problem I aim to solve with debate (or amplified oversight more broadly) is “provide a reward signal that the AI cannot hack” (though ofc the AI may still misgeneralize)

The hope is that decreases p(misaligned) — especially p(reward hacking), but probably also p(scheming)

You could also use the resulting debaters at inference time to audit especially suspicious actions (run the debate again, ideally it points out why the given action is good / bad)
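
For concreteness, here's a minimal sketch of one training step of the procedure Rohin describes; the policy/debater/judge interfaces are hypothetical placeholders, not DeepMind's actual implementation:

```python
# Hypothetical sketch of one debate-training step, following the description above.
# `policy`, `debater`, and `judge` are placeholder interfaces, not a real API.

def debate_training_step(x, policy, debater, judge, n_turns=4):
    # Sample two candidate outputs for input x.
    y1, y2 = policy.sample(x), policy.sample(x)

    # Two debaters (typically instances of the same underlying weights) argue
    # for their assigned output.
    transcript = []
    for _ in range(n_turns):
        transcript.append(debater.argue(x, assigned=y1, opponent=y2, transcript=transcript))
        transcript.append(debater.argue(x, assigned=y2, opponent=y1, transcript=transcript))

    # A judge determines the winner; debaters get +1 / -1 RL reward.
    winner = judge.pick_winner(x, y1, y2, transcript)  # returns 1 or 2
    debater_rewards = (+1, -1) if winner == 1 else (-1, +1)

    # The outcome also yields a preference pair for training on the original task
    # (e.g. with DPO or similar).
    y_winner, y_loser = (y1, y2) if winner == 1 else (y2, y1)
    preference_pair = (x, y_winner, y_loser)

    return debater_rewards, preference_pair
```

(This matches the shape of the quote: the debate outcome both trains the debaters with RL and supplies the preference signal for the original task.)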

Comment by Zach Stein-Perlman on ARC Evals: Responsible Scaling Policies · 2024-12-14T23:45:29.382Z · LW · GW

This post presented the idea of RSPs and detailed thoughts on them, just after Anthropic's RSP was published. It's since become clear that nobody knows how to write an RSP that's predictably neither way too aggressive nor super weak. But this post, along with the accompanying Key Components of an RSP, is still helpful, I think.

Comment by Zach Stein-Perlman on DeepMind: Model evaluation for extreme risks · 2024-12-14T23:45:11.627Z · LW · GW

This is the classic paper on model evals for dangerous capabilities.

On a skim, it's aged well; I still agree with its recommendations and framing of evals. One big exception: it recommends "alignment evaluations" to determine models' propensity for misalignment, but such evals can't really provide much evidence against catastrophic misalignment; better to assume AIs are misaligned and use control once dangerous capabilities appear, until much better misalignment-measuring techniques exist.

Comment by Zach Stein-Perlman on How much do you believe your results? · 2024-12-11T06:05:04.452Z · LW · GW

Interesting point, written up really really well. I don't think this post was practically useful for me but it's a good post regardless.

Comment by Zach Stein-Perlman on Thoughts on sharing information about language model capabilities · 2024-12-11T05:45:47.954Z · LW · GW

This post helped me distinguish capabilities-y information that's bad to share from capabilities-y information that's fine/good to share. (Base-model training techniques are bad; evals and eval results are good; scaffolding/prompting/posttraining techniques to elicit more powerful capabilities without more spooky black-box cognition are fine/good.)

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-12-11T05:25:35.362Z · LW · GW

To avoid deploying a dangerous model, you can either (1) test the model pre-deployment or (2) test a similar older model with tests that have a safety buffer such that if the old model is below some conservative threshold it's very unlikely that the new model is dangerous.

DeepMind says it uses the safety-buffer plan (but it hasn't yet said it has operationalized thresholds/buffers).

Anthropic's original RSP used the safety-buffer plan; its new RSP doesn't really use either plan (kinda safety-buffer but it's very weak). (This is unfortunate.)

OpenAI seemed to use the test-the-actual-model plan.[1] This isn't going well. The 4o evals were rushed because OpenAI (reasonably) didn't want to delay deployment. Then the o1 evals were done on a weak o1 checkpoint rather than the final model, presumably so they wouldn't be rushed, but this presumably hurt performance a lot on some tasks (and indeed the o1 checkpoint performed worse than o1-preview on some capability evals). OpenAI doesn't seem to be implementing the safety-buffer plan, so if a model is dangerous but not super obviously dangerous, it seems likely OpenAI wouldn't notice before deployment....

(Yay OpenAI for honestly publishing eval results that don't look good.)

  1. ^

    It's not explicit. The PF says e.g. 'Only models with a post-mitigation score of "medium" or below can be deployed.' But it also mentions forecasting capabilities.

Comment by Zach Stein-Perlman on Untrusted smart models and trusted dumb models · 2024-12-10T06:30:21.505Z · LW · GW

This early control post introduced super important ideas: trusted monitoring plus the general point

if you think about approaches to safely using AIs that are robust to deceptive alignment and which can be used right now, a core dynamic is trying to use your dumb trusted models together with your limited access to trusted humans to make it hard for your untrusted smart models to cause problems.

Comment by Zach Stein-Perlman on The current state of RSPs · 2024-12-09T00:03:41.117Z · LW · GW

Briefly:

  1. For OpenAI, I claim the cyber, CBRN, and persuasion Critical thresholds are very high (and also the cyber High threshold). I agree the autonomy Critical threshold doesn't feel so high.
  2. For Anthropic, most of the action is at ASL-4+, and they haven't even defined the ASL-4 standard yet. (So you can think of the current ASL-4 thresholds as infinitely high. I don't think "The thresholds are very high" for OpenAI was meant to imply a comparison to Anthropic; it's hard to compare since ASL-4 doesn't exist. Sorry for confusion.)

Comment by Zach Stein-Perlman on Habryka's Shortform Feed · 2024-12-08T03:23:43.626Z · LW · GW

Edit 2: after checking, I now believe the data strongly suggest FTX had a large negative effect on EA community metrics. (I still agree with Buck: "I don't like the fact that this essay is a mix of an insightful generic argument and a contentious specific empirical claim that I don't think you support strongly; it feels like the rhetorical strength of the former lends credence to the latter in a way that isn't very truth-tracking." And I disagree with habryka's claims that the effect of FTX is obvious.)


practically all metrics of the EA community's health and growth have sharply declined, and the extremely large and negative reputational effects have become clear.

I want more evidence on your claim that FTX had a major effect on EA reputation. Or: why do you believe it?


Edit: relevant thing habryka said that I didn't quote above:

For the EA surveys, these indicators looked very bleak: 

"Results demonstrated that FTX had decreased satisfaction by 0.5-1 points on a 10-point scale within the EA community"

"Among those aware of EA, attitudes remain positive and actually maybe increased post-FTX —though they were lower (d = -1.5, with large uncertainty) among those who were additionally aware of FTX."

"Most respondents reported continuing to trust EA organizations, though over 30% said they had substantially lost trust in EA public figures or leadership."

Comment by Zach Stein-Perlman on Model evals for dangerous capabilities · 2024-11-21T22:02:00.528Z · LW · GW

Another possible best practice for evals: use human+AI rather than [edit: just] AI alone. Many threat models involve human+AI and sometimes human+AI is substantially stronger than human alone and AI alone.

Comment by Zach Stein-Perlman on DeepSeek beats o1-preview on math, ties on coding; will release weights · 2024-11-21T02:40:10.847Z · LW · GW

Claim by SemiAnalysis guy: DeepSeek has over 50,000 H100s. (And presumably they don't spend much compute on inference.)

Comment by Zach Stein-Perlman on DeepSeek beats o1-preview on math, ties on coding; will release weights · 2024-11-21T02:40:02.374Z · LW · GW

Some context: interview translation; June FT article on DeepSeek; parent hedge fund Wikipedia page.

Comment by Zach Stein-Perlman on Why imperfect adversarial robustness doesn't doom AI control · 2024-11-18T22:35:05.522Z · LW · GW

I want to distinguish two versions of the intervention (in multi-step settings): (1) hide the CoT on the suspicious step vs. (2) hide all prior CoT (and hide all CoT partway through ~5% of non-suspicious trajectories so the model can't infer from hidden CoT that the trajectory was flagged, or do a control eval with results that demonstrate this is unnecessary).

I'm over 75% that a lab would do (1), conditional on doing a "replace suspicious outputs" thing (but under 75% they'd say they're doing so, much less demonstrate it to a skeptical audience; they might just not publish details, especially if this protocol is just in the internal deployment context). I'm under 75% that a lab would do (2).

Comment by Zach Stein-Perlman on Why imperfect adversarial robustness doesn't doom AI control · 2024-11-18T22:02:27.805Z · LW · GW

It would be crazy to show the untrusted model the old CoT when resampling.

Comment by Zach Stein-Perlman on [deleted post] 2024-11-18T05:02:55.099Z

Open Philanthropy's AI safety work tends toward large grants in the high hundreds of thousands or low millions, meaning individuals and organizations with lower funding needs won't be funded by them

This is directionally true but projects seeking less money should still apply to OP if relevant; a substantial minority of recent OP AI safety and GCR capacity building grants are <$100K.

Comment by Zach Stein-Perlman on Buck's Shortform · 2024-11-14T18:50:12.813Z · LW · GW

What's a bad thing that the agent can do with the jailbreak that it can't just do in the unmonitored scaffold?

Edit: dumb question, oops: obviously the schemer will have affordances that it doesn't have within the unmonitored scaffold / there are actions where it's helpful to jailbreak the monitor. E.g. submitting a pull request, where the code will be read by the monitor. So finding a jailbreak is helpful to the schemer.

Comment by Zach Stein-Perlman on Buck's Shortform · 2024-11-14T18:24:46.534Z · LW · GW

Why can't the schemer just do the bad stuff in the unmonitored scaffold from (1)?

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-11-09T04:37:19.378Z · LW · GW

We can make scheming legible at least to sophisticated-scheming-skeptics like Sam Altman and Dario.

Especially if the AI is even sandbagging on simply-coding when it thinks it's for safety research. And if it's not doing that, we can get some useful safety work out of it.

@Adam Kaufman @Tyler Tracy @David Matolcsi see Ryan's comments.

Comment by Zach Stein-Perlman on evhub's Shortform · 2024-11-09T04:13:46.853Z · LW · GW

I mostly agree. But I think some kinds of autonomous weapons would make loss-of-control and coups easier. On the other hand, boosting US security is good, so the net effect is unclear. And that's very far from the recent news (and Anthropic has a Usage Policy, with exceptions, which disallows various uses — my guess is this is too strong on weapons).

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-11-09T03:00:38.529Z · LW · GW

I just claim that if so it's easy to catch. (And then (1) catching it is nice for various reasons and (2) maybe countermeasures work.)

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-11-08T20:30:10.031Z · LW · GW

I brainstormed sabotage-related threats with friends:

  • Sandbag on a few safety research tasks, especially techniques to catch scheming and stuff that's high-level or has poor feedback loops
    • Plausible; maybe not so bad
  • Sandbag on ~all safety research tasks
    • Easy to catch; catching obvious sandbagging has safety benefits + maybe we can posttrain it away
  • Sandbag/lie when validating that a model/deployment/whatever is safe
    • idk
  • When automating AI R&D, backdoor the model
    • (this is takeover-attempt-y, not sandbagging-y)
    • idk
  • When deployed to automate/improve security (especially for the model's own weights): both sandbagging and code-backdooring/hacking/etc.
    • idk
  • Sandbagging on AI R&D to help another lab win the race (!?)
    • Implausible
  • ??

Are people quite worried about safety research sabotage? Why? Do they think a (say) 20% hit to automated safety research is a huge deal, or do they expect a larger hit?

Comment by Zach Stein-Perlman on Anthropic: Three Sketches of ASL-4 Safety Case Components · 2024-11-07T17:10:19.422Z · LW · GW

See [V1.1]

Comment by Zach Stein-Perlman on Anthropic: Three Sketches of ASL-4 Safety Case Components · 2024-11-06T16:45:07.175Z · LW · GW

When someone describes a safety case where one of the steps is infeasible, here are two possible reactions:

  1. Step X will be ~impossible to justify; hopefully the lab will understand that (if nothing else then when they try to justify step X) and look for a different kind of safety case
  2. Step X will be ~impossible to justify, but the authors seem to think the plan requiring it is decent; this suggests that they don't understand the issue and when they make a safety case they'll likely think they're justifying step X when they're not

Regardless, if you think a step is very hard or the authors seem confused, ideally you should explain why. And then hopefully the authors say "yeah oops that's too hard" or "yeah we're imagining doing the really hard thing" or "we don't think it will be so hard: we agree observation O would suffice and we expect it's easy and if it turns out to be hard we'll have to make a different safety case" or "we disagree, here's why." (I'm mostly observing this independent of this post.)

[Edited]

Comment by Zach Stein-Perlman on Anthropic: Three Sketches of ASL-4 Safety Case Components · 2024-11-06T16:00:25.325Z · LW · GW

My initial reactions on a quick read:

  1. Interp: no inside view; my impression is that my friends are pessimistic about such safety cases in the next few years
  2. Control: hooray for Anthropic finally writing something about control; some details seem good; some details seem odd ("contexts that were specifically whitelisted"); I think the naive pre-deployment honeypots path doesn't work because a scheming model wins with the "defect rarely" strategy (related: https://www.lesswrong.com/posts/vYWDEtJ2b8tkbG7Rv/behavioral-red-teaming-is-unlikely-to-produce-clear-strong) [very sophisticated or expensive honeypotting setups could work but the post doesn't seem to be imagining that or noticing the issue]; that's important and the current language in the post is concerning
  3. Incentives: I don't really understand, or I'm dubious, or I expect that A3.1 is very hard to demonstrate

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-10-31T21:00:53.706Z · LW · GW

Anthropic: The case for targeted regulation.

I like the "Urgency" section.

Anthropic says the US government should "require companies to have and publish RSP-like policies" and "incentivize companies to develop RSPs that are effective" and do related things. I worry that RSPs will be ineffective for most companies, even given some transparency/accountability and a vague minimum standard.

Edit: some strong versions of "incentivize companies to develop RSPs that are effective" would be great; I wish Anthropic advocated for them specifically rather than hoping the US government figures out a great answer.

Comment by Zach Stein-Perlman on AI #88: Thanks for the Memos · 2024-10-31T18:53:14.628Z · LW · GW

After Miles Brundage left to do non-profit work, OpenAI disbanded the “AGI Readiness” team he had been leading, after previously disbanding the Superalignment team and reassigning the head of the preparedness team. I do worry both about what this implies, and that Miles Brundage may have made a mistake leaving given his position.

Do we know that this is the causal story? I think "OpenAI decided to disempower/disband/ignore Readiness, and so Miles left" is likely.

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-10-30T19:00:59.774Z · LW · GW

Some not-super-ambitious asks for labs (in progress):

  • Do evals on dangerous-capabilities-y and agency-y tasks; look at the score before releasing or internally deploying the model
  • Have a high-level capability threshold at which securing model weights is very important
  • Do safety research at all
  • Have a safety plan at all; talk publicly about safety strategy at all
    • Have a post like Core Views on AI Safety
    • Have a post like The Checklist
    • On the website, clearly describe a worst-case plausible outcome from AI and state credence in such outcomes (perhaps unconditionally, conditional on no major government action, and/or conditional on leading labs being unable or unwilling to pay a large alignment tax).
  • Whistleblower protections?
    • [Not sure what the ask is. Right to Warn is a starting point. In particular, an anonymous internal reporting pipeline for failures to comply with safety policies is clearly good (but likely inadequate).]
  • Publicly explain the processes and governance structures that determine deployment decisions
    • (And ideally make those processes and structures good)

Comment by Zach Stein-Perlman on Miles Brundage: Finding Ways to Credibly Signal the Benignness of AI Development and Deployment is an Urgent Priority · 2024-10-28T23:10:09.066Z · LW · GW

demonstrating its benignness is not the issue, it's actually making them benign

This also stood out to me.

Demonstrating benignness is roughly equivalent to being transparent enough that the other actor can determine whether you're benign + actually being benign.

For now, model evals for dangerous capabilities (along with assurance that you're evaluating the lab's most powerful model + there are no secret frontier AI projects) could suffice to demonstrate benignness. After they no longer work, even just getting good evidence that your AI project [model + risk assessment policy + safety techniques + deployment plans + weights/code security + etc.] is benign is an open problem, nevermind the problem of proving that to other actors (nevermind doing so securely and at reasonable cost).

Two specific places I think are misleading:

Unfortunately, it is inherently difficult to credibly signal the benignness of AI development and deployment (at a system level or an organization level) due to AI's status as a general purpose technology, and because the information required to demonstrate benignness may compromise security, privacy, or intellectual property. This makes the research, development, and piloting of new ways to credibly signal the benignness of AI development and deployment, without causing other major problems in the process, an urgent priority.

. . .

We need to define what information is needed in order to convey benignness without unduly compromising intellectual property, privacy, or security.

The bigger problem is (building powerful AI such that your project is benign and) getting strong evidence for yourself that your project is benign. But once you're there, demonstrating benignness to other actors is another problem, lesser but potentially important and under-considered.

Or: this post mostly sets aside the alignment/control problem, and it should have signposted that it was doing so better, but it's still a good post on another problem.

Comment by Zach Stein-Perlman on IAPS: Mapping Technical Safety Research at AI Companies · 2024-10-26T21:31:32.594Z · LW · GW

@Zoe Williams 

Comment by Zach Stein-Perlman on Anthropic rewrote its RSP · 2024-10-25T18:55:16.425Z · LW · GW

Weaker:

  • AI R&D threshold: yes, the threshold is much higher
  • CBRN threshold: not really, except maybe the new threshold excludes risk from moderately sophisticated nonstate actors
  • ASL-3 deployment standard: yes; the changes don't feel huge but the new standard doesn't feel very meaningful
  • ASL-3 security standard: no (but both old and new are quite vague)

Vaguer: yes. But the old RSP didn't really have "precisely defined capability thresholds and mitigation measures." (The ARA threshold did have that 50% definition, but another part of the RSP suggested those tasks were merely illustrative.)

(New RSP, old RSP.)

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-10-24T22:31:17.620Z · LW · GW

Internet comments sections where the comments are frequently insightful/articulate and not too full of obvious garbage [or at least the top comments, given the comment-sorting algorithm]:

  • LessWrong and EA Forum
  • Daily Nous
  • Stack Exchange [top comments only] (not asserting that it's reliable)
  • Some subreddits?

Where else?

Are there blogposts on adjacent stuff, like why some internet [comments sections / internet-communities-qua-places-for-truthseeking] are better than others? Besides Well-Kept Gardens Die By Pacifism.

Comment by Zach Stein-Perlman on IAPS: Mapping Technical Safety Research at AI Companies · 2024-10-24T21:07:46.141Z · LW · GW

How does this paper suggest "that most safety research doesn't come from the top labs"?

Comment by Zach Stein-Perlman on What AI companies should do: Some rough ideas · 2024-10-22T04:50:49.019Z · LW · GW

Nice. I mostly agree on current margins — the more I mistrust a lab, the more I like transparency.

I observe that unilateral transparency is unnecessary on the view-from-the-inside if you know you're a reliably responsible lab. And some forms of transparency are costly. So the more responsible a lab seems, the more sympathetic we should be to it saying "we thought really hard and decided more unilateral transparency wouldn't be optimific." (For some forms of transparency.)

Comment by Zach Stein-Perlman on What AI companies should do: Some rough ideas · 2024-10-21T20:15:12.685Z · LW · GW

A related/overlapping thing I should collect: basic low-hanging fruit for safety. E.g.:

  • Have a high-level capability threshold at which securing model weights is very important
  • Evaluate whether your models are capable at agentic tasks; look at the score before releasing or internally deploying them

Comment by Zach Stein-Perlman on What AI companies should do: Some rough ideas · 2024-10-21T14:00:44.284Z · LW · GW

I may edit this post in a week based on feedback. You can suggest changes that don't merit their own LW comments here. That said, I'm not sure what feedback I want.