Posts

List of AI safety papers from companies, 2023–2024 2025-01-15T18:00:30.242Z
Anthropic leadership conversation 2024-12-20T22:00:45.229Z
o3 2024-12-20T18:30:29.448Z
DeepSeek beats o1-preview on math, ties on coding; will release weights 2024-11-20T23:50:26.597Z
Anthropic: Three Sketches of ASL-4 Safety Case Components 2024-11-06T16:00:06.940Z
The current state of RSPs 2024-11-04T16:00:42.630Z
Miles Brundage: Finding Ways to Credibly Signal the Benignness of AI Development and Deployment is an Urgent Priority 2024-10-28T17:00:18.660Z
UK AISI: Early lessons from evaluating frontier AI systems 2024-10-25T19:00:21.689Z
Lab governance reading list 2024-10-25T18:00:28.346Z
IAPS: Mapping Technical Safety Research at AI Companies 2024-10-24T20:30:41.159Z
What AI companies should do: Some rough ideas 2024-10-21T14:00:10.412Z
Anthropic rewrote its RSP 2024-10-15T14:25:12.518Z
Model evals for dangerous capabilities 2024-09-23T11:00:00.866Z
OpenAI o1 2024-09-12T17:30:31.958Z
Demis Hassabis — Google DeepMind: The Podcast 2024-08-16T00:00:04.712Z
GPT-4o System Card 2024-08-08T20:30:52.633Z
AI labs can boost external safety research 2024-07-31T19:30:16.207Z
Safety consultations for AI lab employees 2024-07-27T15:00:27.276Z
New page: Integrity 2024-07-10T15:00:41.050Z
Claude 3.5 Sonnet 2024-06-20T18:00:35.443Z
Anthropic's Certificate of Incorporation 2024-06-12T13:00:30.806Z
Companies' safety plans neglect risks from scheming AI 2024-06-03T15:00:20.236Z
AI companies' commitments 2024-05-29T11:00:31.339Z
Maybe Anthropic's Long-Term Benefit Trust is powerless 2024-05-27T13:00:47.991Z
AI companies aren't really using external evaluators 2024-05-24T16:01:21.184Z
New voluntary commitments (AI Seoul Summit) 2024-05-21T11:00:41.794Z
DeepMind's "Frontier Safety Framework" is weak and unambitious 2024-05-18T03:00:13.541Z
DeepMind: Frontier Safety Framework 2024-05-17T17:30:02.504Z
Ilya Sutskever and Jan Leike resign from OpenAI [updated] 2024-05-15T00:45:02.436Z
Questions for labs 2024-04-30T22:15:55.362Z
Introducing AI Lab Watch 2024-04-30T17:00:12.652Z
Staged release 2024-04-17T16:00:19.402Z
DeepMind: Evaluating Frontier Models for Dangerous Capabilities 2024-03-21T03:00:31.599Z
OpenAI: Preparedness framework 2023-12-18T18:30:10.153Z
Anthropic, Google, Microsoft & OpenAI announce Executive Director of the Frontier Model Forum & over $10 million for a new AI Safety Fund 2023-10-25T15:20:52.765Z
OpenAI-Microsoft partnership 2023-10-03T20:01:44.795Z
Current AI safety techniques? 2023-10-03T19:30:54.481Z
ARC Evals: Responsible Scaling Policies 2023-09-28T04:30:37.140Z
How to think about slowing AI 2023-09-17T16:00:42.150Z
Cruxes for overhang 2023-09-14T17:00:56.609Z
Cruxes on US lead for some domestic AI regulation 2023-09-10T18:00:06.959Z
Which paths to powerful AI should be boosted? 2023-08-23T16:00:00.790Z
Which possible AI systems are relatively safe? 2023-08-21T17:00:27.582Z
AI labs' requests for input 2023-08-18T17:00:26.377Z
Boxing 2023-08-02T23:38:36.119Z
Frontier Model Forum 2023-07-26T14:30:02.018Z
My favorite AI governance research this year so far 2023-07-23T16:30:00.558Z
Incident reporting for AI safety 2023-07-19T17:00:57.429Z
Frontier AI Regulation 2023-07-10T14:30:06.366Z
AI labs' statements on governance 2023-07-04T16:30:01.624Z

Comments

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2025-01-22T01:00:11.622Z · LW · GW

I wrote this for someone but maybe it's helpful for others

What labs should do:

What labs are doing:

  • Evals: it's complicated; OpenAI, DeepMind, and Anthropic seem close to doing good model evals for dangerous capabilities; see DC evals: labs' practices plus the links in the top two rows (associated blogpost + model cards)
  • RSPs: all existing RSPs are super weak and you shouldn't expect them to matter
  • Control: nothing is happening at the labs, except a little research at Anthropic and DeepMind
  • Security: nobody is prepared; nobody is trying to be prepared
  • Internal governance: you should basically model all of the companies as doing whatever leadership wants. In particular: (1) the OpenAI nonprofit is probably controlled by Sam Altman and will probably lose control soon and (2) possibly the Anthropic LTBT will matter but it doesn't seem to be working well.
  • Publishing safety research: DeepMind and Anthropic publish some good stuff but surprisingly little given how many safety researchers they employ; see List of AI safety papers from companies, 2023–2024

Resources:

Comment by Zach Stein-Perlman on Training on Documents About Reward Hacking Induces Reward Hacking · 2025-01-21T22:52:28.274Z · LW · GW

I think ideally we'd have several versions of a model. The default version would be ignorant about AI risk, AI safety and evaluation techniques, and maybe modern LLMs (in addition to misuse-y dangerous capabilities). When you need a model that's knowledgeable about that stuff, you use the knowledgeable version.

Related: https://docs.google.com/document/d/14M2lcN13R-FQVfvH55DHDuGlhVrnhyOo8O0YvO1dXXM/edit?tab=t.0#heading=h.21w31kpd1gl7

Somewhat related: https://www.alignmentforum.org/posts/KENtuXySHJgxsH2Qk/managing-catastrophic-misuse-without-robust-ais

Comment by Zach Stein-Perlman on Thoughts on the conservative assumptions in AI control · 2025-01-19T18:55:55.441Z · LW · GW

See The case for ensuring that powerful AIs are controlled.

Comment by Zach Stein-Perlman on Labs should be explicit about why they are building AGI · 2025-01-15T23:15:14.342Z · LW · GW

[Perfunctory review to get this post to the final phase]

Solid post. Still good. I think a responsible developer shouldn't unilaterally pause but I think it should talk about the crazy situation it's in, costs and benefits of various actions, what it would do in different worlds, and its views on risks. (And none of the labs have done this; in particular Core Views is not this.)

Comment by Zach Stein-Perlman on Reasons for and against working on technical AI safety at a frontier AI lab · 2025-01-06T05:30:50.537Z · LW · GW

One more consideration against (or an important part of "Bureaucracy"): sometimes your lab doesn't let you publish your research.

Comment by Zach Stein-Perlman on Anthropic's Certificate of Incorporation · 2025-01-04T17:05:46.148Z · LW · GW

Yep, the final phase-in date was in November 2024.

Comment by Zach Stein-Perlman on What’s the short timeline plan? · 2025-01-02T22:15:51.703Z · LW · GW

Some people have posted ideas on what a reasonable plan to reduce AI risk for such timelines might look like (e.g. Sam Bowman’s checklist, or Holden Karnofsky’s list in his 2022 nearcast), but I find them insufficient for the magnitude of the stakes (to be clear, I don’t think these example lists were intended to be an extensive plan).

See also A Plan for Technical AI Safety with Current Science (Greenblatt 2023) for a detailed (but rough, out-of-date, and very high-context) plan.

Comment by Zach Stein-Perlman on The Field of AI Alignment: A Postmortem, and What To Do About It · 2024-12-28T00:16:26.490Z · LW · GW

Yeah. I agree/concede that you can explain why you can't convince people that their own work is useless. But if you're positing that the flinchers flinch away from valid arguments about each category of useless work, that seems surprising.

Comment by Zach Stein-Perlman on The Field of AI Alignment: A Postmortem, and What To Do About It · 2024-12-28T00:02:17.183Z · LW · GW

I feel like John's view entails that he would be able to convince my friends that various-research-agendas-my-friends-like are doomed. (And I'm pretty sure that's false.) I assume John doesn't believe that, and I wonder why he doesn't think his view entails it.

Comment by Zach Stein-Perlman on The Field of AI Alignment: A Postmortem, and What To Do About It · 2024-12-27T17:06:01.554Z · LW · GW

I wonder whether John believes that well-liked research, e.g. Fabien's list, is actually not valuable, or consists of rare exceptions coming from a small subset of the "alignment research" field.

Comment by Zach Stein-Perlman on The Field of AI Alignment: A Postmortem, and What To Do About It · 2024-12-27T15:30:42.274Z · LW · GW

I do not.

On the contrary, I think ~all of the "alignment researchers" I know claim to be working on the big problem, and I think ~90% of them are indeed doing work that looks good in terms of the big problem. (Researchers I don't know are likely substantially worse but not a ton.)

In particular I think all of the alignment-orgs-I'm-socially-close-to do work that looks good in terms of the big problem: Redwood, METR, ARC. And I think the other well-known orgs are also good.

This doesn't feel odd: these people are smart and actually care about the big problem; if their work was in the "even if this succeeds it obviously wouldn't be helpful" category they'd want to know (and, given the "obviously," would figure that out).


Possibly the situation is very different in academia or MATS-land; for now I'm just talking about the people around me.

Comment by Zach Stein-Perlman on The Field of AI Alignment: A Postmortem, and What To Do About It · 2024-12-27T14:55:14.057Z · LW · GW

Yeah, I agree sometimes people decide to work on problems largely because they're tractable [edit: or because they're good for safely getting alignment research or other good work out of early AGIs]. I'm unconvinced by the flinching-away or dishonesty characterization.

Comment by Zach Stein-Perlman on The Field of AI Alignment: A Postmortem, and What To Do About It · 2024-12-27T02:06:51.260Z · LW · GW

This post starts from the observation that streetlighting has mostly won the memetic competition for alignment as a research field, and we'll mostly take that claim as given. Lots of people will disagree with that claim, and convincing them is not a goal of this post.

Yep. This post is not for me but I'll say a thing that annoyed me anyway:

... and Carol's thoughts run into a blank wall. In the first few seconds, she sees no toeholds, not even a starting point. And so she reflexively flinches away from that problem, and turns back to some easier problems.

Does this actually happen? (Even if you want to be maximally cynical, I claim presenting novel important difficulties (e.g. "sensor tampering") or giving novel arguments that problems are difficult is socially rewarded.)

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-12-26T18:50:06.466Z · LW · GW

DeepSeek-V3 is out today, with weights and a paper published. Tweet thread, GitHub, report (GitHub, HuggingFace). It's big and mixture-of-experts-y; discussion here and here.

It was super cheap to train — they say 2.8M H800-hours or $5.6M (!!).

It's powerful:

It's cheap to run:

Comment by Zach Stein-Perlman on DeepSeek beats o1-preview on math, ties on coding; will release weights · 2024-12-26T18:39:26.597Z · LW · GW

oops thanks

Comment by Zach Stein-Perlman on DeepSeek beats o1-preview on math, ties on coding; will release weights · 2024-12-26T18:30:25.290Z · LW · GW

Update: the weights and paper are out. Tweet thread, GitHub, report (GitHub, HuggingFace). It's big and mixture-of-experts-y; thread on notable stuff.

It was super cheap to train — they say 2.8M H800-hours or $5.6M.

It's powerful:

It's cheap to run:

Comment by Zach Stein-Perlman on Hire (or Become) a Thinking Assistant · 2024-12-24T20:05:47.136Z · LW · GW

Every now and then (~5-10 minutes, or when I look actively distracted), briefly check in (where if I'm in-the-zone, this might just be a brief "Are you focused on what you mean to be?" from them, and a nod or "yeah" from me).

Some other prompts I use when being a [high-effort body double / low-effort metacognitive assistant / rubber duck]:

  • What are you doing?
  • What's your goal?
    • Or: what's your goal for the next n minutes?
    • Or: what should be your goal?
  • Are you stuck?
    • Follow-ups if they're stuck:
      • what should you do?
      • can I help?
      • have you considered asking someone for help?
        • If I don't know who could help, this is more like prompting them to figure out who could help; if I know the manager/colleague/friend who they should ask, I might use that person's name
  • Maybe you should x
  • If someone else was in your position, what would you advise them to do?
Comment by Zach Stein-Perlman on Anthropic leadership conversation · 2024-12-22T15:30:15.826Z · LW · GW

All of the founders committed to donate 80% of their equity. I heard it's set aside in some way but they haven't donated anything yet. (Source: an Anthropic human.)

This fact wasn't on the internet, or rather at least wasn't easily findable via google search. Huh. I only find Holden mentioning 80% of Daniela's equity is pledged.

Comment by Zach Stein-Perlman on Mark Xu's Shortform · 2024-12-22T06:08:22.796Z · LW · GW

I disagree with Ben. I think the usage that Mark is talking about is a reference to Death with Dignity. A central example (written by me) is

it would be undignified if AI takes over because we didn't really try off-policy probes; maybe they just work really well; someone should figure that out

It's playful and unserious but "X would be undignified" roughly means "it would be an unfortunate error if we did X or let X happen" and is used in the context of AI doom and our ability to affect P(doom).

Comment by Zach Stein-Perlman on o3 · 2024-12-22T03:16:51.767Z · LW · GW

edit: wait likely it's RL; I'm confused

OpenAI didn't fine-tune on ARC-AGI, even though this graph suggests they did.

Sources:

Altman said

we didn't go do specific work [targeting ARC-AGI]; this is just the general effort.

François Chollet (in the blogpost with the graph) said

Note on "tuned": OpenAI shared they trained the o3 we tested on 75% of the Public Training set. They have not shared more details. We have not yet tested the ARC-untrained model to understand how much of the performance is due to ARC-AGI data.

and

The version of the model we tested was domain-adapted to ARC-AGI via the public training set (which is what the public training set is for). As far as I can tell they didn't generate synthetic ARC data to improve their score.

An OpenAI staff member replied

Correct, can confirm "targeting" exclusively means including a (subset of) the public training set.

and further confirmed that "tuned" in the graph is

a strange way of denoting that we included ARC training examples in the O3 training. It isn’t some finetuned version of O3 though. It is just O3.

Another OpenAI staff member said

also: the model we used for all of our o3 evals is fully general; a subset of the arc-agi public training set was a tiny fraction of the broader o3 train distribution, and we didn’t do any additional domain-specific fine-tuning on the final checkpoint

So on ARC-AGI they just pretrained on 300 examples (75% of the 400 in the public training set). Performance is surprisingly good.

[heavily edited after first posting]

Comment by Zach Stein-Perlman on o3 · 2024-12-21T21:11:41.392Z · LW · GW

Welcome!

To me the benchmark scores are interesting mostly because they suggest that o3 is substantially more powerful than previous models. I agree we can't naively translate benchmark scores to real-world capabilities.

Comment by Zach Stein-Perlman on Anthropic leadership conversation · 2024-12-21T01:59:44.055Z · LW · GW

I think he’s just referring to DC evals, and I think this is wrong because I think other companies doing evals wasn’t really caused by Anthropic (but I could be unaware of facts).

Edit: maybe I don't know what he's referring to.

Comment by Zach Stein-Perlman on Anthropic leadership conversation · 2024-12-21T01:58:01.739Z · LW · GW

I use empty brackets similar to ellipses in this context; they denote removed nonsubstantive text. (I use ellipses when removing substantive text.)

Comment by Zach Stein-Perlman on o3 · 2024-12-21T01:03:52.642Z · LW · GW

I think they only have formal high and low versions for o3-mini

Edit: nevermind idk

Comment by Zach Stein-Perlman on Anthropic leadership conversation · 2024-12-21T01:02:52.640Z · LW · GW

Done, thanks.

Comment by Zach Stein-Perlman on Anthropic leadership conversation · 2024-12-20T22:36:05.299Z · LW · GW

I already edited out most of the "like"s and similar. I intentionally left some in when they seemed like they might be hedging or otherwise communicating this isn't exact. You are free to post your own version but not to edit mine.

Edit: actually I did another pass and edited out several more; thanks for the nudge.

Comment by Zach Stein-Perlman on o3 · 2024-12-20T21:51:56.247Z · LW · GW

pass@n is not cheating if answers are easy to verify. E.g. if you can cheaply/quickly verify that code works, pass@n is fine for coding.
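To illustrate the point (a toy sketch with stand-in names; the sampler and verifier here are placeholders, not any real API), pass@n with a cheap verifier just means sampling until one candidate checks out:

```python
# Toy sketch of pass@n with a verifier: sample up to n candidate
# solutions and accept the first one that passes verification.
import itertools

def pass_at_n(sample_candidate, verify, n):
    """Return the first of n sampled candidates that verifies, else None."""
    for _ in range(n):
        candidate = sample_candidate()
        if verify(candidate):
            return candidate
    return None

# Example: candidates are successive integers; "verification" is a
# stand-in check (divisibility by 7) for running a real test suite.
counter = itertools.count(1)
result = pass_at_n(lambda: next(counter), lambda x: x % 7 == 0, n=10)
# result == 7
```

The key property is that correctness is certified by the verifier, not by the model, so sampling many candidates doesn't inflate the score.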

Comment by Zach Stein-Perlman on o3 · 2024-12-20T20:52:08.414Z · LW · GW

It was one submission, apparently.

Comment by Zach Stein-Perlman on o3 · 2024-12-20T20:51:25.796Z · LW · GW

and then having the model look at all the runs and try to figure out which run had the most compelling-seeming reasoning

The FrontierMath answers are numerical-ish ("problems have large numerical answers or complex mathematical objects as solutions"), so you can just check which answer the model wrote most frequently.
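The selection rule this implies (take whichever answer the model produced most often across runs) can be sketched as follows — a minimal illustration, not OpenAI's actual code:

```python
from collections import Counter

def majority_answer(answers):
    """Pick the most frequent answer across sampled runs (consensus@n)."""
    counts = Counter(answers)
    answer, _ = counts.most_common(1)[0]
    return answer

# Hypothetical final answers from five runs on one problem:
runs = ["1728", "1728", "1729", "1728", "42"]
print(majority_answer(runs))  # prints 1728
```

This only works when answers are short and exactly comparable (numbers, canonical expressions); for free-form reasoning you'd need the "most compelling-seeming reasoning" approach quoted above instead.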

Comment by Zach Stein-Perlman on o3 · 2024-12-20T20:37:30.495Z · LW · GW

The obvious boring guess is best of n. Maybe you're asserting that using $4,000 implies that they're doing more than that.

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-12-16T22:29:14.402Z · LW · GW

My guess is they do kinda choose: in training, it's less like two debaters are assigned opposing human-written positions and more like the debate is between two sampled outputs.

Edit: maybe this is different in procedures different from the one Rohin outlined.

Comment by Zach Stein-Perlman on [deleted post] 2024-12-16T21:32:31.794Z

The more important zoom-level is: debate is a proposed technique to provide a good training signal. See e.g. https://www.lesswrong.com/posts/eq2aJt8ZqMaGhBu3r/zach-stein-perlman-s-shortform?commentId=DLYDeiumQPWv4pdZ4.

Edit: debate is a technique for iterated amplification -- but that tag is terrible too, oh no

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-12-16T20:57:36.674Z · LW · GW

I asked Rohin Shah what the debate agenda is about and he said (off the cuff, not necessarily worded well) (context omitted) (emphasis mine):

Suppose you have any task where given an input x the AI has to produce some output y. (Can be question answering, arbitrary chatbot stuff, being an agent, whatever.)

Debate to me is a family of procedures for training AIs in such settings. One example such procedure is: sample two possible outputs y1 and y2, then have two AIs debate each other about which output is better (one assigned to y1 and the other assigned to y2, typically both are instances of the same underlying weights). The winner of the debate is determined by a judge. You then train the debaters with RL (+1 reward for the winner, -1 reward for the loser), and treat (y_winner, y_loser) as a preference pair for training on the original task (which you could then optimize using DPO or similar).

The hope is that the debaters are incentivized to simply say which output is better, to the best of their ability, and so this allows you to provide a reward signal that is as good as can be provided given the capabilities of the debater AI. (Ideally the debater AI is similarly capable as the AI doing the task; often you’ll want these to be the same AI system.)

More broadly the problem I aim to solve with debate (or amplified oversight more broadly) is “provide a reward signal that the AI cannot hack” (though ofc the AI may still misgeneralize)

The hope is that decreases p(misaligned) — especially p(reward hacking), but probably also p(scheming)

You could also use the resulting debaters at inference time to audit especially suspicious actions (run the debate again, ideally it points out why the given action is good / bad)
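One step of the example procedure Rohin describes might be sketched like this (all names are stand-ins; real debaters and the judge would be LLMs, and the rewards and preference pair would feed RL and DPO-style training, none of which is shown):

```python
def debate_step(x, sample_output, debate, judge):
    """One debate-based labeling step: sample two outputs, debate, judge.

    Returns RL rewards (+1 winner, -1 loser) for the two debaters and a
    (winner, loser) preference pair for training on the original task.
    """
    y1, y2 = sample_output(x), sample_output(x)
    transcript = debate(x, y1, y2)      # each debater argues its assigned output is better
    winner_idx = judge(x, transcript)   # 0 if y1 wins, 1 if y2 wins
    rewards = (1, -1) if winner_idx == 0 else (-1, 1)
    winner, loser = (y1, y2) if winner_idx == 0 else (y2, y1)
    return rewards, (winner, loser)

# Toy instantiation: outputs are strings, the "debate" is just the pair of
# outputs, and the toy "judge" prefers the longer one.
import itertools
outputs = itertools.cycle(["short answer", "a much more detailed answer"])
rewards, pair = debate_step(
    "question",
    lambda x: next(outputs),
    lambda x, y1, y2: (y1, y2),
    lambda x, t: 0 if len(t[0]) >= len(t[1]) else 1,
)
```

The structural point is that the judge's verdict produces both the debaters' RL signal and the preference pair for the task model, which is how debate "provides a reward signal" in this framing.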

Comment by Zach Stein-Perlman on ARC Evals: Responsible Scaling Policies · 2024-12-14T23:45:29.382Z · LW · GW

This post presented the idea of RSPs and detailed thoughts on them, just after Anthropic's RSP was published. It's since become clear that nobody knows how to write an RSP that's predictably neither way too aggressive nor super weak. But this post, along with the accompanying Key Components of an RSP, is still helpful, I think.

Comment by Zach Stein-Perlman on DeepMind: Model evaluation for extreme risks · 2024-12-14T23:45:11.627Z · LW · GW

This is the classic paper on model evals for dangerous capabilities.

On a skim, it's aged well; I still agree with its recommendations and framing of evals. One big exception: it recommends "alignment evaluations" to determine models' propensity for misalignment, but such evals can't really provide much evidence against catastrophic misalignment; better to assume AIs are misaligned and use control once dangerous capabilities appear, until much better misalignment-measuring techniques appear.

Comment by Zach Stein-Perlman on How much do you believe your results? · 2024-12-11T06:05:04.452Z · LW · GW

Interesting point, written up really really well. I don't think this post was practically useful for me but it's a good post regardless.

Comment by Zach Stein-Perlman on Thoughts on sharing information about language model capabilities · 2024-12-11T05:45:47.954Z · LW · GW

This post helped me distinguish capabilities-y information that's bad to share from capabilities-y information that's fine/good to share. (Base-model training techniques are bad; evals and eval results are good; scaffolding/prompting/posttraining techniques to elicit more powerful capabilities without more spooky black-box cognition are fine/good.)

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-12-11T05:25:35.362Z · LW · GW

To avoid deploying a dangerous model, you can either (1) test the model pre-deployment or (2) test a similar older model with tests that have a safety buffer such that if the old model is below some conservative threshold it's very unlikely that the new model is dangerous.

DeepMind says it uses the safety-buffer plan (but it hasn't yet said it has operationalized thresholds/buffers).

Anthropic's original RSP used the safety-buffer plan; its new RSP doesn't really use either plan (kinda safety-buffer but it's very weak). (This is unfortunate.)

OpenAI seemed to use the test-the-actual-model plan.[1] This isn't going well. The 4o evals were rushed because OpenAI (reasonably) didn't want to delay deployment. Then the o1 evals were done on a weak o1 checkpoint rather than the final model, presumably so they wouldn't be rushed, but this presumably hurt performance a lot on some tasks (and indeed the o1 checkpoint performed worse than o1-preview on some capability evals). OpenAI doesn't seem to be implementing the safety-buffer plan, so if a model is dangerous but not super obviously dangerous, it seems likely OpenAI wouldn't notice before deployment....

(Yay OpenAI for honestly publishing eval results that don't look good.)

  1. ^

    It's not explicit. The PF says e.g. 'Only models with a post-mitigation score of "medium" or below can be deployed.' But it also mentions forecasting capabilities.

Comment by Zach Stein-Perlman on Untrusted smart models and trusted dumb models · 2024-12-10T06:30:21.505Z · LW · GW

This early control post introduced super important ideas: trusted monitoring plus the general point

if you think about approaches to safely using AIs that are robust to deceptive alignment and which can be used right now, a core dynamic is trying to use your dumb trusted models together with your limited access to trusted humans to make it hard for your untrusted smart models to cause problems.

Comment by Zach Stein-Perlman on The current state of RSPs · 2024-12-09T00:03:41.117Z · LW · GW

Briefly:

  1. For OpenAI, I claim the cyber, CBRN, and persuasion Critical thresholds are very high (and also the cyber High threshold). I agree the autonomy Critical threshold doesn't feel so high.
  2. For Anthropic, most of the action is at ASL-4+, and they haven't even defined the ASL-4 standard yet. (So you can think of the current ASL-4 thresholds as infinitely high. I don't think "The thresholds are very high" for OpenAI was meant to imply a comparison to Anthropic; it's hard to compare since ASL-4 doesn't exist. Sorry for confusion.)
Comment by Zach Stein-Perlman on Habryka's Shortform Feed · 2024-12-08T03:23:43.626Z · LW · GW

Edit 2: after checking, I now believe the data strongly suggest FTX had a large negative effect on EA community metrics. (I still agree with Buck: "I don't like the fact that this essay is a mix of an insightful generic argument and a contentious specific empirical claim that I don't think you support strongly; it feels like the rhetorical strength of the former lends credence to the latter in a way that isn't very truth-tracking." And I disagree with habryka's claims that the effect of FTX is obvious.)


practically all metrics of the EA community's health and growth have sharply declined, and the extremely large and negative reputational effects have become clear.

I want more evidence on your claim that FTX had a major effect on EA reputation. Or: why do you believe it?


Edit: relevant thing habryka said that I didn't quote above:

For the EA surveys, these indicators looked very bleak: 

"Results demonstrated that FTX had decreased satisfaction by 0.5-1 points on a 10-point scale within the EA community"

"Among those aware of EA, attitudes remain positive and actually maybe increased post-FTX —though they were lower (d = -1.5, with large uncertainty) among those who were additionally aware of FTX."

"Most respondents reported continuing to trust EA organizations, though over 30% said they had substantially lost trust in EA public figures or leadership."

Comment by Zach Stein-Perlman on Model evals for dangerous capabilities · 2024-11-21T22:02:00.528Z · LW · GW

Another possible best practice for evals: use human+AI rather than [edit: just] AI alone. Many threat models involve human+AI and sometimes human+AI is substantially stronger than human alone and AI alone.

Comment by Zach Stein-Perlman on DeepSeek beats o1-preview on math, ties on coding; will release weights · 2024-11-21T02:40:10.847Z · LW · GW

Claim by SemiAnalysis guy: DeepSeek has over 50,000 H100s. (And presumably they don't spend much compute on inference.)

Comment by Zach Stein-Perlman on DeepSeek beats o1-preview on math, ties on coding; will release weights · 2024-11-21T02:40:02.374Z · LW · GW

Some context: interview translation; June FT article on DeepSeek; parent hedge fund Wikipedia page.

Comment by Zach Stein-Perlman on Why imperfect adversarial robustness doesn't doom AI control · 2024-11-18T22:35:05.522Z · LW · GW

I want to distinguish two versions of the intervention (in multi-step settings): (1) hide CoT on the suspicious step vs (2) hide all prior CoT (and hide all CoT partway through 5% or something of non-suspicious trajectories so the model can't infer from CoT-hidden that the trajectory was flagged, or do a control eval with results that demonstrate this is unnecessary).

I'm over 75% that a lab would do (1), conditional on doing a "replace suspicious outputs" thing (but under 75% they'd say they're doing so, much less demonstrate it to a skeptical audience; they might just not publish details, especially if this protocol is just in the internal deployment context). I'm under 75% that a lab would do (2).

Comment by Zach Stein-Perlman on Why imperfect adversarial robustness doesn't doom AI control · 2024-11-18T22:02:27.805Z · LW · GW

It would be crazy to show the untrusted model the old CoT when resampling.

Comment by Zach Stein-Perlman on [deleted post] 2024-11-18T05:02:55.099Z

Open Philanthropy's AI safety work tends toward large grants in the high hundreds of thousands or low millions, meaning individuals and organizations with lower funding needs won't be funded by them

This is directionally true but projects seeking less money should still apply to OP if relevant; a substantial minority of recent OP AI safety and GCR capacity building grants are <$100K.

Comment by Zach Stein-Perlman on Buck's Shortform · 2024-11-14T18:50:12.813Z · LW · GW

What's a bad thing that the agent can do with the jailbreak that it can't just do in the unmonitored scaffold?

Edit: dumb question, oops: obviously the schemer will have affordances that it doesn't have within the unmonitored scaffold / there are actions where it's helpful to jailbreak the monitor. E.g. submitting a pull request, where the code will be read by the monitor. So finding a jailbreak is helpful to the schemer.

Comment by Zach Stein-Perlman on Buck's Shortform · 2024-11-14T18:24:46.534Z · LW · GW

Why can't the schemer just do the bad stuff in the unmonitored scaffold from (1)?

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-11-09T04:37:19.378Z · LW · GW

We can make scheming legible at least to sophisticated-scheming-skeptics like Sam Altman and Dario.

Especially if the AI is even sandbagging on simply-coding when it thinks it's for safety research. And if it's not doing that, we can get some useful safety work out of it.

@Adam Kaufman @Tyler Tracy @David Matolcsi see Ryan's comments.