New voluntary commitments (AI Seoul Summit) 2024-05-21T11:00:41.794Z
DeepMind's "Frontier Safety Framework" is weak and unambitious 2024-05-18T03:00:13.541Z
DeepMind: Frontier Safety Framework 2024-05-17T17:30:02.504Z
Ilya Sutskever and Jan Leike resign from OpenAI [updated] 2024-05-15T00:45:02.436Z
Questions for labs 2024-04-30T22:15:55.362Z
Introducing AI Lab Watch 2024-04-30T17:00:12.652Z
Staged release 2024-04-17T16:00:19.402Z
DeepMind: Evaluating Frontier Models for Dangerous Capabilities 2024-03-21T03:00:31.599Z
OpenAI: Preparedness framework 2023-12-18T18:30:10.153Z
Anthropic, Google, Microsoft & OpenAI announce Executive Director of the Frontier Model Forum & over $10 million for a new AI Safety Fund 2023-10-25T15:20:52.765Z
OpenAI-Microsoft partnership 2023-10-03T20:01:44.795Z
Current AI safety techniques? 2023-10-03T19:30:54.481Z
ARC Evals: Responsible Scaling Policies 2023-09-28T04:30:37.140Z
How to think about slowing AI 2023-09-17T16:00:42.150Z
Cruxes for overhang 2023-09-14T17:00:56.609Z
Cruxes on US lead for some domestic AI regulation 2023-09-10T18:00:06.959Z
Which paths to powerful AI should be boosted? 2023-08-23T16:00:00.790Z
Which possible AI systems are relatively safe? 2023-08-21T17:00:27.582Z
AI labs' requests for input 2023-08-18T17:00:26.377Z
Boxing 2023-08-02T23:38:36.119Z
Frontier Model Forum 2023-07-26T14:30:02.018Z
My favorite AI governance research this year so far 2023-07-23T16:30:00.558Z
Incident reporting for AI safety 2023-07-19T17:00:57.429Z
Frontier AI Regulation 2023-07-10T14:30:06.366Z
AI labs' statements on governance 2023-07-04T16:30:01.624Z
DeepMind: Model evaluation for extreme risks 2023-05-25T03:00:00.915Z
GovAI: Towards best practices in AGI safety and governance: A survey of expert opinion 2023-05-15T01:42:41.012Z
Stopping dangerous AI: Ideal US behavior 2023-05-09T21:00:55.187Z
Stopping dangerous AI: Ideal lab behavior 2023-05-09T21:00:19.505Z
Slowing AI: Crunch time 2023-05-03T15:00:12.495Z
Ideas for AI labs: Reading list 2023-04-24T19:00:00.832Z
Slowing AI: Interventions 2023-04-18T14:30:35.746Z
AI policy ideas: Reading list 2023-04-17T19:00:00.604Z
Slowing AI: Foundations 2023-04-17T14:30:09.427Z
Slowing AI: Reading list 2023-04-17T14:30:02.467Z
FLI report: Policymaking in the Pause 2023-04-15T17:01:06.727Z
FLI open letter: Pause giant AI experiments 2023-03-29T04:04:23.333Z
Operationalizing timelines 2023-03-10T16:30:01.654Z
Taboo "compute overhang" 2023-03-01T19:15:02.515Z
The public supports regulating AI for safety 2023-02-17T04:10:03.307Z
Framing AI strategy 2023-02-07T19:20:04.535Z
AI safety milestones? 2023-01-23T21:00:24.441Z
Sealed predictions thread 2022-05-07T18:00:04.705Z
Rationalism for New EAs 2021-10-18T16:00:18.692Z
Great Power Conflict 2021-09-17T15:00:17.039Z
Zach Stein-Perlman's Shortform 2021-08-29T18:00:56.148Z
The Governance Problem and the "Pretty Good" X-Risk 2021-08-29T18:00:28.190Z


Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-05-23T04:51:28.112Z · LW · GW

Now OpenAI publicly said "we're releasing former employees from existing nondisparagement obligations unless the nondisparagement provision was mutual." This seems to be self-executing; by saying it, OpenAI made it true.


Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-05-22T23:34:02.833Z · LW · GW

We know various people who've left OpenAI and might criticize it if they could. Either most of them will soon say they're free or we can infer that OpenAI was lying/misleading.

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-05-22T23:00:07.719Z · LW · GW

New Kelsey Piper article and twitter thread on OpenAI equity & non-disparagement.

It has lots of little things that make OpenAI look bad. It further confirms that OpenAI threatened to revoke equity unless employees signed the non-disparagement agreements. Plus it shows Altman's signature on documents giving the company broad power over employees' equity — perhaps he doesn't read every document he signs, but this one seems quite important. This is all in tension with Altman's recent tweet that "vested equity is vested equity, full stop" and "i did not know this was happening." Plus "we have never clawed back anyone's vested equity, nor will we do that if people do not sign a separation agreement (or don't agree to a non-disparagement agreement)" is misleading given that they apparently regularly threatened to do so (or something equivalent — letting the employee nominally keep their PPUs but disallowing them from selling them) whenever an employee left.

Great news:

OpenAI told me that “we are identifying and reaching out to former employees who signed a standard exit agreement to make it clear that OpenAI has not and will not cancel their vested equity and releases them from nondisparagement obligations”

(Unless "employees who signed a standard exit agreement" is doing a lot of work — maybe a substantial number of employees technically signed nonstandard agreements.)

I hope to soon hear from various people that they have been freed from their nondisparagement obligations.

Update: OpenAI says:

As we shared with employees today, we are making important updates to our departure process. We have not and never will take away vested equity, even when people didn't sign the departure documents. We're removing nondisparagement clauses from our standard departure paperwork, and we're releasing former employees from existing nondisparagement obligations unless the nondisparagement provision was mutual. We'll communicate this message to former employees. We're incredibly sorry that we're only changing this language now; it doesn't reflect our values or the company we want to be.

[Low-effort post; might have missed something important.]

[Substantively edited after posting.]

Comment by Zach Stein-Perlman on jacquesthibs's Shortform · 2024-05-21T20:54:27.742Z · LW · GW

50% I'll do this in the next two months if nobody else does. But not right now, and someone else should do it too.

Off the top of my head (this is not the list you asked for, just an outline):

  • Loopt stuff
  • YC stuff
  • YC removal
  • NDAs
    • And deceptive communication recently
    • And maybe OpenAI's general culture of don't publicly criticize OpenAI
  • Profit cap non-transparency
  • Superalignment compute
  • Two exoduses of safety people; negative stuff people-who-quit-OpenAI sometimes say
  • Telling board members not to talk to employees
  • Board crisis stuff
    • OpenAI executives telling the board Altman lies
    • The board saying Altman lies
    • Lying about why he wanted to remove Toner
    • Lying to try to remove Toner
    • Returning
    • Inadequate investigation + spinning results

Stuff not worth including:

  • Reddit stuff - unconfirmed
  • Financial conflict-of-interest stuff - murky and not super important
  • Misc instances of saying-what's-convenient (e.g. OpenAI should scale because of the prospect of compute overhang and the $7T chip investment thing) - idk, maybe, also interested in more examples
  • Johansson & Sky - not obvious that OpenAI did something bad, but it would be nice for OpenAI to say "we had plans for a Johansson voice and we dropped that when Johansson said no," but if that was true they'd have said it...

What am I missing? Am I including anything misleading or not-worth-it?

Comment by Zach Stein-Perlman on New voluntary commitments (AI Seoul Summit) · 2024-05-21T15:29:07.712Z · LW · GW

Quoting me from last time you said this:

The label "RSP" isn't perfect but it's kinda established now. My friends all call things like this "RSPs." . . . I predict change in terminology will happen ~iff it's attempted by METR or multiple frontier labs together. For now, I claim we should debate terminology occasionally but follow standard usage when trying to actually communicate.

Comment by Zach Stein-Perlman on Questions for labs · 2024-05-21T06:58:10.824Z · LW · GW

Maybe setting up custom fine-tuning is hard and labs often only set it up during deployment...

(Separately, it would be nice if OpenAI and Anthropic let some safety researchers do fine-tuning now.)

Comment by Zach Stein-Perlman on Anthropic: Reflections on our Responsible Scaling Policy · 2024-05-21T05:37:34.750Z · LW · GW

I think Frontier Red Team is about eliciting model capabilities and Alignment Stress Testing is about "red-team[ing] Anthropic’s alignment techniques and evaluations, empirically demonstrating ways in which Anthropic’s alignment strategies could fail."

Comment by Zach Stein-Perlman on Introducing AI Lab Watch · 2024-05-21T03:44:46.838Z · LW · GW


Any takes on what info a company could publish to demonstrate "the adequacy of its safety culture and governance"? (Or recommended reading?)

Ideally criteria are objectively evaluable / minimize illegible judgment calls.

Comment by Zach Stein-Perlman on DeepMind's "Frontier Safety Framework" is weak and unambitious · 2024-05-20T17:47:02.162Z · LW · GW


Deployment mitigations level 2 discusses the need for mitigations on internal deployments.

Good point; this makes it clearer that "deployment" means external deployment by default. But level 2 only mentions "internal access of the critical capability," which sounds like it's about misuse — I'm more worried about AI scheming and escaping when the lab uses AIs internally to do AI development.

ML R&D will require thinking about internal deployments (and so will many of the other CCLs).

OK. I hope DeepMind does that thinking and makes appropriate commitments.

two-party control

Thanks. I'm pretty ignorant on this topic.

"every 3 months of fine-tuning progress" was meant to capture [during deployment] as well


Comment by Zach Stein-Perlman on Anthropic: Reflections on our Responsible Scaling Policy · 2024-05-20T06:18:26.876Z · LW · GW


I'm glad to see that the non-compliance reporting policy has been implemented and includes anonymous reporting. I'm still hoping to see more details. (And I'm generally confused about why Anthropic doesn't share more details on policies like this — I fail to imagine a story about how sharing details could be bad, except that the details would be seen as weak and this would make Anthropic look bad.)

What details are you imagining would be helpful for you? Sharing the PDF of the formal policy document doesn't mean much compared to whether it's actually implemented and upheld and treated as a live option that we expect staff to consider (fwiw: it is, and I don't have a non-disparage agreement). On the other hand, sharing internal docs eats a bunch of time in reviewing it before release, chance that someone seizes on a misinterpretation and leaps to conclusions, and other costs.

Not sure. I can generally imagine a company publishing what Anthropic has published but having a weak/fake system in reality. Policy details do seem less important for non-compliance reporting than some other policies — Anthropic says it has an infohazard review policy, and I expect it's good, but I'm not confident, and for other companies I wouldn't necessarily expect that their policy is good (even if they say a formal policy exists), and seeing details (with sensitive bits redacted) would help.

I mostly take back my "secret policy is strong evidence of bad policy" insinuation — that's ~true on my home planet, but on Earth you don't get sufficient credit for sharing good policies and there's substantial negative EV from misunderstandings and adversarial interpretations, so I guess it's often correct not to share :(

As an 80/20 of publishing, maybe you could share a policy with an external auditor who would then publish whether they think it's good or have concerns. I would feel better if that happened all the time.

Comment by Zach Stein-Perlman on Anthropic: Reflections on our Responsible Scaling Policy · 2024-05-20T05:34:32.920Z · LW · GW

I think this is implicit — the RSP discusses deployment mitigations, which can't be enforced if the weights are shared.

Comment by Zach Stein-Perlman on Anthropic: Reflections on our Responsible Scaling Policy · 2024-05-20T05:32:01.972Z · LW · GW

No major news here, but some minor good news, and independent of news/commitments/achievements I'm always glad when labs share thoughts like this. Misc reactions below.

Probably the biggest news is the Claude 3 evals report. I haven't read it yet. But at a glance I'm confused: it sounds like "red line" means ASL-3 but they also operationalize "yellow line" evals and those sound like the previously-discussed ASL-3 evals. Maybe red is actual ASL-3 and yellow is supposed to be at least 6x effective compute lower, as a safety buffer.

"Assurance Mechanisms . . . . should ensure that . . . our safety and security mitigations are validated publicly or by disinterested experts." This sounds great. I'm not sure what it looks like in practice. I wish it was clearer what assurance mechanisms Anthropic expects or commits to implement and when, and especially whether they're currently doing anything along the lines of "validated publicly or by disinterested experts." (Also whether "validated" means "determined to be sufficient if implemented well" or "determined to be implemented well.")

Something that was ambiguous in the RSP and is still ambiguous here: during training, if Anthropic reaches "3 months since last eval" before "4x since last eval," do they do evals? Or does the "3 months" condition only apply after training?

I'm glad to see that the non-compliance reporting policy has been implemented and includes anonymous reporting. I'm still hoping to see more details. (And I'm generally confused about why Anthropic doesn't share more details on policies like this — I fail to imagine a story about how sharing details could be bad, except that the details would be seen as weak and this would make Anthropic look bad.)

Some other hopes for the RSP, off the top of my head:

  • ASL-4 definition + operationalization + mitigations, including generally how Anthropic will think about safety cases after the "no dangerous capabilities" safety case doesn't work anymore
  • Clarifying security commitments (when the RAND report on securing model weights comes out)
  • Dangerous capability evals by external auditors, e.g. METR

Comment by Zach Stein-Perlman on Ilya Sutskever and Jan Leike resign from OpenAI [updated] · 2024-05-18T21:30:38.525Z · LW · GW

Edit: nevermind; maybe this tweet is misleading and narrow and just about restoring people's vested equity; I'm not sure what that means in the context of OpenAI's pseudo-equity but possibly this tweet isn't a big commitment.

@gwern I'm interested in your take on this new Altman tweet:

we have never clawed back anyone's vested equity, nor will we do that if people do not sign a separation agreement (or don't agree to a non-disparagement agreement). vested equity is vested equity, full stop.

there was a provision about potential equity cancellation in our previous exit docs; although we never clawed anything back, it should never have been something we had in any documents or communication. this is on me and one of the few times i've been genuinely embarrassed running openai; i did not know this was happening and i should have.

the team was already in the process of fixing the standard exit paperwork over the past month or so. if any former employee who signed one of those old agreements is worried about it, they can contact me and we'll fix that too. very sorry about this.

In particular "i did not know this was happening"

Comment by Zach Stein-Perlman on DeepMind's "Frontier Safety Framework" is weak and unambitious · 2024-05-18T17:14:32.790Z · LW · GW

Sorry for brevity.

We just disagree. E.g. you "walked away with a much better understanding of how OpenAI plans to evaluate & handle risks than how Anthropic plans to handle & evaluate risks"; I felt like Anthropic was thinking about most stuff better.

I think Anthropic's ASL-3 is reasonable and OpenAI's thresholds and corresponding commitments are unreasonable. If the ASL-4 threshold were high, or the commitments poor enough that ASL-4 was meaningless, I agree Anthropic's RSP would be at least as bad as OpenAI's.

One thing I think is a big deal: Anthropic's RSP treats internal deployment like external deployment; OpenAI's has almost no protections for internal deployment.

I agree "an initial RSP that mostly spells out high-level reasoning, makes few hard commitments, and focuses on misuse while missing the all-important evals and safety practices for ASL-4" is also a fine characterization of Anthropic's current RSP.

Quick edit: PF thresholds are too high; PF seems doomed / not on track. But RSPv1 is consistent with RSPv1.1 being great. At least Anthropic knows and says there’s a big hole. That's not super relevant to evaluating labs' current commitments but is very relevant to predicting.

Comment by Zach Stein-Perlman on Akash's Shortform · 2024-05-18T17:00:08.975Z · LW · GW

Sorry for brevity, I'm busy right now.

  1. Noticing good stuff labs do, not just criticizing them, is often helpful. I wish you thought of this work more as "evaluation" than "criticism."
  2. It's often important for evaluation to be quite truth-tracking. Criticism isn't obviously good by default.
  3. I'm pretty sure OP likes good criticism of the labs; no comment on how OP is perceived. And I think I don't understand your "good judgment" point. Feedback I've gotten on AI Lab Watch from senior AI safety people has been overwhelmingly positive, and of course there's a selection effect in what I hear, but I'm quite sure most of them support such efforts.
  4. Conjecture (not exclusively) has done things that frustrated me, including in dimensions like being "'unilateralist,' 'not serious,' and 'untrustworthy.'" I think most criticism of Conjecture-related advocacy is legitimate and not just because people are opposed to criticizing labs.
  5. I do agree on "soft power" and some of "jobs." People often don't criticize the labs publicly because they're worried about negative effects on them, their org, or people associated with them.

Comment by Zach Stein-Perlman on DeepMind's "Frontier Safety Framework" is weak and unambitious · 2024-05-18T05:35:59.693Z · LW · GW


Two weeks ago I sent a senior DeepMind staff member some "Advice on RSPs, especially for avoiding ambiguities"; #1 on my list was "Clarify how your deployment commitments relate to internal deployment, not just external deployment" (since it's easy and the OpenAI PF also did a bad job of this)


Edit: Rohin says "deployment" means external deployment by default and notes that the doc mentions "internal access" as distinct from deployment.

Comment by Zach Stein-Perlman on Ilya Sutskever and Jan Leike resign from OpenAI [updated] · 2024-05-17T17:19:58.177Z · LW · GW

Added updates to the post. Updating it as stuff happens. Not paying much attention; feel free to DM me or comment with more stuff.

Comment by Zach Stein-Perlman on Ilya Sutskever and Jan Leike resign from OpenAI [updated] · 2024-05-15T07:00:45.984Z · LW · GW

The commitment—"20% of the compute we've secured to date" (in July 2023), to be used "over the next four years"—may be quite little in 2027, with compute use increasing exponentially. I'm confused about why people think it's a big commitment.
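A rough back-of-the-envelope sketch of why the commitment shrinks in relative terms (all numbers here are made up for illustration — the growth rate and normalization are assumptions, not OpenAI figures):

```python
# Hypothetical illustration: if secured compute grows ~3x per year, then
# "20% of compute secured as of mid-2023, spent over four years" is a small
# share of the compute secured by 2027.
GROWTH = 3.0           # assumed annual growth factor in secured compute
YEARS = 4              # the commitment window (mid-2023 to mid-2027)
c0 = 1.0               # compute secured as of July 2023 (normalized to 1)
committed = 0.20 * c0  # the superalignment commitment

secured_2027 = c0 * GROWTH ** YEARS       # stock secured by 2027 (3^4 = 81x)
share = committed / secured_2027
print(f"{share:.1%}")  # ~0.2% of 2027-era secured compute
```

Under these assumed numbers the commitment is well under 1% of 2027-era compute, which is the point of the complaint above; a slower growth assumption softens but doesn't reverse the conclusion.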

Comment by Zach Stein-Perlman on Ilya Sutskever and Jan Leike resign from OpenAI [updated] · 2024-05-15T05:40:18.108Z · LW · GW

Two other executives left two weeks ago, but that's not obviously safety-related.

Comment by Zach Stein-Perlman on OpenAI releases GPT-4o, natively interfacing with text, voice and vision · 2024-05-14T05:50:49.955Z · LW · GW

Full quote:

We’ve evaluated GPT-4o according to our Preparedness Framework and in line with our voluntary commitments. Our evaluations of cybersecurity, CBRN, persuasion, and model autonomy show that GPT-4o does not score above Medium risk in any of these categories. This assessment involved running a suite of automated and human evaluations throughout the model training process. We tested both pre-safety-mitigation and post-safety-mitigation versions of the model, using custom fine-tuning and prompts, to better elicit model capabilities.

GPT-4o has also undergone extensive external red teaming with 70+ external experts in domains such as social psychology, bias and fairness, and misinformation to identify risks that are introduced or amplified by the newly added modalities. We used these learnings to build out our safety interventions in order to improve the safety of interacting with GPT-4o. We will continue to mitigate new risks as they’re discovered.

[Edit after Simeon replied: I disagree with your interpretation that they're being intentionally very deceptive. But I am annoyed by (1) them saying "We’ve evaluated GPT-4o according to our Preparedness Framework" when the PF doesn't contain specific evals and (2) them taking credit for implementing their PF when they're not meeting its commitments.]

Comment by Zach Stein-Perlman on Zach Stein-Perlman's Shortform · 2024-05-13T23:10:01.719Z · LW · GW

How can you make the case that a model is safe to deploy? For now, you can do risk assessment and notice that it doesn't have dangerous capabilities. What about in the future, when models do have dangerous capabilities? Here are four options:

  1. Implement safety measures as a function of risk assessment results, such that the measures feel like they should be sufficient to abate the risks
    1. This is mostly what Anthropic's RSP does (at least so far — maybe it'll change when they define ASL-4)
  2. Use risk assessment techniques that evaluate safety given deployment safety practices
    1. This is mostly what OpenAI's PF is supposed to do (measure "post-mitigation risk"), but the details of their evaluations and mitigations are very unclear
  3. Do control evaluations
  4. Achieve alignment (and get strong evidence of that)

Related: RSPs, safety cases.

Maybe lots of risk comes from the lab using AIs internally to do AI development. The first two options are fine for preventing catastrophic misuse from external deployment but I worry they struggle to measure risks related to scheming and internal deployment.

Comment by Zach Stein-Perlman on OpenAI releases GPT-4o, natively interfacing with text, voice and vision · 2024-05-13T19:41:47.404Z · LW · GW

Safety-wise, they claim to have run it through their Preparedness framework and the red-team of external experts.

I'm disappointed and I think they shouldn't get much credit PF-wise: they haven't published their evals, published a report on results, or even published a high-level "scorecard." They are not yet meeting the commitments in their beta Preparedness Framework — some stuff is unclear but at the least publishing the scorecard is an explicit commitment.

(It's now been six months since they published the beta PF!)

[Edit: not to say that we should feel much better if OpenAI was successfully implementing its PF -- the thresholds are way too high and it says nothing about internal deployment.]

Comment by Zach Stein-Perlman on Introducing AI Lab Watch · 2024-05-12T07:47:10.474Z · LW · GW

There should be points for how the organizations act wrt legislation. In the SB 1047 bill that CAIS co-sponsored, we've noticed some AI companies to be much more antagonistic than others. I think [this] is probably a larger differentiator for an organization's goodness or badness.

If there's a good writeup on labs' policy advocacy I'll link to and maybe defer to it.

Comment by Zach Stein-Perlman on RobertM's Shortform · 2024-05-12T02:34:09.147Z · LW · GW

Adding to the confusion: I've nonpublicly heard from people at UK AISI and [OpenAI or Anthropic] that the Politico piece is very wrong and DeepMind isn't the only lab doing pre-deployment sharing (and that it's hard to say more because info about not-yet-deployed models is secret). But no clarification on commitments.

Comment by Zach Stein-Perlman on simeon_c's Shortform · 2024-05-11T05:41:40.170Z · LW · GW

But everyone has lots of duties to keep secrets or preserve privacy and the ones put in writing often aren't the most important. (E.g. in your case.)

I've signed ~3 NDAs. Most of them are irrelevant now and useless for people to know about, like yours.

I agree in special cases it would be good to flag such things — like agreements to not share your opinions on a person/org/topic, rather than just keeping trade secrets private.

Comment by Zach Stein-Perlman on Introducing AI Lab Watch · 2024-05-11T05:35:08.409Z · LW · GW

Related: maybe a lab should get full points for a risky release if the lab says it's releasing because the benefits of [informing / scaring / waking-up] people outweigh the direct risk of existential catastrophe and other downsides. It's conceivable that a perfectly responsible lab would do such a thing.

Capturing all nuances can trade off against simplicity and legibility. (But my criteria are not yet on the efficient frontier or whatever.)

Comment by Zach Stein-Perlman on Introducing AI Lab Watch · 2024-05-10T04:52:22.383Z · LW · GW

Thanks. I agree you're pointing at something flawed in the current version and generally thorny. Strong-upvoted and strong-agreevoted.

Generally, the deployment criteria should be gated behind "has a plan to do this when models are actually powerful and their implementation of the plan is credible".

I didn't put much effort into clarifying this kind of thing because it's currently moot—I don't think it would change any lab's score—but I agree.[1] I think e.g. a criterion "use KYC" should technically be replaced with "use KYC OR say/demonstrate that you're prepared to implement KYC and have some capability/risk threshold to implement it and [that threshold isn't too high]."

Don't pass cost benefit for current models which pose low risk. (And it seems the criteria is "do you have them implemented right now?") . . . .

(A general problem with this project is somewhat arbitrarily requiring specific countermeasures. I think this is probably intrinsic to the approach I'm afraid.)

Yeah. The criteria can be like "implement them or demonstrate that you could implement them and have a good plan to do so," but it would sometimes be reasonable for the lab to not have done this yet. (Especially for non-frontier labs; the deployment criteria mostly don't work well for evaluating non-frontier labs. Also if demonstrating that you could implement something is difficult, even if you could implement it.)

I get the sense that this criteria doesn't quite handle the necessary edge cases to handle reasonable choices orgs might make.

I'm interested in suggestions :shrug:

  1. ^

    And I think my site says some things that contradict this principle, like 'these criteria require keeping weights private.' Oops.

Comment by Zach Stein-Perlman on Introducing AI Lab Watch · 2024-05-09T02:30:02.532Z · LW · GW

Two noncentral pages I like on the site:

Comment by Zach Stein-Perlman on Questions for labs · 2024-05-08T23:00:29.898Z · LW · GW

Yay @Zac Hatfield-Dodds of Anthropic for feedback and corrections including clarifying a couple of Anthropic's policies. Two pieces of not-previously-public information:

  • I was disappointed that Anthropic's Responsible Scaling Policy only mentions evaluation "During model training and fine-tuning." Zac told me "this was a simple drafting error - our every-three months evaluation commitment is intended to continue during deployment. This has been clarified for the next version, and we've been acting accordingly all along." Yay.
  • I said labs should have a "process for staff to escalate concerns about safety" and "have a process for staff and external stakeholders to share concerns about risk assessment policies or their implementation with the board and some other staff, including anonymously." I noted that Anthropic's RSP includes a commitment to "Implement a non-compliance reporting policy." Zac told me "Beyond standard internal communications channels, our recently formalized non-compliance reporting policy meets these criteria [including independence], and will be described in the forthcoming RSP v1.1." Yay.

I think it's cool that Zac replied (but most of my questions for Anthropic remain).

I have not yet received substantive corrections/clarifications from any other labs.

(I have made some updates based on Zac's feedback—and revised Anthropic's score from 45 to 48—but have not resolved all of it.)

Comment by Zach Stein-Perlman on How do open AI models affect incentive to race? · 2024-05-07T02:54:45.061Z · LW · GW

I mostly agree. And I think when people say race dynamics they often actually mean speed of progress and especially "Effects of open models on ease of training closed models [and open models]," which you mention.

But here is a race-dynamics story:

Alice has the best open model. She prefers for AI progress to slow down but also prefers to have the best open model (for reasons of prestige or, if different companies' models are not interchangeable, future market share). Bob releases a great open model. This incentivizes Alice to release a new state-of-the-art model sooner.

Comment by Zach Stein-Perlman on Introducing AI Lab Watch · 2024-05-05T21:14:24.133Z · LW · GW

Yep, lots of people independently complain about "lab." Some of those people want me to use scary words in other places too, like replacing "diffusion" with "proliferation." I wouldn't do that, and don't replace "lab" with "mega-corp" or "juggernaut," because it seems [incorrect / misleading / low-integrity].

I'm sympathetic to the complaint that "lab" is misleading. (And I do use "company" rather than "lab" occasionally, e.g. in the header.) My friends usually talk about "the labs," not "the companies," but to most audiences "company" is more accurate.

I currently think "company" is about as good as "lab." I may change the term throughout the site at some point.

Comment by Zach Stein-Perlman on Introducing AI Lab Watch · 2024-05-05T18:20:04.136Z · LW · GW

This kind of feedback is very helpful to me; thank you! Strong-upvoted and weak-agreevoted.

(I have some factual disagreements. I may edit them into this comment later.)

(If you think Dan's comment makes me suspect this project is full of issues/mistakes, react 💬 and I'll consider writing a detailed soldier-ish reply.)

Comment by Zach Stein-Perlman on Questions for labs · 2024-05-02T18:58:23.395Z · LW · GW

Thanks. Briefly:

I'm not sure what the theory of change for listing such questions is.

In the context of policy advocacy, I think it's sometimes fine/good for labs to say somewhat different things publicly vs privately. Like, if I was in charge of a lab and believed (1) the EU AI Act will almost certainly pass and (2) it has some major bugs that make my life harder without safety benefits, I might publicly say "I support (the goals of) the EU AI Act" and privately put some effort into removing those bugs, which is technically lobbying to weaken the Act.

(^I'm not claiming that particular labs did ~this rather than actually lobby against the Act. I just think it's messy and regulation isn't a one-dimensional thing that you're for or against.)

Edit: this comment was misleading and partially replied to a strawman. I agree it would be good for the labs and their leaders to publicly say some things about recommended regulation (beyond what they already do) and their lobbying. I'm nervous about trying to litigate rumors for reasons I haven't explained.

Edit 2: based on [sources] and background information, I believe that OpenAI, Microsoft, Google, and Meta privately lobbied to make the EU AI Act worse—especially by lobbying against rules for foundation models—and that this is inconsistent with OpenAI's and Altman's public statements.

Comment by Zach Stein-Perlman on Questions for labs · 2024-05-01T17:59:50.013Z · LW · GW

This post is not trying to shame labs for failing to answer before; I didn't try hard to get them to answer. (The period was one week, but I wasn't expecting answers to my email and wouldn't have expected replies even if I'd waited longer.)

(Separately, I kinda hope the answers to basic questions like this are already written down somewhere...)

Comment by Zach Stein-Perlman on Introducing AI Lab Watch · 2024-04-30T21:42:06.091Z · LW · GW

Google sheet.

Some overall scores are one point higher. Probably because my site rounds down. Probably my site should round to the nearest integer...
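A minimal sketch of the discrepancy, using a hypothetical weighted score (the specific value is illustrative, not taken from the actual spreadsheet):

```python
import math

# Hypothetical overall score after averaging weighted subscores.
raw_score = 6.7

floored = math.floor(raw_score)  # rounding down, as the site currently does: 6
nearest = round(raw_score)       # rounding to the nearest integer, as the sheet does: 7

print(floored, nearest)  # → 6 7
```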

Comment by Zach Stein-Perlman on Introducing AI Lab Watch · 2024-04-30T21:32:22.827Z · LW · GW

Thanks for the feedback. I'll add "let people download all the data" to my todo list but likely won't get to it. I'll make a simple google sheet now.

Comment by Zach Stein-Perlman on Tamsin Leake's Shortform · 2024-04-22T01:59:56.429Z · LW · GW

This is too strong. For example, releasing the product would be correct if someone else would do something similar soon anyway and you're safer than them and releasing first lets you capture more of the free energy. (That's not the case here, but it's not as straightforward as you suggest, especially with your "Regardless of how good their alignment plans are" and your claim "There's just no good reason to do that, except short-term greed".)

Comment by Zach Stein-Perlman on Express interest in an "FHI of the West" · 2024-04-19T19:18:03.487Z · LW · GW

Constellation (which I think has some important FHI-like virtues, although makes different tradeoffs and misses on others)

What is Constellation missing or what should it do? (Especially if you haven't already told the Constellation team this.)

Comment by Zach Stein-Perlman on FHI (Future of Humanity Institute) has shut down (2005–2024) · 2024-04-18T06:00:01.117Z · LW · GW

Harry let himself be pulled, but as Hermione dragged him away, he said, raising his voice even louder, "It is entirely possible that in a thousand years, the fact that FHI was at Oxford will be the only reason anyone remembers Oxford!"

Comment by Zach Stein-Perlman on Staged release · 2024-04-17T16:32:17.891Z · LW · GW

Yes, but possibly the lab has its own private scaffolding that works better with its model than any other existing scaffolding (perhaps because it trained the model to use that specific scaffolding), and it could initially withhold that scaffolding from users.

(Maybe it’s impossible to give API access to scaffolding and keep the scaffolding private? Idk.)

Edit: Plus what David says.

Comment by Zach Stein-Perlman on RTFB: On the New Proposed CAIP AI Bill · 2024-04-17T00:14:45.064Z · LW · GW

Suppose you can take an action that decreases net P(everyone dying) but increases P(you yourself kill everyone), and leaves all else equal. I claim you should take it; everyone is better off if you take it.

I deny "deontological injunctions." I want you and everyone to take the actions that lead to the best outcomes, not that keep your own hands clean. I'm puzzled by your expectation that I'd endorse "deontological injunctions."

This situation seems identical to the trolley problem in the relevant ways. I think you should avoid letting people die, not just avoid killing people.

[Note: I roughly endorse heuristics like if you're contemplating crazy-sounding actions for strange-sounding reasons, you should suspect that you're confused about your situation or the effects of your actions, and you should be more cautious than your naive calculations suggest. But that's very different from deontology.]

Comment by Zach Stein-Perlman on Anthropic AI made the right call · 2024-04-16T20:50:00.293Z · LW · GW

I guess I'm more willing to treat Anthropic's marketing as not-representing-Anthropic.

Like, when OpenAI marketing says "GPT-4 is our most aligned model yet!" you could say this shows that OpenAI deeply misunderstands alignment, but I tend to ignore it. Even mostly when Sam Altman says it himself.

[Edit after habryka's reply: my weak independent impression is that often the marketing people say stuff that the leadership and most technical staff disagree with, and if you use marketing-speak to substantially predict what-leadership-and-staff-believe you'll make worse predictions.]

Comment by Zach Stein-Perlman on Anthropic AI made the right call · 2024-04-15T05:02:04.459Z · LW · GW

I guess I'm more willing to treat Anthropic's marketing as not-representing-Anthropic. Shrug. [Edit: like, maybe it's consistent-with-being-a-good-guy-and-basically-honest to exaggerate your product in a similar way to everyone else. (You risk the downsides of creating hype but that's a different issue than the integrity thing.)]

It is disappointing that Anthropic hasn't clarified its commitments after the post-launch confusion, one way or the other.

Comment by Zach Stein-Perlman on Anthropic AI made the right call · 2024-04-15T04:44:50.152Z · LW · GW

Sorry for using the poll to support a different proposition. Edited.

To make sure I understand your position (and Ben's):

  1. Dario committed to Dustin that Anthropic wouldn't "meaningfully advance the frontier" (according to Dustin)
  2. Anthropic senior staff privately gave AI safety people the impression that Anthropic would stay behind/at the frontier (although nobody has quotes)
  3. Claude 3 Opus meaningfully advanced the frontier? Or slightly advanced it but Anthropic markets it like it was a substantial advance so they're being similarly low-integrity?

...I don't think Anthropic violated its deployment commitments. I mostly believe y'all about 2—I didn't know 2 until people asserted it right after the Claude 3 release, but I haven't been around the community, much less well-connected in it, for long—but that feels like an honest miscommunication to me. If I'm missing "past Anthropic commitments" please point to them.

Comment by Zach Stein-Perlman on Anthropic AI made the right call · 2024-04-15T00:59:29.463Z · LW · GW

Most of us agree with you that deploying Claude 3 was reasonable, although I for one disagree with your reasoning. The criticism was mostly about the release being (allegedly) inconsistent with Anthropic's past statements/commitments on releases.

[Edit: the link shows that most of us don't think deploying Claude 3 increased AI risk, not think deploying Claude 3 was reasonable.]

Comment by Zach Stein-Perlman on RTFB: On the New Proposed CAIP AI Bill · 2024-04-11T02:36:25.487Z · LW · GW

"turning out to be right" is CAIS's strength

This is CAIP, not CAIS; CAIP doesn't really have a track record yet.

Comment by Zach Stein-Perlman on RTFB: On the New Proposed CAIP AI Bill · 2024-04-10T19:06:31.250Z · LW · GW

In addition to the bill, CAIP has a short summary and a long summary.

Comment by Zach Stein-Perlman on metachirality's Shortform · 2024-04-01T03:43:01.140Z · LW · GW

Try, turn on anti-kibbitzer, sort comments by New

Comment by Zach Stein-Perlman on My simple AGI investment & insurance strategy · 2024-04-01T00:01:42.664Z · LW · GW

If bid-ask spreads are large, consider doing so less often + holding calls that expire at different times so that every time you roll you're only rolling half of your calls.
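A minimal sketch of the staggered-expiration idea (dates and tranche sizes are hypothetical; this assumes you roll only the expiring half of the position each period, so each roll pays the bid-ask spread on half the contracts):

```python
from datetime import date

# Two tranches of calls with offset expirations.
ladder = [
    {"expiry": date(2025, 1, 17), "contracts": 5},
    {"expiry": date(2026, 1, 16), "contracts": 5},
]

def roll_expiring(ladder, today, new_expiry):
    """Roll only the tranche nearest to expiry, leaving the other
    tranche untouched, so each roll crosses the bid-ask spread on
    just half the position."""
    ladder = sorted(ladder, key=lambda t: t["expiry"])
    nearest = ladder[0]
    if nearest["expiry"] <= today:
        nearest["expiry"] = new_expiry
    return ladder

ladder = roll_expiring(ladder, date(2025, 1, 17), date(2027, 1, 15))
print([t["expiry"].year for t in ladder])  # → [2027, 2026]
```

The near tranche is replaced with a new far-dated tranche while the other tranche keeps expiring on its own schedule, so you never roll the whole position at once.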

Comment by Zach Stein-Perlman on OpenAI: Facts from a Weekend · 2024-03-27T02:54:14.864Z · LW · GW

@gwern I've failed to find a source saying that Hydrazine invested in OpenAI. If it did, that would be a big deal; it would make this a lie.