All of the founders committed to donate 80% of their equity. I heard it's set aside in some way but they haven't donated anything yet. (Source: Anthropic friends.)
This fact wasn't on the internet, or at least wasn't easily findable via Google search. Huh. I only find Holden mentioning that 80% of Daniela's equity is pledged.
I disagree with Ben. I think the usage that Mark is referring to is a reference to Death with Dignity. A central example of my usage is
it would be undignified if AI takes over because we didn't really try off-policy probes; maybe they just work; someone should figure that out
It's playful and unserious but "X would be undignified" roughly means "it would be an unfortunate error if we did X or let X happen" and is used in the context of AI doom and our ability to affect P(doom).
edit: wait likely it's RL; I'm confused
OpenAI didn't fine-tune on ARC-AGI, even though this graph suggests they did.
Sources:
Altman said
we didn't go do specific work [targeting ARC-AGI]; this is just the general effort.
François Chollet (in the blogpost with the graph) said
Note on "tuned": OpenAI shared they trained the o3 we tested on 75% of the Public Training set. They have not shared more details. We have not yet tested the ARC-untrained model to understand how much of the performance is due to ARC-AGI data.
The version of the model we tested was domain-adapted to ARC-AGI via the public training set (which is what the public training set is for). As far as I can tell they didn't generate synthetic ARC data to improve their score.
An OpenAI staff member replied
Correct, can confirm "targeting" exclusively means including a (subset of) the public training set.
and further confirmed that "tuned" in the graph is
a strange way of denoting that we included ARC training examples in the O3 training. It isn’t some finetuned version of O3 though. It is just O3.
Another OpenAI staff member said
also: the model we used for all of our o3 evals is fully general; a subset of the arc-agi public training set was a tiny fraction of the broader o3 train distribution, and we didn’t do any additional domain-specific fine-tuning on the final checkpoint
So on ARC-AGI they just pretrained on 300 examples (75% of the 400 in the public training set). Performance is surprisingly good.
[heavily edited after first posting]
Welcome!
To me the benchmark scores are interesting mostly because they suggest that o3 is substantially more powerful than previous models. I agree we can't naively translate benchmark scores to real-world capabilities.
I think he’s just referring to DC evals, and I think this is wrong because I think other companies doing evals wasn’t really caused by Anthropic (but I could be unaware of facts).
Edit: actually I don't know what he's referring to.
I use empty brackets similar to ellipses in this context; they denote removed nonsubstantive text. (I use ellipses when removing substantive text.)
I think they only have formal high and low versions for o3-mini.
Edit: nevermind idk
Done, thanks.
I already edited out most of the "like"s and similar. I intentionally left some in when they seemed like they might be hedging or otherwise communicating this isn't exact. You are free to post your own version but not to edit mine.
Edit: actually I did another pass and edited out several more; thanks for the nudge.
pass@n is not cheating if answers are easy to verify. E.g. if you can cheaply/quickly verify that code works, pass@n is fine for coding.
It was one submission, apparently.
and then having the model look at all the runs and try to figure out which run had the most compelling-seeming reasoning
The FrontierMath answers are numerical-ish ("problems have large numerical answers or complex mathematical objects as solutions"), so you can just check which answer the model wrote most frequently.
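That selection rule is just majority voting over exact-match answers; a minimal sketch (function name is mine, not anyone's actual pipeline):

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer among n sampled attempts.

    Only sensible when answers are exact-match comparable, e.g.
    FrontierMath-style numerical answers; ties break arbitrarily.
    """
    return Counter(answers).most_common(1)[0][0]

# e.g. 5 samples from the model on one problem
samples = ["128", "128", "127", "128", "42"]
print(majority_vote(samples))  # → 128
```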
The obvious boring guess is best of n. Maybe you're asserting that using $4,000 implies that they're doing more than that.
My guess is they do kinda choose: in training, it's less like two debaters are assigned opposing human-written positions and more like the debate is between two sampled outputs.
Edit: maybe this is different in procedures different from the one Rohin outlined.
The more important zoom-level is: debate is a proposed technique to provide a good training signal. See e.g. https://www.lesswrong.com/posts/eq2aJt8ZqMaGhBu3r/zach-stein-perlman-s-shortform?commentId=DLYDeiumQPWv4pdZ4.
Edit: debate is a technique for iterated amplification -- but that tag is terrible too, oh no
I asked Rohin Shah what the debate agenda is about and he said (off the cuff, not necessarily worded well) (context omitted) (emphasis mine):
Suppose you have any task where given an input x the AI has to produce some output y. (Can be question answering, arbitrary chatbot stuff, being an agent, whatever.)
Debate to me is a family of procedures for training AIs in such settings. One example such procedure is: sample two possible outputs y1 and y2, then have two AIs debate each other about which output is better (one assigned to y1 and the other assigned to y2, typically both are instances of the same underlying weights). The winner of the debate is determined by a judge. You then train the debaters with RL (+1 reward for the winner, -1 reward for the loser), and treat (y_winner, y_loser) as a preference pair for training on the original task (which you could then optimize using DPO or similar).
The hope is that the debaters are incentivized to simply say which output is better, to the best of their ability, and so this allows you to provide a reward signal that is as good as can be provided given the capabilities of the debater AI. (Ideally the debater AI is similarly capable as the AI doing the task; often you’ll want these to be the same AI system.)
More broadly the problem I aim to solve with debate (or amplified oversight more broadly) is “provide a reward signal that the AI cannot hack” (though ofc the AI may still misgeneralize)
The hope is that decreases p(misaligned) — especially p(reward hacking), but probably also p(scheming)
You could also use the resulting debaters at inference time to audit especially suspicious actions (run the debate again, ideally it points out why the given action is good / bad)
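The one-example procedure Rohin describes can be sketched in code (a toy illustration under my own naming; a real implementation would use an LM policy, a trained judge model, and DPO-style updates rather than these stub callables):

```python
def debate_training_step(task_input, sample_output, run_debate, judge):
    """One step of the debate procedure sketched above.

    sample_output(x) -> a candidate output y
    run_debate(x, y1, y2) -> debate transcript (debater 1 defends y1, 2 defends y2)
    judge(transcript) -> 0 if debater 1 wins, 1 if debater 2 wins
    """
    # Sample two candidate outputs for the same input (typically both
    # from the same underlying weights).
    y1 = sample_output(task_input)
    y2 = sample_output(task_input)
    transcript = run_debate(task_input, y1, y2)
    winner = judge(transcript)
    # RL signal for the debaters: +1 for the winner, -1 for the loser.
    debater_rewards = (1, -1) if winner == 0 else (-1, 1)
    # Preference pair for training the task model (e.g. via DPO).
    y_win, y_lose = (y1, y2) if winner == 0 else (y2, y1)
    return debater_rewards, (y_win, y_lose)
```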
This post presented the idea of RSPs and detailed thoughts on them, just after Anthropic's RSP was published. It's since become clear that nobody knows how to write an RSP that's predictably neither way too aggressive nor super weak. But this post, along with the accompanying Key Components of an RSP, is still helpful, I think.
This is the classic paper on model evals for dangerous capabilities.
On a skim, it's aged well; I still agree with its recommendations and framing of evals. One big exception: it recommends "alignment evaluations" to determine models' propensity for misalignment, but such evals can't really provide much evidence against catastrophic misalignment; better to assume AIs are misaligned and use control once dangerous capabilities appear, until much better misalignment-measuring techniques appear.
Interesting point, written up really really well. I don't think this post was practically useful for me but it's a good post regardless.
This post helped me distinguish capabilities-y information that's bad to share from capabilities-y information that's fine/good to share. (Base-model training techniques are bad; evals and eval results are good; scaffolding/prompting/posttraining techniques to elicit more powerful capabilities without more spooky black-box cognition is fine/good.)
To avoid deploying a dangerous model, you can either (1) test the model pre-deployment or (2) test a similar older model with tests that have a safety buffer such that if the old model is below some conservative threshold it's very unlikely that the new model is dangerous.
DeepMind says it uses the safety-buffer plan (but it hasn't yet said it has operationalized thresholds/buffers).
Anthropic's original RSP used the safety-buffer plan; its new RSP doesn't really use either plan (kinda safety-buffer but it's very weak). (This is unfortunate.)
OpenAI seemed to use the test-the-actual-model plan.[1] This isn't going well. The 4o evals were rushed because OpenAI (reasonably) didn't want to delay deployment. Then the o1 evals were done on a weak o1 checkpoint rather than the final model, presumably so they wouldn't be rushed, but this presumably hurt performance a lot on some tasks (and indeed the o1 checkpoint performed worse than o1-preview on some capability evals). OpenAI doesn't seem to be implementing the safety-buffer plan, so if a model is dangerous but not super obviously dangerous, it seems likely OpenAI wouldn't notice before deployment....
(Yay OpenAI for honestly publishing eval results that don't look good.)
- ^ It's not explicit. The PF says e.g. 'Only models with a post-mitigation score of "medium" or below can be deployed.' But it also mentions forecasting capabilities.
This early control post introduced super important ideas: trusted monitoring plus the general point
if you think about approaches to safely using AIs that are robust to deceptive alignment and which can be used right now, a core dynamic is trying to use your dumb trusted models together with your limited access to trusted humans to make it hard for your untrusted smart models to cause problems.
Briefly:
- For OpenAI, I claim the cyber, CBRN, and persuasion Critical thresholds are very high (and also the cyber High threshold). I agree the autonomy Critical threshold doesn't feel so high.
- For Anthropic, most of the action is at ASL-4+, and they haven't even defined the ASL-4 standard yet. (So you can think of the current ASL-4 thresholds as infinitely high. I don't think "The thresholds are very high" for OpenAI was meant to imply a comparison to Anthropic; it's hard to compare since ASL-4 doesn't exist. Sorry for confusion.)
Edit 2: after checking, I now believe the data strongly suggest FTX had a large negative effect on EA community metrics. (I still agree with Buck: "I don't like the fact that this essay is a mix of an insightful generic argument and a contentious specific empirical claim that I don't think you support strongly; it feels like the rhetorical strength of the former lends credence to the latter in a way that isn't very truth-tracking." And I disagree with habryka's claims that the effect of FTX is obvious.)
practically all metrics of the EA community's health and growth have sharply declined, and the extremely large and negative reputational effects have become clear.
I want more evidence on your claim that FTX had a major effect on EA reputation. Or: why do you believe it?
Edit: relevant thing habryka said that I didn't quote above:
For the EA surveys, these indicators looked very bleak:
"Results demonstrated that FTX had decreased satisfaction by 0.5-1 points on a 10-point scale within the EA community"
"Among those aware of EA, attitudes remain positive and actually maybe increased post-FTX —though they were lower (d = -1.5, with large uncertainty) among those who were additionally aware of FTX."
"Most respondents reported continuing to trust EA organizations, though over 30% said they had substantially lost trust in EA public figures or leadership."
Another possible best practice for evals: use human+AI rather than [edit: just] AI alone. Many threat models involve human+AI and sometimes human+AI is substantially stronger than human alone and AI alone.
Claim by SemiAnalysis guy: DeepSeek has over 50,000 H100s. (And presumably they don't spend much compute on inference.)
Some context: interview translation; June FT article on DeepSeek; parent hedge fund Wikipedia page.
I want to distinguish two versions of the intervention (in multi-step settings): (1) hide CoT on the suspicious step vs (2) hide all prior CoT (and hide all CoT partway through 5% or something of non-suspicious trajectories so the model can't infer from CoT-hidden that the trajectory was flagged, or do a control eval with results that demonstrate this is unnecessary).
I'm over 75% that a lab would do (1), conditional on doing a "replace suspicious outputs" thing (but under 75% they'd say they're doing so, much less demonstrate it to a skeptical audience; they might just not publish details, especially if this protocol is just in the internal deployment context). I'm under 75% that a lab would do (2).
It would be crazy to show the untrusted model the old CoT when resampling.
Open Philanthropy's AI safety work tends toward large grants in the high hundreds of thousands or low millions, meaning individuals and organizations with lower funding needs won't be funded by them
This is directionally true but projects seeking less money should still apply to OP if relevant; a substantial minority of recent OP AI safety and GCR capacity building grants are <$100K.
What's a bad thing that the agent can do with the jailbreak that it can't just do in the unmonitored scaffold?
Edit: dumb question, oops: obviously the schemer will have affordances that it doesn't have within the unmonitored scaffold / there are actions where it's helpful to jailbreak the monitor. E.g. submitting a pull request, where the code will be read by the monitor. So finding a jailbreak is helpful to the schemer.
Why can't the schemer just do the bad stuff in the unmonitored scaffold from (1)?
We can make scheming legible at least to sophisticated-scheming-skeptics like Sam Altman and Dario.
Especially if the AI is even sandbagging on simply-coding when it thinks it's for safety research. And if it's not doing that, we can get some useful safety work out of it.
@Adam Kaufman @Tyler Tracy @David Matolcsi see Ryan's comments.
I mostly agree. But I think some kinds of autonomous weapons would make loss-of-control and coups easier. On the other hand, boosting US security is good, so the net effect is unclear. And that's very far from the recent news (and Anthropic has a Usage Policy, with exceptions, which disallows various uses; my guess is it's too strong on weapons).
I just claim that if so it's easy to catch. (And then (1) catching it is nice for various reasons and (2) maybe countermeasures work.)
I brainstormed sabotage-related threats with friends:
- Sandbag on a few safety research tasks, especially techniques to catch scheming and stuff that's high-level or has poor feedback loops
- Plausible; maybe not so bad
- Sandbag on ~all safety research tasks
- Easy to catch; catching obvious sandbagging has safety benefits + maybe we can posttrain it away
- Sandbag/lie when validating that a model/deployment/whatever is safe
- idk
- When automating AI R&D, backdoor the model
- (this is takeover-attempt-y, not sandbagging-y)
- idk
- When deployed to automate/improve security (especially for the model's own weights): both sandbagging and code-backdooring/hacking/etc.
- idk
- Sandbagging on AI R&D to help another lab win the race (!?)
- Implausible
- ??
Are people quite worried about safety research sabotage? Why? Do they think a (say) 20% hit to automated safety research is a huge deal, or do they expect a larger hit?
See [V1.1]
When someone describes a safety case where one of the steps is infeasible, here are two possible reactions:
- Step X will be ~impossible to justify; hopefully the lab will understand that (if nothing else then when they try to justify step X) and look for a different kind of safety case
- Step X will be ~impossible to justify, but the authors seem to think the plan requiring it is decent; this suggests that they don't understand the issue and when they make a safety case they'll likely think they're justifying step X when they're not
Regardless, if you think a step is very hard or the authors seem confused, ideally you should explain why. And then hopefully the authors say "yeah oops that's too hard" or "yeah we're imagining doing the really hard thing" or "we don't think it will be so hard: we agree observation O would suffice and we expect it's easy and if it turns out to be hard we'll have to make a different safety case" or "we disagree, here's why." (I'm mostly observing this independent of this post.)
[Edited]
My initial reactions on a quick read:
- Interp: no inside view; my impression is that my friends are pessimistic about such safety cases in the next few years
- Control: hooray for Anthropic finally writing something about control; some details seem good; some details seem odd ("contexts that were specifically whitelisted"); I think the naive pre-deployment honeypots path doesn't work because a scheming model wins with the defect rarely strategy (related: https://www.lesswrong.com/posts/vYWDEtJ2b8tkbG7Rv/behavioral-red-teaming-is-unlikely-to-produce-clear-strong) [very sophisticated or expensive honeypotting setups could work but the post doesn't seem to be imagining that or noticing the issue]; that's important and the current language in the post is concerning
- Incentives: I don't really understand, or I'm dubious, or I expect that A3.1 is very hard to demonstrate
Anthropic: The case for targeted regulation.
I like the "Urgency" section.
Anthropic says the US government should "require companies to have and publish RSP-like policies" and "incentivize companies to develop RSPs that are effective" and do related things. I worry that RSPs will be ineffective for most companies, even given some transparency/accountability and vague minimum standard.
Edit: some strong versions of "incentivize companies to develop RSPs that are effective" would be great; I wish Anthropic advocated for them specifically rather than hoping the US government figures out a great answer.
After Miles Brundage left to do non-profit work, OpenAI disbanded the “AGI Readiness” team he had been leading, after previously disbanding the Superalignment team and reassigning the head of the preparedness team. I do worry both about what this implies, and that Miles Brundage may have made a mistake leaving given his position.
Do we know that this is the causal story? I think "OpenAI decided to disempower/disband/ignore Readiness, and so Miles left" is likely.
Some not-super-ambitious asks for labs (in progress):
- Do evals on dangerous-capabilities-y and agency-y tasks; look at the score before releasing or internally deploying the model
- Have a high-level capability threshold at which securing model weights is very important
- Do safety research at all
- Have a safety plan at all; talk publicly about safety strategy at all
- Have a post like Core Views on AI Safety
- Have a post like The Checklist
- On the website, clearly describe a worst-case plausible outcome from AI and state credence in such outcomes (perhaps unconditionally, conditional on no major government action, and/or conditional on leading labs being unable or unwilling to pay a large alignment tax).
- Whistleblower protections?
- [Not sure what the ask is. Right to Warn is a starting point. In particular, an anonymous internal reporting pipeline for failures to comply with safety policies is clearly good (but likely inadequate).]
- Publicly explain the processes and governance structures that determine deployment decisions
- (And ideally make those processes and structures good)
demonstrating its benignness is not the issue, it's actually making them benign
This also stood out to me.
Demonstrating benignness is roughly equivalent to being transparent enough that the other actor can determine whether you're benign + actually being benign.
For now, model evals for dangerous capabilities (along with assurance that you're evaluating the lab's most powerful model + there are no secret frontier AI projects) could suffice to demonstrate benignness. After they no longer work, even just getting good evidence that your AI project [model + risk assessment policy + safety techniques deployment plans + weights/code security + etc.] is benign is an open problem, nevermind the problem of proving that to other actors (nevermind doing so securely and at reasonable cost).
Two specific places I think are misleading:
Unfortunately, it is inherently difficult to credibly signal the benignness of AI development and deployment (at a system level or an organization level) due to AI’s status as a general purpose technology, and because the information required to demonstrate benignness may compromise security, privacy, or intellectual property. This makes the research, development, and piloting of new ways to credibly signal the benignness of AI development and deployment, without causing other major problems in the process, an urgent priority.
. . .
We need to define what information is needed in order to convey benignness without unduly compromising intellectual property, privacy, or security.
The bigger problem is (building powerful AI such that your project is benign and) getting strong evidence for yourself that your project is benign. But once you're there, demonstrating benignness to other actors is another problem, lesser but potentially important and under-considered.
Or: this post mostly sets aside the alignment/control problem, and it should have signposted that it was doing so better, but it's still a good post on another problem.
Weaker:
- AI R&D threshold: yes, the threshold is much higher
- CBRN threshold: not really, except maybe the new threshold excludes risk from moderately sophisticated nonstate actors
- ASL-3 deployment standard: yes; the changes don't feel huge but the new standard doesn't feel very meaningful
- ASL-3 security standard: no (but both old and new are quite vague)
Vaguer: yes. But the old RSP didn't really have "precisely defined capability thresholds and mitigation measures." (The ARA threshold did have that 50% definition, but another part of the RSP suggested those tasks were merely illustrative.)
Internet comments sections where the comments are frequently insightful/articulate and not too full of obvious-garbage [or at least the top comments, given the comment-sorting algorithm]:
- LessWrong and EA Forum
- Daily Nous
- Stack Exchange [top comments only] (not asserting that it's reliable)
- Some subreddits?
Where else?
Are there blogposts on adjacent stuff, like why some internet [comments sections / internet-communities-qua-places-for-truthseeking] are better than others? Besides Well-Kept Gardens Die By Pacifism.
How does this paper suggest "that most safety research doesn't come from the top labs"?
Nice. I mostly agree on current margins — the more I mistrust a lab, the more I like transparency.
I observe that unilateral transparency is unnecessary on the view-from-the-inside if you know you're a reliably responsible lab. And some forms of transparency are costly. So the more responsible a lab seems, the more sympathetic we should be to it saying "we thought really hard and decided more unilateral transparency wouldn't be optimific." (For some forms of transparency.)
A related/overlapping thing I should collect: basic low-hanging fruit for safety. E.g.:
- Have a high-level capability threshold at which securing model weights is very important
- Evaluate whether your models are capable at agentic tasks; look at the score before releasing or internally deploying them
I may edit this post in a week based on feedback. You can suggest changes that don't merit their own LW comments here. That said, I'm not sure what feedback I want.