Claim by SemiAnalysis guy: DeepSeek has over 50,000 H100s. (And presumably they don't spend much compute on inference.)
Some context: June FT article on DeepSeek; parent hedge fund Wikipedia page.
I want to distinguish two versions of the intervention (in multi-step settings): (1) hide CoT on the suspicious step vs (2) hide all prior CoT (and hide all CoT partway through 5% or so of non-suspicious trajectories so the model can't infer from the CoT being hidden that the trajectory was flagged, or do a control eval with results that demonstrate this is unnecessary).
I'm over 75% that a lab would do (1), conditional on doing a "replace suspicious outputs" thing (but under 75% they'd say they're doing so, much less demonstrate it to a skeptical audience; they might just not publish details, especially if this protocol is just in the internal deployment context). I'm under 75% that a lab would do (2).
It would be crazy to show the untrusted model the old CoT when resampling.
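A minimal sketch of the two variants, assuming a toy scaffold where a trajectory is a list of messages tagged as CoT or output (the function names and message format are hypothetical, just to make the distinction concrete):

```python
import random

HIDE_DECOY_RATE = 0.05  # assumption: also hide CoT partway through ~5% of non-suspicious trajectories (variant 2)

def resample_suspicious_step(untrusted_model, history, suspicious_idx, variant):
    """Resample a flagged step without ever showing the model its old (suspicious) CoT.

    `untrusted_model` is a hypothetical callable taking a list of messages;
    `history` is a list of dicts like {"step": i, "type": "cot" or "output", "text": ...}.
    """
    if variant == 1:
        # Variant (1): hide only the suspicious step's CoT; prior CoT stays visible.
        visible = [m for m in history
                   if not (m["step"] == suspicious_idx and m["type"] == "cot")]
    else:
        # Variant (2): hide all CoT up to and including the suspicious step.
        visible = [m for m in history
                   if not (m["type"] == "cot" and m["step"] <= suspicious_idx)]
    return untrusted_model(visible)

def maybe_hide_cot_decoy(history):
    """Companion to variant (2): in a small fraction of non-suspicious trajectories,
    hide all CoT from a random step onward, so a hidden CoT doesn't reveal that the
    trajectory was flagged."""
    if random.random() < HIDE_DECOY_RATE and history:
        cutoff = random.choice([m["step"] for m in history])
        return [m for m in history
                if not (m["type"] == "cot" and m["step"] >= cutoff)]
    return history
```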
Open Philanthropy's AI safety work tends toward large grants in the high hundreds of thousands or low millions, meaning individuals and organizations with lower funding needs won't be funded by them
This is directionally true but projects seeking less money should still apply to OP if relevant; a substantial minority of recent OP AI safety and GCR capacity building grants are <$100K.
What's a bad thing that the agent can do with the jailbreak that it can't just do in the unmonitored scaffold?
Edit: dumb question, oops: obviously the schemer will have affordances in its actual (monitored) deployment that it doesn't have within the unmonitored scaffold / there are actions where it's helpful to jailbreak the monitor. E.g. submitting a pull request, where the code will be read by the monitor. So finding a jailbreak is helpful to the schemer.
Why can't the schemer just do the bad stuff in the unmonitored scaffold from (1)?
We can make scheming legible at least to sophisticated-scheming-skeptics like Sam Altman and Dario.
Especially if the AI is even sandbagging on simply-coding when it thinks it's for safety research. And if it's not doing that, we can get some useful safety work out of it.
@Adam Kaufman @Tyler Tracy @David Matolcsi see Ryan's comments.
I mostly agree. But I think some kinds of autonomous weapons would make loss of control and coups easier. On the other hand, boosting US security is good, so the net effect is unclear. And that's very far from the recent news (and Anthropic has a Usage Policy, with exceptions, which disallows various uses; my guess is this is too strong on weapons).
I just claim that if so it's easy to catch. (And then (1) catching it is nice for various reasons and (2) maybe countermeasures work.)
I brainstormed sabotage-related threats with friends:
- Sandbag on a few safety research tasks, especially techniques to catch scheming and stuff that's high-level or has poor feedback loops
  - Plausible; maybe not so bad
- Sandbag on ~all safety research tasks
  - Easy to catch; catching obvious sandbagging has safety benefits + maybe we can posttrain it away
- Sandbag/lie when validating that a model/deployment/whatever is safe
  - idk
- When automating AI R&D, backdoor the model
  - (this is takeover-attempt-y, not sandbagging-y)
  - idk
- When deployed to automate/improve security (especially for the model's own weights): both sandbagging and code-backdooring/hacking/etc.
  - idk
- Sandbagging on AI R&D to help another lab win the race (!?)
  - Implausible
- ??
Are people quite worried about safety research sabotage? Why? Do they think a (say) 20% hit to automated safety research is a huge deal, or do they expect a larger hit?
See [V1.1]
When someone describes a safety case where one of the steps is infeasible, here are two possible reactions:
- Step X will be ~impossible to justify; hopefully the lab will understand that (if nothing else then when they try to justify step X) and look for a different kind of safety case
- Step X will be ~impossible to justify, but the authors seem to think the plan requiring it is decent; this suggests that they don't understand the issue and when they make a safety case they'll likely think they're justifying step X when they're not
Regardless, if you think a step is very hard or the authors seem confused, ideally you should explain why. And then hopefully the authors say "yeah oops that's too hard" or "yeah we're imagining doing the really hard thing" or "we don't think it will be so hard: we agree observation O would suffice and we expect it's easy and if it turns out to be hard we'll have to make a different safety case" or "we disagree, here's why." (I'm mostly observing this independently of this post.)
[Edited]
My initial reactions on a quick read:
- Interp: no inside view; my impression is that my friends are pessimistic about such safety cases in the next few years
- Control: hooray for Anthropic finally writing something about control; some details seem good; some details seem odd ("contexts that were specifically whitelisted"); I think the naive pre-deployment honeypots path doesn't work because a scheming model wins with the defect-rarely strategy (see the toy calculation after this list; related: https://www.lesswrong.com/posts/vYWDEtJ2b8tkbG7Rv/behavioral-red-teaming-is-unlikely-to-produce-clear-strong) [very sophisticated or expensive honeypotting setups could work, but the post doesn't seem to be imagining that or noticing the issue]; that's important and the current language in the post is concerning
- Incentives: I don't really understand, or I'm dubious, or I expect that A3.1 is very hard to demonstrate
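On the honeypots point: here's a toy calculation of why a fixed pre-deployment honeypot budget loses to a model that defects rarely. The numbers are made up and purely illustrative (mine, not from the post):

```python
# Toy, illustrative numbers (assumptions, not from any source):
N = 1_000    # pre-deployment honeypot episodes
M = 100_000  # apparent opportunities to defect during real deployment
p = 5 / M    # schemer defects rarely: expects ~5 attempts across deployment

p_caught_by_honeypots = 1 - (1 - p) ** N    # ~4.9%
p_defects_in_deployment = 1 - (1 - p) ** M  # ~99.3%

print(f"caught pre-deployment: {p_caught_by_honeypots:.1%}")
print(f"defects at least once in deployment: {p_defects_in_deployment:.1%}")
```

The schemer only needs its per-opportunity defection rate to be small relative to the number of honeypot episodes, which it can afford as long as real deployment offers many more opportunities.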
Anthropic: The case for targeted regulation.
I like the "Urgency" section.
Anthropic says the US government should "require companies to have and publish RSP-like policies" and "incentivize companies to develop RSPs that are effective" and do related things. I worry that RSPs will be ineffective for most companies, even given some transparency/accountability and a vague minimum standard.
Edit: some strong versions of "incentivize companies to develop RSPs that are effective" would be great; I wish Anthropic advocated for them specifically rather than hoping the US government figures out a great answer.
After Miles Brundage left to do non-profit work, OpenAI disbanded the “AGI Readiness” team he had been leading, after previously disbanding the Superalignment team and reassigning the head of the preparedness team. I do worry both about what this implies, and that Miles Brundage may have made a mistake leaving given his position.
Do we know that this is the causal story? I think "OpenAI decided to disempower/disband/ignore Readiness and so Miles left" is likely.
Some not-super-ambitious asks for labs (in progress):
- Do evals on dangerous-capabilities-y and agency-y tasks; look at the score before releasing or internally deploying the model
- Have a high-level capability threshold at which securing model weights is very important
- Do safety research at all
- Have a safety plan at all; talk publicly about safety strategy at all
- Have a post like Core Views on AI Safety
- Have a post like The Checklist
- On the website, clearly describe a worst-case plausible outcome from AI and state credence in such outcomes (perhaps unconditionally, conditional on no major government action, and/or conditional on leading labs being unable or unwilling to pay a large alignment tax).
- Whistleblower protections?
  - [Not sure what the ask is. Right to Warn is a starting point. In particular, an anonymous internal reporting pipeline for failures to comply with safety policies is clearly good (but likely inadequate).]
- Publicly explain the processes and governance structures that determine deployment decisions
  - (And ideally make those processes and structures good)
demonstrating its benignness is not the issue, it's actually making them benign
This also stood out to me.
"Demonstrate benignness" is roughly equivalent to "be transparent enough that the other actor can determine whether you're benign" + "actually be benign."
For now, model evals for dangerous capabilities (along with assurance that you're evaluating the lab's most powerful model + there are no secret frontier AI projects) could suffice to demonstrate benignness. After they no longer work, even just getting good evidence that your AI project [model + risk assessment policy + safety techniques + deployment plans + weights/code security + etc.] is benign is an open problem, never mind the problem of proving that to other actors (never mind doing so securely and at reasonable cost).
Two specific places I think are misleading:
Unfortunately, it is inherently difficult to credibly signal the benignness of AI development and deployment (at a system level or an organization level) due to AI’s status as a general purpose technology, and because the information required to demonstrate benignness may compromise security, privacy, or intellectual property. This makes the research, development, and piloting of new ways to credibly signal the benignness of AI development and deployment, without causing other major problems in the process, an urgent priority.
. . .
We need to define what information is needed in order to convey benignness without unduly compromising intellectual property, privacy, or security.
The bigger problem is (building powerful AI such that your project is benign and) getting strong evidence for yourself that your project is benign. But once you're there, demonstrating benignness to other actors is another problem, lesser but potentially important and under-considered.
Or: this post mostly sets aside the alignment/control problem, and it should have signposted more clearly that it was doing so, but it's still a good post on another problem.
Weaker:
- AI R&D threshold: yes, the threshold is much higher
- CBRN threshold: not really, except maybe the new threshold excludes risk from moderately sophisticated nonstate actors
- ASL-3 deployment standard: yes; the changes don't feel huge but the new standard doesn't feel very meaningful
- ASL-3 security standard: no (but both old and new are quite vague)
Vaguer: yes. But the old RSP didn't really have "precisely defined capability thresholds and mitigation measures." (The ARA threshold did have that 50% definition, but another part of the RSP suggested those tasks were merely illustrative.)
Internet comments sections where the comments are frequently insightful/articulate and not too full of obvious-garbage [or at least the top comments, given the comment-sorting algorithm]:
- LessWrong and EA Forum
- Daily Nous
- Stack Exchange [top comments only] (not asserting that it's reliable)
- Some subreddits?
Where else?
Are there blogposts on adjacent stuff, like why some internet [comments sections / internet-communities-qua-places-for-truthseeking] are better than others? Besides Well-Kept Gardens Die By Pacifism.
How does this paper suggest "that most safety research doesn't come from the top labs"?
Nice. I mostly agree on current margins — the more I mistrust a lab, the more I like transparency.
I observe that unilateral transparency is unnecessary on the view-from-the-inside if you know you're a reliably responsible lab. And some forms of transparency are costly. So the more responsible a lab seems, the more sympathetic we should be to it saying "we thought really hard and decided more unilateral transparency wouldn't be optimific." (For some forms of transparency.)
A related/overlapping thing I should collect: basic low-hanging fruit for safety. E.g.:
- Have a high-level capability threshold at which securing model weights is very important
- Evaluate whether your models are capable at agentic tasks; look at the score before releasing or internally deploying them
I may edit this post in a week based on feedback. You can suggest changes that don't merit their own LW comments here. That said, I'm not sure what feedback I want.
I'm confused/skeptical about this being relevant; I thought honesty is orthogonal to whether the model has access to its mental states.
I haven't read the paper. I think doing a little work on introspection is worthwhile. But I naively expect that it's quite intractable to do introspection science when we don't have access to the ground truth, and such cases are the only important ones. Relatedly, these tasks are trivialized by letting the model call itself, while letting the model call itself gives no help on welfare or "true preferences" introspection questions, if I understand correctly. [Edit: like, the inner states here aren’t the important hidden ones.]
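To spell out the "calling itself" point, a toy sketch (hypothetical names, not from the paper; the introspection-style tasks ask a model to predict facts about its own behavior):

```python
def predict_own_response(model, inner_prompt):
    # "What would you say if asked `inner_prompt`?" is trivial if the scaffold lets the
    # model call itself: just run the inner prompt and report the result. No privileged
    # access to inner states is needed, which is why this kind of task doesn't probe the
    # hidden states that matter for welfare or "true preferences" questions.
    return model(inner_prompt)
```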
Why work on introspection?
I don't know. I'm not directly familiar with CLTR's work — my excitement about them is deference-based. (Same for Horizon and TFS, mostly. I inside-view endorse the others I mention.)
Mea culpa. Sorry. Thanks.
Update: I think I've corrected this everywhere I've said it publicly.
No, in that post METR says it's excited about trying auditing, but "it was all under NDA" and "We also didn’t have the access necessary to perform a proper evaluation." Anthropic could commit to share with METR pre-deployment, give them better access, and let them publish stuff about their findings. I don't know if that would turn out well, but Anthropic could be trying much harder.
And auditing doesn't just mean model evals for dangerous capabilities — it could also be for security. (Or procedural stuff, but that doesn't solve the object-level problem.)
Sidenote: credit to Sam Bowman for saying
I think the most urgent safety-related issue that Anthropic can’t directly address is the need for one or, ideally, several widely respected third-party organizations that can play this adjudication role competently.
Oops, I forgot https://www.anthropic.com/rsp-updates. This is great. I really like that Anthropic shares "non-binding descriptions of our future ASL-3 safeguard plans."
(I agree that the intention is surely no more than 6 months; I'm mostly annoyed for legibility—things like this make it harder for me to say "Anthropic has clearly committed to X" for lab-comparison purposes—and LeCun-test reasons)
Thanks.
- I disagree, e.g. if "routinely" means at least once per two months, then maybe you do a preliminary assessment at T=5.5 months and then don't do the next until T=7.5 months, and so don't do a comprehensive assessment for over 6 months.
- Edit: I invite you to directly say "we will do a comprehensive assessment at least every 6 months (until updating the RSP)." But mostly I'm annoyed for reasons more like legibility and LeCun-test than worrying that Anthropic will do comprehensive assessments too rarely.
- [no longer endorsed; mea culpa] I know this is what's going on in y'all's heads but I don't buy that this is a reasonable reading of the original RSP. The original RSP says that 50% on ARA makes it an ASL-3 model. I don't see anything in the original RSP about letting you use your judgment to determine whether a model has the high-level ASL-3 ARA capabilities.
Correction:
On whether Anthropic uses chain-of-thought, I said "Yes, but not explicit, but implied by discussion of elicitation in the RSP and RSP evals report." This impression was also based on conversations with relevant Anthropic staff members. Today, Anthropic says "Some of our evaluations lacked some basic elicitation techniques such as best-of-N or chain-of-thought prompting." Anthropic also says it believes the elicitation gap is small, in tension with previous statements.
Old ASL-3 standard: see original RSP pp. 7-8 and 21-22. New ASL-3 standard. Quotes below.
Also note the original RSP said
We will publish a more comprehensive list of our implemented ASL-3 security measures below (with additional components not listed here) following the [RAND] report’s publication.
The RAND report was published in late May but this list never appeared.
Also the old RSP was about "containment" rather than just "security": containment is supposed to address risk of model self-exfiltration in addition to risk of weights being stolen. (But not really at ASL-3.) The new RSP is just about security.
Old ASL-3 security:
At ASL-3, labs should harden security against non-state attackers and provide some defense against state-level attackers. We commit to the following security themes. Similarly to ASL-2, this summary previews the key security measures at a high level and is based on the forthcoming RAND report. We will publish a more comprehensive list of our implemented ASL-3 security measures below (with additional components not listed here) following the report’s publication.
These requirements are cumulative above the ASL-2 requirements.
- At the software level, there should be strict inventory management tracking all software components used in development and deployment. Adhering to specifications like SSDF and SLSA, which includes a secure build pipeline and cryptographic signature enforcement at deployment time, must provide tamper-proof infrastructure. Frequent software updates and compliance monitoring must maintain security over time.
- On the hardware side, sourcing should focus on security-minded manufacturers and supply chains. Storage of sensitive weights must be centralized and restricted. Cloud network infrastructure must follow secure design patterns.
- Physical security should involve sweeping premises for intrusions. Hardware should be hardened to prevent external attacks on servers and devices.
- Segmentation should be implemented throughout the organization to a high threshold limiting blast radius from attacks. Access to weights should be indirect, via managed interfaces rather than direct downloads. Software should place limitations like restricting third-party services from accessing weights directly. Employees must be made aware that weight interactions are monitored. These controls should scale as an organization scales.
- Ongoing monitoring such as compromise assessments and blocking of malicious queries should be both automated and manual. Limits must be placed on the number of inferences for each set of credentials. Model interactions that could bypass monitoring must be avoided.
- Organizational policies must aim to enforce security through code, limiting reliance on manual compliance.
- To scale to meet the risk from people-vectors, insider threat programs should be hardened to require multi-party controls and incentivize reporting risks. Endpoints should be hardened to run only allowed software.
- Pen-testing, diverse security experience, concrete incident experience, and funding for substantial capacity all should contribute. A dedicated, resourced security red team with ongoing access to design and code must support testing for insider threats. Effective honeypots should be set up to detect attacks.
New ASL-3 security:
When a model must meet the ASL-3 Security Standard, we will evaluate whether the measures we have implemented make us highly protected against most attackers’ attempts at stealing model weights.
We consider the following groups in scope: hacktivists, criminal hacker groups, organized cybercrime groups, terrorist organizations, corporate espionage teams, internal employees,[1] and state-sponsored programs that use broad-based and non-targeted techniques (i.e., not novel attack chains).
The following groups are out of scope for the ASL-3 Security Standard because further testing (as discussed below) should confirm that the model would not meaningfully increase their ability to do harm: state-sponsored programs that specifically target us (e.g., through novel attack chains or insider compromise) and a small number (~10) of non-state actors with state-level resourcing or backing that are capable of developing novel attack chains that utilize 0-day attacks.
To make the required showing, we will need to satisfy the following criteria:
- Threat modeling: Follow risk governance best practices, such as use of the MITRE ATT&CK Framework to establish the relationship between the identified threats, sensitive assets, attack vectors and, in doing so, sufficiently capture the resulting risks that must be addressed to protect model weights from theft attempts. As part of this requirement, we should specify our plans for revising the resulting threat model over time.
- Security frameworks: Align to and, as needed, extend industry-standard security frameworks for addressing identified risks, such as disclosure of sensitive information, tampering with accounts and assets, and unauthorized elevation of privileges with the appropriate controls. This includes:
- Perimeters and access controls: Building strong perimeters and access controls around sensitive assets to ensure AI models and critical systems are protected from unauthorized access. We expect this will include a combination of physical security, encryption, cloud security, infrastructure policy, access management, and weight access minimization and monitoring.
- Lifecycle security: Securing links in the chain of systems and software used to develop models, to prevent compromised components from being introduced and to ensure only trusted code and hardware is used. We expect this will include a combination of software inventory, supply chain security, artifact integrity, binary authorization, hardware procurement, and secure research development lifecycle.
- Monitoring: Proactively identifying and mitigating threats through ongoing and effective monitoring, testing for vulnerabilities, and laying traps for potential attackers. We expect this will include a combination of endpoint patching, product security testing, log management, asset monitoring, and intruder deception techniques.
- Resourcing: Investing sufficient resources in security. We expect meeting this standard of security to require roughly 5-10% of employees being dedicated to security and security-adjacent work.
- Existing guidance: Aligning where appropriate with existing guidance on securing model weights, including Securing AI Model Weights, Preventing Theft and Misuse of Frontier Models (2024); security recommendations like Deploying AI Systems Securely (CISA/NSA/FBI/ASD/CCCS/GCSB/GCHQ), ISO 42001, CSA’s AI Safety Initiative, and CoSAI; and standards frameworks like SSDF, SOC 2, NIST 800-53.
- Audits: Develop plans to (1) audit and assess the design and implementation of the security program and (2) share these findings (and updates on any remediation efforts) with management on an appropriate cadence. We expect this to include independent validation of threat modeling and risk assessment results; a sampling-based audit of the operating effectiveness of the defined controls; periodic, broadly scoped, and independent testing with expert red-teamers who are industry-renowned and have been recognized in competitive challenges.
- Third-party environments: Document how all relevant models will meet the criteria above, even if they are deployed in a third-party partner’s environment that may have a different set of safeguards.
The new standard is more vague and meta, and Anthropic indeed abandoned various specifics from the original RSP. I think it is very far from passing the LeCun test; it's more like talking about security themes than making an object-level if-then commitment. I don't think Anthropic's security at ASL-3 is a huge deal, and I expect Anthropic to be quite non-LeCun-y, but I think this is just too vague for me to feel good about.
In May Anthropic said "around 8% of all Anthropic employees are now working on security-adjacent areas and we expect that proportion to grow further as models become more economically valuable to attackers." Apparently they don't expect to increase that for ASL-3.
Edit: I changed my mind. I agree with Ryan's comment below. I wish we lived in a world where it's optimal to make strong object-level if-then commitments, but we don't, since we don't know which mitigations will be best to focus on. Tying hands to implement specific mitigations would waste resources. Better to make more meta commitments. (Strong versions require external auditing.)
[1]: We will implement robust insider risk controls to mitigate most insider risk, but consider mitigating risks from highly sophisticated state-compromised insiders to be out of scope for ASL-3. We are committed to further enhancing these protections as a part of our ASL-4 preparations.
Quick updates:
- I worried this was a loophole: "the Trust Agreement also authorizes the Trust to be enforced by the company and by groups of the company’s stockholders who have held a sufficient percentage of the company’s equity for a sufficient period of time." An independent person told me it's a normal Delaware law thing and it's only relevant if the Trust breaks the rules. Yay! This is good news, but I haven't verified it and I'm still somewhat suspicious (but this is mostly on me, not Anthropic, to figure out).
- I started trying to deduce stuff about who holds how many shares. I think this isn't really doable from public information, e.g. some Series C shares have voting power and some don't and I don't know how to tell which investors have which. But I got a better understanding of this stuff and I now tentatively think the Transfer Approval Threshold is super reasonable. Yay!
  - Also even if I understood the VC investments perfectly I think I'd be confused by the (much larger?) investments by Amazon and Google
  - If you're savvy in this and want to help, let me know and I'll share my rough notes
(On the other hand, it's pretty concerning that the Trust hasn't replaced Jason, who left in December, nor Paul, who left in April. Also note that it hasn't elected a board member to fill Luke's seat.)
I guess my LTBT-related asks for Anthropic are now:
- Commit that you'll promptly announce publicly if certain big LTBT-related changes happen
- Publish the Trust Agreement (and any other relevant documents), or at least let a non-Anthropic person I trust read it (under an NDA if you want) and publish whether they have big concerns
(This isn't high-priority for me; I just get annoyed when Anthropic brags about its LTBT without having justified that it's great.)
I think this post was underrated; I look back at it frequently: AI labs can boost external safety research. (It got some downvotes but no comments — let me know if it's wrong/bad.)
[Edit: it was at 15 karma.]
I agree safety-by-control kinda requires good security. But safety-by-alignment kinda requires good security too.
I am excited about donations to all of the following, in no particular order:
Interesting. Thanks. (If there's a citation for this, I'd probably include it in my discussions of evals best practices.)
Hopefully evals almost never trigger "spontaneous sandbagging"? Hacking and bio capabilities and so forth are generally more like carjacking than torture.
(Just the justification, of course; fixed.)
I agree noticing whether the model is refusing and, if so, bypassing refusals in some way is necessary for good evals (unless the refusal is super robust—such that you can depend on it for safety during deployment—and you're not worried about rogue deployments). But that doesn't require fine-tuning — possible alternatives include jailbreaking or few-shot prompting. Right?
(Fine-tuning is nice for eliciting stronger capabilities, but that's different.)
I think I was thinking:
- The war room transcripts will leak publicly
- Generals can secretly DM each other, while keeping up appearances in the shared channels
- If a general believes that all of their communication with their team will leak, we're back to a unilateralist's curse situation: if a general thinks they should nuke, obviously they shouldn't say that to their team, so maybe they nuke unilaterally
- (Not obvious whether this is an infohazard)
- [Probably some true arguments about the payoff matrix and game theory increase P(mutual destruction). Also some false arguments about game theory — but maybe an infohazard warning makes those less likely to be posted too.]
(Also after I became a general I observed that I didn't know what my "launch code" was; I was hoping the LW team forgot to give everyone launch codes and this decreased P(nukes); saying this would cause everyone to know their launch codes and maybe scare the other team.)
I don't think this is very relevant to real-world infohazards, because this is a game with explicit rules and because in the real world the low-hanging infohazards have been shared, but it seems relevant to mechanism design.
Update with two new responses:
I think this is 10 generals, 1 Petrov, and one other person (either the other Petrov or a citizen; not sure, wasn't super rigorous).
The post says generals' names will be published tomorrow.
No. I noticed ~2 more subtle infohazards, was wishing for nobody to post them, and realized I could decrease that probability by making an infohazard warning.
I ask that you refrain from being the reason that security-by-obscurity fails, if you notice subtle infohazards.
Some true observations are infohazards, making destruction more likely. Please think carefully before posting observations. Even if you feel clever. You can post hashes here instead to later reveal how clever you were, if you need.
LOOSE LIPS SINK SHIPS
I think it’s better to be angry at the team that launched the nukes?