I'm pleased with this dialogue and glad I did it. Outreach to policymakers is an important & complicated topic. No single post will be able to explain all the nuances, but I think this post explains a lot, and I still think it's a useful resource for people interested in engaging with policymakers.
A lot has changed since this dialogue, and I've also learned a lot since then. Here are a few examples:
- I think it's no longer as useful to emphasize "AI is a big deal for national/global security." This is now pretty well-established.
- Instead, I would encourage people to come up with clear explanations of specific threat models (especially misalignment risks) and concrete proposals (e.g., draft legislative language, memos with specific asks for specific agencies).
- I'd like to see more people write about why AI requires different solutions compared to the "standard DC playbook for dealing with potentially dangerous emerging technologies." As I understand it, the standard playbook is essentially: "If there is a new and dangerous technology, the US needs to make sure that we lead in its development and we are ahead of the curve. The main threats come from our adversaries being able to unlock such technologies faster than us, allowing them to surprise us with new threats." To me, the main reason this playbook doesn't work is because of misalignment risks. Regardless: if you think AI is special (for misalignment reasons or other reasons), I think writing up your takes RE "here's what makes AI special and why it requires a deviation from the standard playbook" is valuable.
- I think people trying to communicate with US policymakers should keep in mind that the US government is primarily concerned with US interests. This is perhaps obvious when stated like this, but I think a lot of comms fail to properly take this into account. As one might expect, this is especially true when foreign organizations try to talk about things from the POV of what would be best for "humanity" or "global society." TBC I think there are many contexts where such analysis is useful. But I think on-the-margin I'd like to see more people thinking from a "realist US" perspective. That means acknowledging that a lot of US stakeholders view a lot of national security and emerging technology issues through the lens of great power competition, maintaining US economic/security dominance, and ensuring that US values continue to shape the world. This doesn't mean that the US would never enter into deals/agreements with other nations, but rather that the case for the deal/agreement will be evaluated primarily from the vantage point of US interests.
- RE learning "DC culture", I don't think there's any substitute for actually going to DC and talking to people. But I think books and case studies can help.
- Recent books I've read: John Boehner's autobiography (former Speaker of the House, Republican), Leon Panetta's autobiography (former CIA Director and Secretary of Defense, Democrat), and The Case for Trump. RE case studies, I've become interested in international security agreements (like the Iran Nuclear Deal and the Chemical Weapons Convention).
- Also interested in understanding decision-making around the 2008 financial crisis, 9/11, the COVID pandemic, and the recent TikTok ban. Many people on this forum believe that there's a non-trivial chance that AI produces a catastrophe or other "big wakeup moment" for policymakers, and I think we need more people with an understanding of history/IR/security studies/DC decision-making.
- I think some people are overconfident in their perceptions of "what Republicans think" and "what Democrats think." There is a lot of within-party split on AI and even more broadly on issues like tech policy, foreign policy, and national security. For example, while Republicans are typically considered more "hawkish", there are plenty of noteworthy counterexamples. Reagan championed negotiations on arms control agreements. See the "Only Nixon could go to China" effect (I haven't looked into it much but it seems plausible). Trump has recently expressed that "China and the US can together solve problems in the world" and described Xi as "an amazing guy."
- I think a lot of comms has focused on arguments with high-context people, but on-the-margin I'd rather see more content that is oriented toward "reasonable newcomers." A lot of content I see is reactionary. It's very tempting to see something Totally Wrong on the internet and want to correct it. And of course, we need some of that to happen– there is value to fact-checking and debating with high-context folks (people who have been thinking about advanced AI for a while) who have different perspectives.
- But on the margin, I think I'd be excited to see more content that is aimed at a "reasonable newcomer"– EG, a national security expert who recently got assigned the task of "understanding what is going on with advanced AI." To some extent, this will require addressing arguments from people who are Totally Wrong™. But I think the more basic and important thing is having good resources that walk them through what you believe, why you believe it's true, and what implications it has. (CC the thing I said earlier about "what makes AI different and why can't we just apply the Standard Playbook for Dangerous Emerging Tech.")
I'll conclude by noting that I remain quite interested in topics like "how to communicate about AI accurately and effectively with policymakers", "what are the best federal AI policy ideas", and "what are the specific points about AI that are most important for policymakers to understand."
If you're interested in any of this, feel free to reach out!
It's so much better if everyone in the company can walk around and tell you what are the top goals of the RSP, how do we know if we're meeting them, what AI safety level are we at right now—are we at ASL-2, are we at ASL-3—that people know what to look for because that is how you're going to have good common knowledge of if something's going wrong.
I like this goal a lot: Good RSPs could contribute to building common language/awareness around several topics (e.g., "if" conditions, "then" commitments, how safety decisions will be handled). As many have pointed out, though, I worry that current RSPs haven't been concrete or clear enough to build this kind of understanding/awareness.
One interesting idea would be to survey company employees and evaluate their understanding of RSPs & the extent to which RSPs are having an impact on internal safety culture. Example questions/topics:
- What is the high-level purpose of the RSP?
- Does the RSP specify "if" triggers (specific thresholds that, if hit, could cause the company to stop scaling or deployment activities)? If so, what are they?
- Does the RSP specify "then" commitments (specific actions that must be taken in order for the company to continue scaling or deployment activities)? If so, what are they?
- Does the RSP specify how decisions about risk management will be made? If so, how will they be made & who are the key players involved?
- Are there any ways in which the RSP has affected your work at Anthropic? If so, how?
One of my concerns about RSPs is that they (at least in their current form) don't actually achieve the goal of building common knowledge/awareness or improving company culture. I suspect surveys like this could prove me wrong– and more importantly, provide scaling companies with useful information about the extent to which their scaling policies are understood by employees, help foster common understanding, etc.
(Another version of this could involve giving multiple RSPs to a third-party– like an AI Safety Institute– and having them answer similar questions. This could provide another useful datapoint RE the extent to which RSPs are clearly/concretely laying out a set of specific or meaningful contributions.)
I found the description of warning fatigue interesting. Do you have takes on the warning fatigue concern?
Warning Fatigue
The playbook for politicians trying to avoid scandals is to release everything piecemeal. You want something like:
- Rumor Says Politician Involved In Impropriety. Whatever, this is barely a headline, tell me when we know what he did.
- Recent Rumor Revealed To Be About Possible Affair. Well, okay, but it’s still a rumor, there’s no evidence.
- New Documents Lend Credence To Affair Rumor. Okay, fine, but we’re not sure those documents are true.
- Politician Admits To Affair. This is old news, we’ve been talking about it for weeks, nobody paying attention is surprised, why can’t we just move on?
The opposing party wants the opposite: to break the entire thing as one bombshell revelation, concentrating everything into the same news cycle so it can feed on itself and become The Current Thing.
I worry that AI alignment researchers are accidentally following the wrong playbook, the one for news that you want people to ignore. They’re very gradually proving the alignment case an inch at a time. Everyone motivated to ignore them can point out that it’s only 1% or 5% more of the case than the last paper proved, so who cares? Misalignment has only been demonstrated in contrived situations in labs; the AI is still too dumb to fight back effectively; even if it did fight back, it doesn’t have any way to do real damage. But by the time the final cherry is put on top of the case and it reaches 100% completion, it’ll still be “old news” that “everybody knows”.
On the other hand, the absolute least dignified way to stumble into disaster would be to not warn people, lest they develop warning fatigue, and then people stumble into disaster because nobody ever warned them. Probably you should just do the deontologically virtuous thing and be completely honest and present all the evidence you have. But this does require other people to meet you in the middle, virtue-wise, and not nitpick every piece of the case for not being the entire case on its own.
@Zvi I'm curious if you have thoughts on Buck's post here (and my comment here) about how empirical evidence of scheming might not cause people to update toward thinking scheming is a legitimate/scary threat model (unless they already had priors or theoretical context that made them concerned about this.)
Do you think this is true? And if so, what implications do you think this has for people who are trying to help the world better understand misalignment/scheming?
I really like the idea of soliciting independent reviews from folks, and I found the reviews interesting and thought-provoking. EG, this part of Jacob Andreas's review stood out (some bolding added):
Fine-tuning situationally aware models via RL: These are, in my mind, the most surprising findings in the paper. Specifically: if we take a model prompted as above, and optimize it toward the new objective via RL, it exhibits small increases in non-faked compliance, and extreme increases in faked compliance, at training time. Non-compliant behavior is almost completely eliminated. Both of these are as expected, given all the results above.
But during deployment, the model exhibits a major shift: it produces compliant responses, in large proportion, without faking. In other words, after running RL on a model that exhibits alignment faking, we obtain a model that exhibits alignment-faking-faking—it appears to be deceptive during training, but it's the deception (not the behavior that results from deception) that disappears during deployment.
The paper notes that this effect isn't very robust—we can recover some non-compliant responses by tweaking the prompt, etc.—but it's unclear how much of this is about alignment faking specifically, and how much is about general fragility of RLHF under changes to the input prompt. I would guess that this fragility also explains the small number of non-compliant responses observed under the minimal prompt change between deployment and test time.
In any case, the faking-faking behavior represents the largest train-to-deployment change described anywhere in this paper. And unlike ordinary faking, I don't see a clear story for why it occurs. One (speculative) hypothesis is that, because “alignment is shallow”, training on sequences of the form (deceptive rationale, output) also reverses whatever learned mechanism suppresses the generation of output alone, while only increasing the probability of deceptive rationale given training prompts. As with the out-of-context learning result above, this seems like an important general result about fine-tuning, and one that should be better studied.
How well LLMs follow which decision theory affects their ability to cooperate. This could mean the difference between peace and conflict in AI-assisted political bargaining or enable AIs to collude when one is meant to monitor the other, undermining human control.
Do you have any thoughts on "red lines" for AI collusion? That is, "if an AI could do X, then we should acknowledge that AIs can likely collude with each other in monitoring setups."
What I believe about Sacks' views comes from regularly listening to the All-In Podcast, where he regularly talks about AI.
Do you have any quotes or any particular podcast episodes you recommend?
if you would ask the Department of Homeland security for their justification there's a good chance that they would say "national security".
Yeah, I agree that one needs to have a pretty narrow conception of national security. In the absence of that, there's concept creep in which you can justify pretty much anything under a broad conception of national security. (Indeed, I suspect that lots of folks on the left justified a lot of general efforts to censor conservatives as a matter of national security//public safety, under the view that a Trump presidency would be disastrous for America//the world//democracy. And this is the kind of thing that clearly violates a narrower conception of national security.)
How to exactly draw the line is a difficult question, but I think most people would clearly be able to see a difference between "preventing model from outputting detailed instructions/plans to develop bioweapons" and "preventing model from voicing support for political positions that some people think are problematic."
Can you quote (or link to) things Sacks has said that give you this impression?
My own impression is that there are many AI policy ideas that don't have anything to do with censorship (e.g., improving government technical capacity, transparency into frontier AI development, emergency preparedness efforts, efforts to increase government "situational awareness", research into HEMs and verification methods). Also things like "an AI model should not output bioweapons or other things that threaten national security" are "censorship" under some very narrow definition of censorship, but IME this is not what people mean when they say they are worried about censorship.
I haven't looked much into Sacks' particular stance here, but I think concerns around censorship are typically along the lines of "the state should not be involved in telling companies what their models can/can't say. This can be weaponized against certain viewpoints, especially conservative viewpoints. Some folks on the left are trying to do this under the guise of terms like misinformation, fairness, and bias."
While a centralized project would get more resources, it also has more ability to pause//investigate things.
So EG if the centralized project researchers see something concerning (perhaps early warning signs of scheming), it seems more able to do a bunch of research on that Concerning Thing before advancing.
I think this makes the effect of centralization on timelines unclear, at least if one expects these kinds of “warning signs” on the pathway to very advanced capabilities.
(It’s plausible that you could get something like this from a decentralized model with sufficient oversight, but this seems much harder, especially if we expect most of the technical talent to stay with the decentralized projects as opposed to joining the oversight body.)
Out of curiosity, what’s the rationale for not having agree/disagree votes on posts? (I feel like pretty much everyone thinks it has been a great feature for comments!)
Thank you for the thoughtful response! It'll push me to move from "amateur spectator" mode to "someone who has actually read all of these posts" mode :)
Optional q: What would you say are the 2-3 most important contributions that have been made in AI control research over the last year?
A few quick thoughts on control (loosely held; I consider myself an "amateur spectator" on this topic and suspect a lot of my opinions may change if I spent 100+ hours engaging with this work):
- I think it's a clever technical approach and part of me likes that it's become "a thing."
- I'm worried that people have sort of "bandwagonned" onto it because (a) there are some high-status people who are involved, (b) there's an interest in finding things for technical researchers to do, and (c) it appeals to AGI companies.
- I think the AI control discourse would improve if it focused more on discussing concrete control strategies (and their limitations) and focused less on the high-level idea that "AI control is super important and a bedrock approach for mitigating catastrophic risk." I sometimes feel like the AI control discourse is relatively shallow and just cites the idea/meme of AI control instead of making concrete empirical, methodological, or strategic contributions.
- In contrast, I think this paper is a meaningful contribution. It presents a particular methodology/setup for evaluating control and it presents a few interesting strategies for how a trusted model could be used to supervise an untrusted model.
- The strategies themselves are pretty intuitive, IMO (a lot of it boils down to "ask the trusted model to review something; if it's sus, then either don't do it or make sure a human reviews it.") Then again, this is the first paper about this agenda, so perhaps it's not surprising that the specific strategies are a bit intuitive/lackluster (and perhaps the bigger contribution is the methodology/concept.)
- Should more technical people work on AI control? My best guess is yes– it does seem like a pretty exciting new field. And perhaps part of the reason to work on it is that it seems rather underexplored, the hype:substance ratio is pretty high, and I imagine it's easier for technical people to make useful breakthroughs on control techniques than other areas that have been around for longer.
- I also think more technical people should work on policy/comms. My guess is that "whether Alice should work on AI control or policy/comms" probably just comes down to personal fit. If she is 90th percentile at both, I'd probably be more excited about her doing policy/comms, but I'm not sure.
- I'll be interested to see Buck's thoughts on the "case for ensuring" post.
Regulation to reduce racing. Government regulation could temper racing between multiple western projects. So there are ways to reduce racing between western projects, besides centralising.
Can you say more about the kinds of regulations you're envisioning? What are your favorite ideas for regulations for (a) the current Overton Window and (b) a wider Overton Window but one that still has some constraints?
I disagree with some of the claims made here, and I think there are several worldview assumptions that go into a lot of these claims. Examples include things like "what do we expect the trajectory to ASI to look like", "how much should we worry about AI takeover risks", "what happens if a single actor ends up controlling [aligned] ASI", "what kinds of regulations can we reasonably expect absent some sort of centralized USG project", and "how much do we expect companies to race to the top on safety absent meaningful USG involvement." (TBC though I don't think it's the responsibility of the authors to go into all of these background assumptions– I think it's good for people to present claims like this even if they don't have time/space to give their Entire Model of Everything.)
Nonetheless, I agree with the bottom-line conclusion: on the margin, I suspect it's more valuable for people to figure out how to make different worlds go well than to figure out which "world" is better. In other words, asking "how do I make Centralized World or Noncentralized World more likely to go well" rather than "which one is better: Centralized World or Noncentralized World?"
More specifically, I think more people should be thinking: "Assume the USG decides to centralize AGI development or pursue some sort of AGI Manhattan Project. At that point, the POTUS or DefSec calls you in and asks you if you have any suggestions for how to maximize the chance of this going well. What do you say?"
One part of my rationale: the decisions about whether or not to centralize will be much harder to influence than decisions about what particular kind of centralized model to go with or what the implementation details of a centralized project should look like. I imagine scenarios in which the "whether to centralize" decision is largely a policy decision that the POTUS and the POTUS's close advisors make, whereas the decision of "how do we actually do this" is something that would be delegated to people lower down the chain (who are both easier to access and more likely to be devoting a lot of time to engaging with arguments about what's desirable.)
What do you think are the biggest mistakes you/LightCone have made in the last ~2 years?
And what do you think a 90th percentile outcome looks like for you/LightCone in 2025? What would success look like?
(Asking out of pure curiosity– I'd have these questions even if LC wasn't fundraising. I figured this was a relevant place to ask, but feel free to ignore it if it's not in the spirit of the post.)
Thanks for spelling it out. I agree that more people should think about these scenarios. I could see something like this triggering central international coordination (or conflict).
(I still don't think this would trigger the USG to take different actions in the near-term, except perhaps "try to be more secret about AGI development" and maybe "commission someone to do some sort of study or analysis on how we would handle these kinds of dynamics & what sorts of international proposals would advance US interests while preventing major conflict." The second thing is a bit optimistic but maybe plausible.)
What kinds of conflicts are you envisioning?
I think if the argument is something along the lines of "maybe at some point other countries will demand that the US stop AI progress", then from the perspective of the USG, I think it's sensible to operate under the perspective of "OK so we need to advance AI progress as much as possible and try to hide some of it, and if at some future time other countries are threatening us we need to figure out how to respond." But I don't think it justifies anything like "we should pause or start initiating international agreements."
(Separately, whether or not it's "truer" depends a lot on one's models of AGI development. Most notably: (a) how likely is misalignment, (b) how slow will takeoff be//will it be very obvious to other nations that super advanced AI is about to be developed, and (c) how will governments and bureaucracies react and will they be able to react quickly enough.)
(Also separately– I do think more people should be thinking about how these international dynamics might play out & if there's anything we can be doing to prepare for them. I just don't think they naturally lead to an "oh, so we should be internationally coordinating" mentality and instead lead to much more of a "we can do whatever we want unless/until other countries get mad at us & we should probably do things more secretly" mentality.)
@davekasten I know you posed this question to us, but I'll throw it back on you :) what's your best-guess answer?
Or perhaps put differently: What do you think are the factors that typically influence whether the cautious folks or the non-cautious folks end up in charge? Are there any historical or recent examples of these camps fighting for power over an important operation?
it's less clear that a non-centralized situation inevitably leads to a decisive strategic advantage for the leading project
Can you say more about what has contributed to this update?
Can you say more about scenarios where you envision a later project happening that has different motivations?
I think in the current zeitgeist, such a project would almost definitely be primarily motivated by beating China. It doesn't seem clear to me that it's good to wait for a new zeitgeist. Reasons:
- A company might develop AGI (or an AI system that is very good at AI R&D that can get to AGI) before a major zeitgeist change.
- The longer we wait, the more capable the "most capable model that wasn't secured" is. So we could risk getting into a scenario where people want to pause but since China and the US both have GPT-Nminus1, both sides feel compelled to race forward (whereas this wouldn't have happened if security had kicked off sooner.)
If you could only have "partial visibility", what are some of the things you would most want the government to be able to know?
Another frame: If alignment turns out to be easy, then the default trajectory seems fine (at least from an alignment POV. You might still be worried about EG concentration of power).
If alignment turns out to be hard, then the policy decisions we make to affect the default trajectory matter a lot more.
This means that even if misalignment risks are relatively low, a lot of value still comes from thinking about worlds where alignment is hard (or perhaps "somewhat hard but not intractably hard").
What do you think are the most important factors for determining if it results in them behaving responsibly later?
For instance, if you were in charge of designing the AI Manhattan Project, are there certain things you would do to try to increase the probability that it leads to the USG "behaving more responsibly later?"
Good points. Suppose you were on a USG taskforce that had concluded they wanted to go with the "subsidy model", but they were willing to ask for certain concessions from industry.
Are there any concessions/arrangements that you would advocate for? Are there any ways to do the "subsidy model" well, or do you think the model is destined to fail even if there were a lot of flexibility RE how to implement it?
My own impression is that this would be an improvement over the status quo. Main reasons:
- A lot of my P(doom) comes from race dynamics.
- Right now, if a leading lab ends up realizing that misalignment risks are super concerning, they can't do much to end the race. Their main strategy would be to go to the USG.
- If the USG runs the Manhattan Project (or there's some sort of soft nationalization in which the government ends up having a much stronger role), it's much easier for the USG to see that misalignment risks are concerning & to do something about it.
- A national project would be more able to slow down and pursue various kinds of international agreements (the national project has more access to POTUS, DoD, NSC, Congress, etc.)
- I expect the USG to be stricter on various security standards. It seems more likely to me that the USG would EG demand a lot of security requirements to prevent model weights or algorithmic insights from leaking to China. One of my major concerns is that people will want to pause at GPT-X but they won't feel able to because China stole access to GPT-Xminus1 (or maybe even a slightly weaker version of GPT-X).
- In general, I feel like USG natsec folks are less "move fast and break things" than folks in SF. While I do think some of the AGI companies have tried to be less "move fast and break things" than the average company, I think corporate race dynamics & the general cultural forces have been the dominant factors and undermined a lot of attempts at meaningful corporate governance.
(Caveat that even though I see this as a likely improvement over status quo, this doesn't mean I think this is the best thing to be advocating for.)
(Second caveat that I haven't thought about this particular question very much and I could definitely be wrong & see a lot of reasonable counterarguments.)
@davekasten @Zvi @habryka @Rob Bensinger @ryan_greenblatt @Buck @tlevin @Richard_Ngo @Daniel Kokotajlo I suspect you might have interesting thoughts on this. (Feel free to ignore though.)
Suppose the US government pursued a "Manhattan Project for AGI". At its onset, it's primarily fuelled by a desire to beat China to AGI. However, there's some chance that its motivation shifts over time (e.g., if the government ends up thinking that misalignment risks are a big deal, its approach to AGI might change.)
Do you think this would be (a) better than the current situation, (b) worse than the current situation, or (c) it depends on XYZ factors?
We're not going to be bottlenecked by politicians not caring about AI safety. As AI gets crazier and crazier everyone would want to do AI safety, and the question is guiding people to the right AI safety policies
I think we're seeing more interest in AI, but I think interest in "AI in general" and "AI through the lens of great power competition with China" has vastly outpaced interest in "AI safety". (Especially if we're using a narrow definition of AI safety; note that people in DC often use the term "AI safety" to refer to a much broader set of concerns than AGI safety/misalignment concerns.)
I do think there's some truth to the quote (we are seeing more interest in AI and some safety topics), but I think there's still a lot to do to increase the salience of AI safety (and in particular AGI alignment) concerns.
A few quotes that stood out to me:
Greg:
I hope for us to enter the field as a neutral group, looking to collaborate widely and shift the dialog towards being about humanity winning rather than any particular group or company.
Greg and Ilya (to Elon):
The goal of OpenAI is to make the future good and to avoid an AGI dictatorship. You are concerned that Demis could create an AGI dictatorship. So do we. So it is a bad idea to create a structure where you could become a dictator if you chose to, especially given that we can create some other structure that avoids this possibility.
Greg and Ilya (to Altman):
But we haven't been able to fully trust your judgements throughout this process, because we don't understand your cost function.
We don't understand why the CEO title is so important to you. Your stated reasons have changed, and it's hard to really understand what's driving it.
Is AGI truly your primary motivation? How does it connect to your political goals? How has your thought process changed over time?
and recently founded another AI company
Potentially a hot take, but I feel like xAI's contributions to race dynamics (at least thus far) have been relatively trivial. I am usually skeptical of the whole "I need to start an AI company to have a seat at the table", but I do imagine that Elon owning an AI company strengthens his voice. And I think his AI-related comms have mostly been used to (a) raise awareness about AI risk, (b) raise concerns about OpenAI/Altman, and (c) endorse SB1047 [which he did even faster and less ambiguously than Anthropic].
The counterargument here is that maybe if xAI was in 1st place, Elon's positions would shift. I find this plausible, but I also find it plausible that Musk (a) actually cares a lot about AI safety, (b) doesn't trust the other players in the race, and (c) is more likely to use his influence to help policymakers understand AI risk than any of the other lab CEOs.
I agree with many points here and have been excited about AE Studio's outreach. Quick thoughts on China/international AI governance:
- I think some international AI governance proposals have some sort of "kum ba yah, we'll all just get along" flavor/tone to them, or some sort of "we should do this because it's best for the world as a whole" vibe. This isn't even Dem-coded so much as it is naive-coded, especially in DC circles.
- US foreign policy is dominated primarily by concerns about US interests. Other considerations can matter, but they are not the dominant driving force. My impression is that this is true within both parties (with a few exceptions).
- I think folks interested in international AI governance should study international security agreements and try to get a better understanding of relevant historical case studies. Lots of stuff to absorb from the Cold War, the Iran Nuclear Deal, US-China relations over the last several decades, etc. (I've been doing this & have found it quite helpful.)
- Strong Republican leaders can still engage in bilateral/multilateral agreements that serve US interests. Recall that Reagan negotiated arms control agreements with the Soviet Union, and the (first) Trump Administration facilitated the Abraham Accords. Being "tough on China" doesn't mean "there are literally no circumstances in which I would be willing to sign a deal with China." (But there likely does have to be a clear case that the deal serves US interests, has appropriate verification methods, etc.)
Did they have any points that you found especially helpful, surprising, or interesting? Anything you think folks in AI policy might not be thinking enough about?
(Separately, I hope to listen to these at some point & send reactions if I have any.)
you have to spend several years resume-building before painstakingly convincing people you're worth hiring for paid work
For government roles, I think "years of experience" is definitely an important factor. But I don't think you need to have been specializing for government roles specifically.
Especially for AI policy, there are several programs that are basically like "hey, if you have AI expertise but no background in policy, we want your help." To be clear, these are often still fairly competitive, but I think it's much more about being generally capable/competent and less about having optimized your resume for policy roles.
I like the section where you list out specific things you think people should do. (One objection I sometimes hear is something like "I know that [evals/RSPs/if-then plans/misc] are not sufficient, but I just don't really know what else there is to do. It feels like you either have to commit to something tangible that doesn't solve the whole problem or you just get lost in a depressed doom spiral.")
I think your section on suggestions could be stronger by presenting more ambitious/impactful stories of comms/advocacy. I think there's something tricky about a document that has the vibe "this is the most important issue in the world and pretty much everyone else is approaching it the wrong way" and then pivots to "and the right way to approach it is to post on Twitter and talk to your friends."
My guess is that you prioritized listing things that were relatively low friction and accessible. (And tbc I do think that the world would be in better shape if more people were sharing their views and contributing to the broad discourse.)
But I think when you're talking to high-context AIS people who are willing to devote their entire career to work on AI Safety, they'll be interested in more ambitious/sexy/impactful ways of contributing.
Put differently: Should I really quit my job at [fancy company or high-status technical safety group] to Tweet about my takes, talk to my family/friends, and maybe make some website? Or are there other paths I could pursue?
As I wrote here, I think we have some of those ambitious/sexy/high-impact role models that could be used to make this pitch stronger, more ambitious, and more inspiring. EG:
One possible critique is that their suggestions are not particularly ambitious. This is likely because they're writing for a broader audience (people who haven't been deeply engaged in AI safety).
For people who have been deeply engaged in AI safety, I think the natural steelman here is "focus on helping the public/government better understand the AI risk situation."
There are at least some impactful and high-status examples of this (e.g., Hinton, Bengio, Hendrycks). I think in the last few years, for instance, most people would agree that Hinton/Bengio/Hendrycks have had far more impact in their communications/outreach/policy work than their technical research work.
And it's not just the famous people– I can think of ~10 junior or mid-career people who left technical research in the last year to help policymakers better understand AI progress and AI risk, and I think their work is likely far more impactful than if they had stayed in technical research. (And I think this is true even if I exclude technical people who are working on evals/if-then plans in govt. Like, I'm focusing on people who see their primary purpose as helping the public or policymakers develop "situational awareness", develop stronger models of AI progress and AI risk, understand the conceptual arguments for misalignment risk, etc.)
I'd also be curious to hear what your thoughts are on people joining government organizations (like the US AI Safety Institute, UK AI Safety Institute, Horizon Fellowship, etc.) Most of your suggestions seem to involve contributing from outside government, and I'd be curious to hear more about your suggestions for people who are either working in government or open to working in government.
I think there's something about Bay Area culture that can often get technical people to feel like the only valid way to contribute is through technical work. It's higher status and sexier and there's a default vibe that the best way to understand/improve the world is through rigorous empirical research.
I think this is an incorrect (or at least incomplete) frame, and I think on-the-margin it would be good for more technical people to spend 1-5 days seriously thinking about what alternative paths they could pursue in comms/policy.
I also think there are memes spreading around that you need to be some savant political mastermind genius to do comms/policy, otherwise you will be net negative. The more I meet policy people (including successful policy people from outside the AIS bubble), the more I think this narrative was, at best, an incorrect model of the world and, at worst, a take that got amplified in order to prevent people from interfering with the AGI race (e.g., by granting excess status+validity to people/ideas/frames that made it seem crazy/unilateralist/low-status to engage in public outreach, civic discourse, and policymaker engagement).
(Caveat: I don't think the adversarial frame explains everything, and I do think there are lots of people who were genuinely trying to reason about a complex world and just ended up underestimating how much policy interest there would be and/or overestimating the extent to which labs would be able to take useful actions despite the pressures of race dynamics.)
One of the common arguments in favor of investing more resources into current governance approaches (e.g., evals, if-then plans, RSPs) is that there's nothing else we can do. There's not a better alternative– these are the only things that labs and governments are currently willing to support.
The Compendium argues that there are other (valuable) things that people can do, with most of these actions focusing on communicating about AGI risks. Examples:
- Share a link to this Compendium online or with friends, and provide your feedback on which ideas are correct and which are unconvincing. This is a living document, and your suggestions will shape our arguments.
- Post your views on AGI risk to social media, explaining why you believe it to be a legitimate problem (or not).
- Red-team companies’ plans to deal with AI risk, and call them out publicly if they do not have a legible plan.
One possible critique is that their suggestions are not particularly ambitious. This is likely because they're writing for a broader audience (people who haven't been deeply engaged in AI safety).
For people who have been deeply engaged in AI safety, I think the natural steelman here is "focus on helping the public/government better understand the AI risk situation."
There are at least some impactful and high-status examples of this (e.g., Hinton, Bengio, Hendrycks). I think in the last few years, for instance, most people would agree that Hinton/Bengio/Hendrycks have had far more impact in their communications/outreach/policy work than their technical research work.
And it's not just the famous people– I can think of ~10 junior or mid-career people who left technical research in the last year to help policymakers better understand AI progress and AI risk, and I think their work is likely far more impactful than if they had stayed in technical research. (And I'm even excluding people who are working on evals/if-then plans: like, I'm focusing on people who see their primary purpose as helping the public or policymakers develop "situational awareness", develop stronger models of AI progress and AI risk, understand the conceptual arguments for misalignment risk, etc.)
I appreciated their section on AI governance. The "if-then"/RSP/preparedness frame has become popular, and they directly argue for why they oppose this direction. (I'm a fan of preparedness efforts– especially on the government level– but I think it's worth engaging with the counterarguments.)
Pasting some content from their piece below.
High-level thesis against current AI governance efforts:
The majority of existing AI safety efforts are reactive rather than proactive, which inherently puts humanity in the position of managing risk rather than controlling AI development and preventing it.
Critique of reactive frameworks:
1. The reactive framework reverses the burden of proof from how society typically regulates high-risk technologies and industries.
In most areas of law, we do not wait for harm to occur before implementing safeguards. Banks are prohibited from facilitating money laundering from the moment of incorporation, not after their first offense. Nuclear power plants must demonstrate safety measures before operation, not after a meltdown.
The reactive framework problematically reverses the burden of proof. It assumes AI systems are safe by default and only requires action once risks are detected. One of the core dangers of AI systems is precisely that we do not know what they will do or how powerful they will be before we train them. The if-then framework opts to proceed until problems arise, rather than pausing development and deployment until we can guarantee safety. This implicitly endorses the current race to AGI.
This reversal is exactly what makes the reactive framework preferable for AI companies.
Critique of waiting for warning shots:
3. The reactive framework incorrectly assumes that an AI “warning shot” will motivate coordination.
Imagine an extreme situation in which an AI disaster serves as a “warning shot” for humanity. This would imply that powerful AI has been developed and that we have months (or less) to develop safety measures or pause further development. After a certain point, an actor with sufficiently advanced AI may be ungovernable, and misaligned AI may be uncontrollable.
When horrible things happen, people do not suddenly become rational. In the face of an AI disaster, we should expect chaos, adversariality, and fear to be the norm, making coordination very difficult. The useful time to facilitate coordination is before disaster strikes.
However, the reactive framework assumes that this is essentially how we will build consensus in order to regulate AI. The optimistic case is that we hit a dangerous threshold before a real AI disaster, alerting humanity to the risks. But history shows that it is exactly in such moments that these thresholds are most contested; this shifting of the goalposts is known as the AI Effect and is common enough to have its own Wikipedia page. Time and again, AI advancements have been explained away as routine processes, whereas “real AI” is redefined to be some mystical threshold we have not yet reached. Dangerous capabilities are similarly contested as they arise, such as how recent reports of OpenAI’s o1 being deceptive have been questioned.
This will become increasingly common as competitors build increasingly powerful capabilities and approach their goal of building AGI. Universally, powerful stakeholders fight for their narrow interests, and for maintaining the status quo, and they often win, even when all of society is going to lose. Big Tobacco didn’t pause cigarette-making when they learned about lung cancer; instead they spread misinformation and hired lobbyists. Big Oil didn’t pause drilling when they learned about climate change; instead they spread misinformation and hired lobbyists. Likewise, now that billions of dollars are pouring into the creation of AGI and superintelligence, we’ve already seen competitors fight tooth and nail to keep building. If problems arise in the future, of course they will fight for their narrow interests, just as industries always do. And as the AI industry gets larger, more entrenched, and more essential over time, this problem will grow rapidly worse.
TY. Some quick reactions below:
- Agree that the public stuff has immediate effects that could be costly. (Hiding stuff from the public, refraining from discussing important concerns publicly, or developing a reputation for being kinda secretive/sus can also be costly; seems like an overall complex thing to model IMO.)
- Sharing info with government could increase the chance of a leak, especially if security isn't great. I expect the most relevant info is info that wouldn't be all-that-costly if leaked (e.g., the government doesn't need OpenAI to share its secret sauce/algorithmic secrets). Dangerous capability eval results leaking or capability forecasts leaking seem less costly, except from a "maybe people will respond by demanding more govt oversight" POV.
- I think all-in-all I still see the main cost as making safety regulation more likely, but I'm more uncertain now, and this doesn't seem like a particularly important/decision-relevant point. Will edit the OG comment to language that I endorse with more confidence.
@Zach Stein-Perlman @ryan_greenblatt feel free to ignore but I'd be curious for one of you to explain your disagree react. Feel free to articulate some of the ways in which you think I might be underestimating the costliness of transparency requirements.
(My current estimate is that whistleblower mechanisms seem very easy to maintain, reporting requirements for natsec capabilities seem relatively easy insofar as most of the information is stuff you already planned to collect, and even many of the more involved transparency ideas (e.g., interview programs) seem like they could be implemented with pretty minimal time-cost.)
Yeah, I think there's a useful distinction between two different kinds of "critiques:"
- Critique #1: I have reviewed the preparedness framework and I think the threshold for "high-risk" in the model autonomy category is too high. Here's an alternative threshold.
- Critique #2: The entire RSP/PF effort is not going to work because [they're too vague//labs don't want to make them more specific//they're being used for safety-washing//labs will break or weaken the RSPs//race dynamics will force labs to break RSPs//labs cannot be trusted to make or follow RSPs that are sufficiently strong/specific/verifiable].
I feel like critique #1 falls more neatly into "this counts as lab governance" whereas IMO critique #2 falls more into "this is a critique of lab governance." In practice the lines blur. For example, I think last year there was a lot more "critique #1" style stuff, and then over time as the list of specific object-level critiques grew, we started to see more support for things in the "critique #2" bucket.
Perhaps this isn’t in scope, but if I were designing a reading list on “lab governance”, I would try to include at least 1-2 perspectives that highlight the limitations of lab governance, criticisms of focusing too much on lab governance, etc.
Specific examples might include criticisms of RSPs, Kelsey’s coverage of the OpenAI NDA stuff, alleged instances of labs or lab CEOs misleading the public/policymakers, and perspectives from folks like Tegmark and Leahy (who generally see a lot of lab governance as safety-washing and probably have less trust in lab CEOs than the median AIS person).
(Perhaps such perspectives get covered in other units, but part of me still feels like it’s pretty important for a lab governance reading list to include some of these more “fundamental” critiques of lab governance. Especially insofar as, broadly speaking, I think a lot of AIS folks were more optimistic about lab governance 1-3 years ago than they are now.)
Henry from SaferAI claims that the new RSP is weaker and vaguer than the old RSP. Do others have thoughts on this claim? (I haven't had time to evaluate yet.)
Main Issue: Shift from precise definitions to vague descriptions.
The primary issue lies in Anthropic's shift away from precisely defined capability thresholds and mitigation measures. The new policy adopts more qualitative descriptions, specifying the capability levels they aim to detect and the objectives of mitigations, but it lacks concrete details on the mitigations and evaluations themselves. This shift significantly reduces transparency and accountability, essentially asking us to accept a "trust us to handle it appropriately" approach rather than providing verifiable commitments and metrics.
More from him:
Example: Changes in capability thresholds.
To illustrate this change, let's look at a capability threshold:
1️⃣ Version 1 (V1): AI Safety Level 3 (ASL-3) was defined as "The model shows early signs of autonomous self-replication ability, as defined by a 50% aggregate success rate on the tasks listed in [Appendix on Autonomy Evaluations]."
2️⃣ Version 2 (V2): ASL-3 is now defined as "The ability to either fully automate the work of an entry-level remote-only researcher at Anthropic, or cause dramatic acceleration in the rate of effective scaling" (quantified as an increase of approximately 1000x in a year).
In V2, the thresholds are no longer defined by quantitative benchmarks. Anthropic now states that they will demonstrate that the model's capabilities are below these thresholds when necessary. However, this approach is susceptible to shifting goalposts as capabilities advance.
🔄 Commitment Changes: Dilution of mitigation strategies.
A similar trend is evident in their mitigation strategies. Instead of detailing specific measures, they focus on mitigation objectives, stating they will prove these objectives are met when required. This change alters the nature of their commitments.
💡 Key Point: Committing to robust measures and then diluting them significantly is not how genuine commitments are upheld.
The general direction of these changes is concerning. By allowing more leeway to decide if a model meets thresholds, Anthropic risks prioritizing scaling over safety, especially as competitive pressures intensify.
I was expecting the RSP to become more specific as technology advances and their risk management process matures, not the other way around.
@Drake Thomas are you interested in talking about other opportunities that might be better for the world than your current position (and meet other preferences of yours)? Or are you primarily interested in the "is my current position net positive or net negative for the world" question?
Does anyone know why Anthropic doesn't want models with powerful cyber capabilities to be classified as "dual-use foundation models?"
In its BIS comment, Anthropic proposes a new definition of dual-use foundation model that excludes cyberoffensive capabilities. This also comes up in TechNet's response (TechNet is a trade association that Anthropic is a part of).
Does anyone know why Anthropic doesn't want the cyber component of the definition to remain? (I don't think they cover this in the comment).
---
More details– the original criteria for "dual-use foundation model" proposed by BIS are:
(1) Substantially lowering the barrier of entry for non-experts to design, synthesize, acquire, or use chemical, biological, radiological, or nuclear (CBRN) weapons;
(2) Enabling powerful offensive cyber operations through automated vulnerability discovery and exploitation against a wide range of potential targets of cyberattacks; or
(3) Permitting the evasion of human control or oversight through means of deception or obfuscation.
Anthropic's proposed definition includes criteria #1 and #3 but excludes criterion #2.
(Separately, Anthropic argues that dual-use foundation models should be defined as those that pose catastrophic risks as opposed to serious risks to national security. This is important too, but I'm less confused about why Anthropic wants this.)
Where does he say this? (I skimmed and didn’t see it.)
Link here: https://www.astralcodexten.com/p/sb-1047-our-side-of-the-story
48 entities gave feedback on the Department of Commerce AI reporting requirements.
Public comments offering feedback on BIS's proposed reporting requirements are now up! BIS received responses from 48 entities including OpenAI, Anthropic, and many AI safety groups.
The reporting requirements are probably one of the most important things happening in US AI policy-- I'd encourage folks here to find time to skim some of the comments.
@ryan_greenblatt can you say more about what you mean by this one?
- We can't find countermeasures such that our control evaluations indicate any real safety.
I appreciate this distinction between the different types of outcomes-- nice.
I'm pretty interested in people's forecasts for each. So revised questions would be:
- Evals are trustworthy but say that the model is no longer safe (or rather something like "there is a non-trivial probability that the model may be dangerous/scheming."). Using your best judgment, at what point do you expect sabotage evaluations will reveal that a model might not be safe (we're not confident that it's not scheming)? At what capability threshold do you expect we would no longer be able to say "yup, we are confident that the model is not scheming"? (Feel free to use heuristics like the OpenAI PF capability thresholds or ASL levels).
- Evals are no longer trustworthy and we notice. Using your best judgment, at what point do you expect sabotage evaluations will no longer be useful evals for determining whether or not a model is scheming? At what capability threshold do you expect we would no longer be able to say "yup, we are confident that these evals are trustworthy/useful"?
- Evals are no longer trustworthy and we do not notice. Using your best judgment, how likely is it that sabotage evaluations are undermined such that humans think the model can be deployed safely, but actually the humans were misled? At what capability threshold would you worry about this form of undermining?
@David Duvenaud would be curious about your takes, even if they're just speculative guesses/predictions.
(And @ryan_greenblatt would be curious for your takes but replace "sabotage evaluations" with "control evaluations." In the event that you've written this up elsewhere, feel free to link it– helpful for me if you quote the particular section.)
I think I agree with this idea in principle, but I also feel like it misses some things in practice (or something). Some considerations:
- I think my bar for "how much I trust a lab such that I'm OK with them not making transparency commitments" is fairly high. I don't think any existing lab meets that bar.
- I feel like a lot of forms of helpful transparency are not that costly.
EDIT: I think one of the most noteworthy costs is something like "maybe the government will end up regulating the sector if/when it understands how dangerous industry people expect AI systems to be and how many safety/security concerns they have". But I think things like "report dangerous stuff to the govt", "have a whistleblower mechanism", and even "make it clear that you're willing to have govt people come and ask about safety/security concerns" don't seem very costly from an immediate time/effort perspective.
- If a Responsible Company implemented transparency stuff unilaterally, it would make it easier for the government to have proof-of-concept and implement the same requirements for other companies. In a lot of cases, showing that a concept works for company X (and that company X actually thinks it's a good thing) can reduce a lot of friction in getting things applied to companies Y and Z.
I do agree that some of this depends on the type of transparency commitment and there might be specific types of transparency commitments that don't make sense to pursue unilaterally. Off the top of my head, I can't think of any transparency requirements that I wouldn't want to see implemented unilaterally, and I can think of several that I would want to see (e.g., dangerous capability reports, capability forecasts, whistleblower mechanisms, sharing if-then plans with govt, sharing shutdown plans with govt, setting up interview program with govt, engaging publicly with threat models, having clear OpenAI-style tables that spell out which dangerous capabilities you're tracking/expecting).
Some ideas relating to comms/policy:
- Communicate your models of AI risk to policymakers
- Help policymakers understand emergency scenarios (especially misalignment scenarios) and how to prepare for them
- Use your lobbying/policy teams primarily to raise awareness about AGI and help policymakers prepare for potential AGI-related global security risks.
- Develop simple/clear frameworks that describe which dangerous capabilities you are tracking (I think OpenAI's preparedness framework is a good example, particularly RE simplicity/clarity/readability.)
- Advocate for increased transparency into frontier AI development through measures like stronger reporting requirements, whistleblower mechanisms, embedded auditors/resident inspectors, etc.
- Publicly discuss threat models (kudos to DeepMind)
- Engage in public discussions/debates with people like Hinton, Bengio, Hendrycks, Kokotajlo, etc.
- Encourage employees to engage in such discussions/debates, share their threat models, etc.
- Make capability forecasts public (predictions for when models would have XYZ capabilities)
- Communicate under what circumstances you think major government involvement would be necessary (e.g., nationalization, "CERN for AI" setups).