Thank you for the thoughtful response! It'll push me to move from "amateur spectator" mode to "someone who has actually read all of these posts" mode :)
Optional q: What would you say are the 2-3 most important contributions that have been made in AI control research over the last year?
A few quick thoughts on control (loosely held; I consider myself an "amateur spectator" on this topic and suspect a lot of my opinions may change if I spent 100+ hours engaging with this work):
- I think it's a clever technical approach and part of me likes that it's become "a thing."
- I'm worried that people have sort of "bandwagoned" onto it because (a) there are some high-status people who are involved, (b) there's an interest in finding things for technical researchers to do, and (c) it appeals to AGI companies.
- I think the AI control discourse would improve if it focused more on discussing concrete control strategies (and their limitations) and focused less on the high-level idea that "AI control is super important and a bedrock approach for mitigating catastrophic risk." I sometimes feel like the AI control discourse is relatively shallow and just cites the idea/meme of AI control instead of making concrete empirical, methodological, or strategic contributions.
- In contrast, I think this paper is a meaningful contribution. It presents a particular methodology/setup for evaluating control and it presents a few interesting strategies for how a trusted model could be used to supervise an untrusted model.
- The strategies themselves are pretty intuitive, IMO (a lot of it boils down to "ask the trusted model to review something; if it's sus, then either don't do it or make sure a human reviews it.") Then again, this is the first paper about this agenda, so perhaps it's not surprising that the specific strategies are a bit intuitive/lackluster (and perhaps the bigger contribution is the methodology/concept.)
- Should more technical people work on AI control? My best guess is yes– it does seem like a pretty exciting new field. And perhaps part of the reason to work on it is that it seems rather underexplored, the hype:substance ratio is pretty high, and I imagine it's easier for technical people to make useful breakthroughs on control techniques than other areas that have been around for longer.
- I also think more technical people should work on policy/comms. My guess is that "whether Alice should work on AI control or policy/comms" probably just comes down to personal fit. If she's 90th percentile at both, I'd probably be more excited about her doing policy/comms, but I'm not sure.
- I'll be interested to see Buck's thoughts on the "case for ensuring" post.
Regulation to reduce racing. Government regulation could temper racing between multiple western projects. So there are ways to reduce racing between western projects, besides centralising.
Can you say more about the kinds of regulations you're envisioning? What are your favorite ideas for regulations for (a) the current Overton Window and (b) a wider Overton Window but one that still has some constraints?
I disagree with some of the claims made here, and I think there are several worldview assumptions that go into a lot of these claims. Examples include things like "what do we expect the trajectory to ASI to look like", "how much should we worry about AI takeover risks", "what happens if a single actor ends up controlling [aligned] ASI", "what kinds of regulations can we reasonably expect absent some sort of centralized USG project", and "how much do we expect companies to race to the top on safety absent meaningful USG involvement." (TBC though I don't think it's the responsibility of the authors to go into all of these background assumptions– I think it's good for people to present claims like this even if they don't have time/space to give their Entire Model of Everything.)
Nonetheless, I agree with the bottom-line conclusion: on the margin, I suspect it's more valuable for people to figure out how to make different worlds go well than to figure out which "world" is better. In other words, asking "how do I make Centralized World or Noncentralized World more likely to go well" rather than "which one is better: Centralized World or Noncentralized World?"
More specifically, I think more people should be thinking: "Assume the USG decides to centralize AGI development or pursue some sort of AGI Manhattan Project. At that point, the POTUS or DefSec calls you in and asks you if you have any suggestions for how to maximize the chance of this going well. What do you say?"
One part of my rationale: the decisions about whether or not to centralize will be much harder to influence than decisions about what particular kind of centralized model to go with or what the implementation details of a centralized project should look like. I imagine scenarios in which the "whether to centralize" decision is largely a policy decision that the POTUS and the POTUS's close advisors make, whereas the decision of "how do we actually do this" is something that would be delegated to people lower down the chain (who are both easier to access and more likely to be devoting a lot of time to engaging with arguments about what's desirable.)
What do you think are the biggest mistakes you/LightCone have made in the last ~2 years?
And what do you think a 90th percentile outcome looks like for you/LightCone in 2025? What would success look like?
(Asking out of pure curiosity– I'd have these questions even if LC wasn't fundraising. I figured this was a relevant place to ask, but feel free to ignore it if it's not in the spirit of the post.)
Thanks for spelling it out. I agree that more people should think about these scenarios. I could see something like this triggering central international coordination (or conflict).
(I still don't think this would trigger the USG to take different actions in the near-term, except perhaps "try to be more secret about AGI development" and maybe "commission someone to do some sort of study or analysis on how we would handle these kinds of dynamics & what sorts of international proposals would advance US interests while preventing major conflict." The second thing is a bit optimistic but maybe plausible.)
What kinds of conflicts are you envisioning?
I think if the argument is something along the lines of "maybe at some point other countries will demand that the US stop AI progress", then from the perspective of the USG, I think it's sensible to operate under the perspective of "OK so we need to advance AI progress as much as possible and try to hide some of it, and if at some future time other countries are threatening us we need to figure out how to respond." But I don't think it justifies anything like "we should pause or start initiating international agreements."
(Separately, whether or not it's "truer" depends a lot on one's models of AGI development. Most notably: (a) how likely is misalignment and (b) how slow will takeoff be//will it be very obvious to other nations that super advanced AI is about to be developed, and (c) how will governments and bureaucracies react and will they be able to react quickly enough.)
(Also separately– I do think more people should be thinking about how these international dynamics might play out & if there's anything we can be doing to prepare for them. I just don't think they naturally lead to an "oh, so we should be internationally coordinating" mentality and instead lead to much more of a "we can do whatever we want unless/until other countries get mad at us & we should probably do things more secretly" mentality.)
@davekasten I know you posed this question to us, but I'll throw it back on you :) what's your best-guess answer?
Or perhaps put differently: What do you think are the factors that typically influence whether the cautious folks or the non-cautious folks end up in charge? Are there any historical or recent examples of these camps fighting for power over an important operation?
it's less clear that a non-centralized situation inevitably leads to a decisive strategic advantage for the leading project
Can you say more about what has contributed to this update?
Can you say more about scenarios where you envision a later project happening that has different motivations?
I think in the current zeitgeist, such a project would almost definitely be primarily motivated by beating China. It doesn't seem clear to me that it's good to wait for a new zeitgeist. Reasons:
- A company might develop AGI (or an AI system that is very good at AI R&D that can get to AGI) before a major zeitgeist change.
- The longer we wait, the more capable the "most capable model that wasn't secured" is. So we could risk getting into a scenario where people want to pause but since China and the US both have GPT-Nminus1, both sides feel compelled to race forward (whereas this wouldn't have happened if security had kicked off sooner.)
If you could only have "partial visibility", what are some of the things you would most want the government to be able to know?
Another frame: If alignment turns out to be easy, then the default trajectory seems fine (at least from an alignment POV. You might still be worried about EG concentration of power).
If alignment turns out to be hard, then the policy decisions we make to affect the default trajectory matter a lot more.
This means that even if misalignment risks are relatively low, a lot of value still comes from thinking about worlds where alignment is hard (or perhaps "somewhat hard but not intractably hard").
What do you think are the most important factors for determining if it results in them behaving responsibly later?
For instance, if you were in charge of designing the AI Manhattan Project, are there certain things you would do to try to increase the probability that it leads to the USG "behaving more responsibly later?"
Good points. Suppose you were on a USG taskforce that had concluded they wanted to go with the "subsidy model", but they were willing to ask for certain concessions from industry.
Are there any concessions/arrangements that you would advocate for? Are there any ways to do the "subsidy model" well, or do you think the model is destined to fail even if there were a lot of flexibility RE how to implement it?
My own impression is that this would be an improvement over the status quo. Main reasons:
- A lot of my P(doom) comes from race dynamics.
- Right now, if a leading lab ends up realizing that misalignment risks are super concerning, they can't do much to end the race. Their main strategy would be to go to the USG.
- If the USG runs the Manhattan Project (or there's some sort of soft nationalization in which the government ends up having a much stronger role), it's much easier for the USG to see that misalignment risks are concerning & to do something about it.
- A national project would be more able to slow down and pursue various kinds of international agreements (the national project has more access to POTUS, DoD, NSC, Congress, etc.)
- I expect the USG to be stricter on various security standards. It seems more likely to me that the USG would EG demand a lot of security requirements to prevent model weights or algorithmic insights from leaking to China. One of my major concerns is that people will want to pause at GPT-X but they won't feel able to because China stole access to GPT-Xminus1 (or maybe even a slightly weaker version of GPT-X).
- In general, I feel like USG natsec folks are less "move fast and break things" than folks in SF. While I do think some of the AGI companies have tried to be less "move fast and break things" than the average company, I think corporate race dynamics & the general cultural forces have been the dominant factors and undermined a lot of attempts at meaningful corporate governance.
(Caveat that even though I see this as a likely improvement over status quo, this doesn't mean I think this is the best thing to be advocating for.)
(Second caveat that I haven't thought about this particular question very much and I could definitely be wrong & see a lot of reasonable counterarguments.)
@davekasten @Zvi @habryka @Rob Bensinger @ryan_greenblatt @Buck @tlevin @Richard_Ngo @Daniel Kokotajlo I suspect you might have interesting thoughts on this. (Feel free to ignore though.)
Suppose the US government pursued a "Manhattan Project for AGI". At its onset, it's primarily fuelled by a desire to beat China to AGI. However, there's some chance that its motivation shifts over time (e.g., if the government ends up thinking that misalignment risks are a big deal, its approach to AGI might change.)
Do you think this would be (a) better than the current situation, (b) worse than the current situation, or (c) it depends on XYZ factors?
We're not going to be bottlenecked by politicians not caring about AI safety. As AI gets crazier and crazier everyone would want to do AI safety, and the question is guiding people to the right AI safety policies
I think we're seeing more interest in AI, but I think interest in "AI in general" and "AI through the lens of great power competition with China" has vastly outpaced interest in "AI safety". (Especially if we're using a narrow definition of AI safety; note that people in DC often use the term "AI safety" to refer to a much broader set of concerns than AGI safety/misalignment concerns.)
I do think there's some truth to the quote (we are seeing more interest in AI and some safety topics), but I think there's still a lot to do to increase the salience of AI safety (and in particular AGI alignment) concerns.
A few quotes that stood out to me:
Greg:
I hope for us to enter the field as a neutral group, looking to collaborate widely and shift the dialog towards being about humanity winning rather than any particular group or company.
Greg and Ilya (to Elon):
The goal of OpenAI is to make the future good and to avoid an AGI dictatorship. You are concerned that Demis could create an AGI dictatorship. So do we. So it is a bad idea to create a structure where you could become a dictator if you chose to, especially given that we can create some other structure that avoids this possibility.
Greg and Ilya (to Altman):
But we haven't been able to fully trust your judgements throughout this process, because we don't understand your cost function.
We don't understand why the CEO title is so important to you. Your stated reasons have changed, and it's hard to really understand what's driving it.
Is AGI truly your primary motivation? How does it connect to your political goals? How has your thought process changed over time?
and recently founded another AI company
Potentially a hot take, but I feel like xAI's contributions to race dynamics (at least thus far) have been relatively trivial. I am usually skeptical of the whole "I need to start an AI company to have a seat at the table", but I do imagine that Elon owning an AI company strengthens his voice. And I think his AI-related comms have mostly been used to (a) raise awareness about AI risk, (b) raise concerns about OpenAI/Altman, and (c) endorse SB1047 [which he did even faster and less ambiguously than Anthropic].
The counterargument here is that maybe if xAI was in 1st place, Elon's positions would shift. I find this plausible, but I also find it plausible that Musk (a) actually cares a lot about AI safety, (b) doesn't trust the other players in the race, and (c) is more likely to use his influence to help policymakers understand AI risk than any of the other lab CEOs.
I agree with many points here and have been excited about AE Studio's outreach. Quick thoughts on China/international AI governance:
- I think some international AI governance proposals have some sort of "kum ba yah, we'll all just get along" flavor/tone to them, or some sort of "we should do this because it's best for the world as a whole" vibe. This isn't even Dem-coded so much as it is naive-coded, especially in DC circles.
- US foreign policy is dominated primarily by concerns about US interests. Other considerations can matter, but they are not the dominant driving force. My impression is that this is true within both parties (with a few exceptions).
- I think folks interested in international AI governance should study international security agreements and try to get a better understanding of relevant historical case studies. Lots of stuff to absorb from the Cold War, the Iran Nuclear Deal, US-China relations over the last several decades, etc. (I've been doing this & have found it quite helpful.)
- Strong Republican leaders can still engage in bilateral/multilateral agreements that serve US interests. Recall that Reagan negotiated arms control agreements with the Soviet Union, and the (first) Trump Administration facilitated the Abraham Accords. Being "tough on China" doesn't mean "there are literally no circumstances in which I would be willing to sign a deal with China." (But there likely does have to be a clear case that the deal serves US interests, has appropriate verification methods, etc.)
Did they have any points that you found especially helpful, surprising, or interesting? Anything you think folks in AI policy might not be thinking enough about?
(Separately, I hope to listen to these at some point & send reactions if I have any.)
you have to spend several years resume-building before painstakingly convincing people you're worth hiring for paid work
For government roles, I think "years of experience" is definitely an important factor. But I don't think you need to have been specializing for government roles specifically.
Especially for AI policy, there are several programs that are basically like "hey, if you have AI expertise but no background in policy, we want your help." To be clear, these are often still fairly competitive, but I think it's much more about being generally capable/competent and less about having optimized your resume for policy roles.
I like the section where you list out specific things you think people should do. (One objection I sometimes hear is something like "I know that [evals/RSPs/if-then plans/misc] are not sufficient, but I just don't really know what else there is to do. It feels like you either have to commit to something tangible that doesn't solve the whole problem or you just get lost in a depressed doom spiral.")
I think your section on suggestions could be stronger by presenting more ambitious/impactful stories of comms/advocacy. I think there's something tricky about a document that has the vibe "this is the most important issue in the world and pretty much everyone else is approaching it the wrong way" and then pivots to "and the right way to approach it is to post on Twitter and talk to your friends."
My guess is that you prioritized listing things that were relatively low friction and accessible. (And tbc I do think that the world would be in better shape if more people were sharing their views and contributing to the broad discourse.)
But I think when you're talking to high-context AIS people who are willing to devote their entire career to work on AI Safety, they'll be interested in more ambitious/sexy/impactful ways of contributing.
Put differently: Should I really quit my job at [fancy company or high-status technical safety group] to Tweet about my takes, talk to my family/friends, and maybe make some website? Or are there other paths I could pursue?
As I wrote here, I think we have some of those ambitious/sexy/high-impact role models that could be used to make this pitch stronger, more ambitious, and more inspiring. EG:
One possible critique is that their suggestions are not particularly ambitious. This is likely because they're writing for a broader audience (people who haven't been deeply engaged in AI safety).
For people who have been deeply engaged in AI safety, I think the natural steelman here is "focus on helping the public/government better understand the AI risk situation."
There are at least some impactful and high-status examples of this (e.g., Hinton, Bengio, Hendrycks). I think in the last few years, for instance, most people would agree that Hinton/Bengio/Hendrycks have had far more impact in their communications/outreach/policy work than their technical research work.
And it's not just the famous people– I can think of ~10 junior or mid-career people who left technical research in the last year to help policymakers better understand AI progress and AI risk, and I think their work is likely far more impactful than if they had stayed in technical research. (And I think this is true even if I exclude technical people who are working on evals/if-then plans in govt. Like, I'm focusing on people who see their primary purpose as helping the public or policymakers develop "situational awareness", develop stronger models of AI progress and AI risk, understand the conceptual arguments for misalignment risk, etc.)
I'd also be curious to hear what your thoughts are on people joining government organizations (like the US AI Safety Institute, UK AI Safety Institute, Horizon Fellowship, etc.) Most of your suggestions seem to involve contributing from outside government, and I'd be curious to hear more about your suggestions for people who are either working in government or open to working in government.
I think there's something about Bay Area culture that can often get technical people to feel like the only valid way to contribute is through technical work. It's higher status and sexier and there's a default vibe that the best way to understand/improve the world is through rigorous empirical research.
I think this is an incorrect (or at least incomplete) frame, and I think on-the-margin it would be good for more technical people to spend 1-5 days seriously thinking about what alternative paths they could pursue in comms/policy.
I also think there are memes spreading around that you need to be some savant political mastermind genius to do comms/policy, otherwise you will be net negative. The more I meet policy people (including successful policy people from outside the AIS bubble), the more I think this narrative was, at best, an incorrect model of the world. At worst, a take that got amplified in order to prevent people from interfering with the AGI race (e.g., by granting excess status+validity to people/ideas/frames that made it seem crazy/unilateralist/low-status to engage in public outreach, civic discourse, and policymaker engagement.)
(Caveat: I don't think the adversarial frame explains everything, and I do think there are lots of people who were genuinely trying to reason about a complex world and just ended up underestimating how much policy interest there would be and/or overestimating the extent to which labs would be able to take useful actions despite the pressures of race dynamics.)
One of the common arguments in favor of investing more resources into current governance approaches (e.g., evals, if-then plans, RSPs) is that there's nothing else we can do. There's not a better alternative– these are the only things that labs and governments are currently willing to support.
The Compendium argues that there are other (valuable) things that people can do, with most of these actions focusing on communicating about AGI risks. Examples:
- Share a link to this Compendium online or with friends, and provide your feedback on which ideas are correct and which are unconvincing. This is a living document, and your suggestions will shape our arguments.
- Post your views on AGI risk to social media, explaining why you believe it to be a legitimate problem (or not).
- Red-team companies’ plans to deal with AI risk, and call them out publicly if they do not have a legible plan.
One possible critique is that their suggestions are not particularly ambitious. This is likely because they're writing for a broader audience (people who haven't been deeply engaged in AI safety).
For people who have been deeply engaged in AI safety, I think the natural steelman here is "focus on helping the public/government better understand the AI risk situation."
There are at least some impactful and high-status examples of this (e.g., Hinton, Bengio, Hendrycks). I think in the last few years, for instance, most people would agree that Hinton/Bengio/Hendrycks have had far more impact in their communications/outreach/policy work than their technical research work.
And it's not just the famous people– I can think of ~10 junior or mid-career people who left technical research in the last year to help policymakers better understand AI progress and AI risk, and I think their work is likely far more impactful than if they had stayed in technical research. (And I'm even excluding people who are working on evals/if-then plans: like, I'm focusing on people who see their primary purpose as helping the public or policymakers develop "situational awareness", develop stronger models of AI progress and AI risk, understand the conceptual arguments for misalignment risk, etc.)
I appreciated their section on AI governance. The "if-then"/RSP/preparedness frame has become popular, and they directly argue for why they oppose this direction. (I'm a fan of preparedness efforts– especially on the government level– but I think it's worth engaging with the counterarguments.)
Pasting some content from their piece below.
High-level thesis against current AI governance efforts:
The majority of existing AI safety efforts are reactive rather than proactive, which inherently puts humanity in the position of managing risk rather than controlling AI development and preventing it.
Critique of reactive frameworks:
1. The reactive framework reverses the burden of proof from how society typically regulates high-risk technologies and industries.
In most areas of law, we do not wait for harm to occur before implementing safeguards. Banks are prohibited from facilitating money laundering from the moment of incorporation, not after their first offense. Nuclear power plants must demonstrate safety measures before operation, not after a meltdown.
The reactive framework problematically reverses the burden of proof. It assumes AI systems are safe by default and only requires action once risks are detected. One of the core dangers of AI systems is precisely that we do not know what they will do or how powerful they will be before we train them. The if-then framework opts to proceed until problems arise, rather than pausing development and deployment until we can guarantee safety. This implicitly endorses the current race to AGI.
This reversal is exactly what makes the reactive framework preferable for AI companies.
Critique of waiting for warning shots:
3. The reactive framework incorrectly assumes that an AI “warning shot” will motivate coordination.
Imagine an extreme situation in which an AI disaster serves as a “warning shot” for humanity. This would imply that powerful AI has been developed and that we have months (or less) to develop safety measures or pause further development. After a certain point, an actor with sufficiently advanced AI may be ungovernable, and misaligned AI may be uncontrollable.
When horrible things happen, people do not suddenly become rational. In the face of an AI disaster, we should expect chaos, adversariality, and fear to be the norm, making coordination very difficult. The useful time to facilitate coordination is before disaster strikes.
However, the reactive framework assumes that this is essentially how we will build consensus in order to regulate AI. The optimistic case is that we hit a dangerous threshold before a real AI disaster, alerting humanity to the risks. But history shows that it is exactly in such moments that these thresholds are most contested – this shifting of the goalposts is known as the AI Effect and common enough to have its own Wikipedia page. Time and again, AI advancements have been explained away as routine processes, whereas “real AI” is redefined to be some mystical threshold we have not yet reached. Dangerous capabilities are similarly contested as they arise, such as how recent reports of OpenAI’s o1 being deceptive have been questioned.
This will become increasingly common as competitors build increasingly powerful capabilities and approach their goal of building AGI. Universally, powerful stakeholders fight for their narrow interests, and for maintaining the status quo, and they often win, even when all of society is going to lose. Big Tobacco didn’t pause cigarette-making when they learned about lung cancer; instead they spread misinformation and hired lobbyists. Big Oil didn’t pause drilling when they learned about climate change; instead they spread misinformation and hired lobbyists. Likewise, now that billions of dollars are pouring into the creation of AGI and superintelligence, we’ve already seen competitors fight tooth and nail to keep building. If problems arise in the future, of course they will fight for their narrow interests, just as industries always do. And as the AI industry gets larger, more entrenched, and more essential over time, this problem will grow rapidly worse.
TY. Some quick reactions below:
- Agree that the public stuff has immediate effects that could be costly. (Hiding stuff from the public, refraining from discussing important concerns publicly, or developing a reputation for being kinda secretive/sus can also be costly; seems like an overall complex thing to model IMO.)
- Sharing info with government could increase the chance of a leak, especially if security isn't great. I expect the most relevant info is info that wouldn't be all-that-costly if leaked (e.g., the government doesn't need OpenAI to share its secret sauce/algorithmic secrets. Dangerous capability eval results leaking or capability forecasts leaking seem less costly, except from a "maybe people will respond by demanding more govt oversight" POV.)
- I think all-in-all I still see the main cost as making safety regulation more likely, but I'm more uncertain now, and this doesn't seem like a particularly important/decision-relevant point. Will edit the OG comment to language that I endorse with more confidence.
@Zach Stein-Perlman @ryan_greenblatt feel free to ignore but I'd be curious for one of you to explain your disagree react. Feel free to articulate some of the ways in which you think I might be underestimating the costliness of transparency requirements.
(My current estimate is that whistleblower mechanisms seem very easy to maintain, reporting requirements for natsec capabilities seem relatively easy insofar as most of the information is stuff you already planned to collect, and even many of the more involved transparency ideas (e.g., interview programs) seem like they could be implemented with pretty minimal time-cost.)
Yeah, I think there's a useful distinction between two different kinds of "critiques:"
- Critique #1: I have reviewed the preparedness framework and I think the threshold for "high-risk" in the model autonomy category is too high. Here's an alternative threshold.
- Critique #2: The entire RSP/PF effort is not going to work because [they're too vague//labs don't want to make them more specific//they're being used for safety-washing//labs will break or weaken the RSPs//race dynamics will force labs to break RSPs//labs cannot be trusted to make or follow RSPs that are sufficiently strong/specific/verifiable].
I feel like critique #1 falls more neatly into "this counts as lab governance" whereas IMO critique #2 falls more into "this is a critique of lab governance." In practice the lines blur. For example, I think last year there was a lot more "critique #1" style stuff, and then over time as the list of specific object-level critiques grew, we started to see more support for things in the "critique #2" bucket.
Perhaps this isn’t in scope, but if I were designing a reading list on “lab governance”, I would try to include at least 1-2 perspectives that highlight the limitations of lab governance, criticisms of focusing too much on lab governance, etc.
Specific examples might include criticisms of RSPs, Kelsey’s coverage of the OpenAI NDA stuff, alleged instances of labs or lab CEOs misleading the public/policymakers, and perspectives from folks like Tegmark and Leahy (who generally see a lot of lab governance as safety-washing and probably have less trust in lab CEOs than the median AIS person).
(Perhaps such perspectives get covered in other units, but part of me still feels like it’s pretty important for a lab governance reading list to include some of these more “fundamental” critiques of lab governance. Especially insofar as, broadly speaking, I think a lot of AIS folks were more optimistic about lab governance 1-3 years ago than they are now.)
Henry from SaferAI claims that the new RSP is weaker and vaguer than the old RSP. Do others have thoughts on this claim? (I haven't had time to evaluate yet.)
Main Issue: Shift from precise definitions to vague descriptions.
The primary issue lies in Anthropic's shift away from precisely defined capability thresholds and mitigation measures. The new policy adopts more qualitative descriptions, specifying the capability levels they aim to detect and the objectives of mitigations, but it lacks concrete details on the mitigations and evaluations themselves. This shift significantly reduces transparency and accountability, essentially asking us to accept a "trust us to handle it appropriately" approach rather than providing verifiable commitments and metrics.
More from him:
Example: Changes in capability thresholds.
To illustrate this change, let's look at a capability threshold:
1️⃣ Version 1 (V1): AI Security Level 3 (ASL-3) was defined as "The model shows early signs of autonomous self-replication ability, as defined by a 50% aggregate success rate on the tasks listed in [Appendix on Autonomy Evaluations]."
2️⃣ Version 2 (V2): ASL-3 is now defined as "The ability to either fully automate the work of an entry-level remote-only researcher at Anthropic, or cause dramatic acceleration in the rate of effective scaling" (quantified as an increase of approximately 1000x in a year).
In V2, the thresholds are no longer defined by quantitative benchmarks. Anthropic now states that they will demonstrate that the model's capabilities are below these thresholds when necessary. However, this approach is susceptible to shifting goalposts as capabilities advance.
🔄 Commitment Changes: Dilution of mitigation strategies.
A similar trend is evident in their mitigation strategies. Instead of detailing specific measures, they focus on mitigation objectives, stating they will prove these objectives are met when required. This change alters the nature of their commitments.
💡 Key Point: Committing to robust measures and then diluting them significantly is not how genuine commitments are upheld.
The general direction of these changes is concerning. By allowing more leeway to decide if a model meets thresholds, Anthropic risks prioritizing scaling over safety, especially as competitive pressures intensify.
I was expecting the RSP to become more specific as technology advances and their risk management process matures, not the other way around.
@Drake Thomas are you interested in talking about other opportunities that might be better for the world than your current position (and meet other preferences of yours)? Or are you primarily interested in the "is my current position net positive or net negative for the world" question?
Does anyone know why Anthropic doesn't want models with powerful cyber capabilities to be classified as "dual-use foundation models?"
In its BIS comment, Anthropic proposes a new definition of dual-use foundation model that excludes cyberoffensive capabilities. This also comes up in TechNet's response (TechNet is a trade association that Anthropic is a part of).
Does anyone know why Anthropic doesn't want the cyber component of the definition to remain? (I don't think they cover this in the comment).
---
More details– the original criteria for "dual-use foundation model" proposed by BIS are:
(1) Substantially lowering the barrier of entry for non-experts to design, synthesize, acquire, or use chemical, biological, radiological, or nuclear (CBRN) weapons;
(2) Enabling powerful offensive cyber operations through automated vulnerability discovery and exploitation against a wide range of potential targets of cyberattacks; or
(3) Permitting the evasion of human control or oversight through means of deception or obfuscation.
Anthropic's definition includes criteria #1 and #3 in its definition but excludes criterion #2.
(Separately, Anthropic argues that dual-use foundation models should be defined as those that pose catastrophic risks as opposed to serious risks to national security. This is important too, but I'm less confused about why Anthropic wants this.)
Where does he say this? (I skimmed and didn’t see it.)
Link here: https://www.astralcodexten.com/p/sb-1047-our-side-of-the-story
48 entities gave feedback on the Department of Commerce AI reporting requirements.
Public comments offering feedback on BIS's proposed reporting requirements are now up! BIS received responses from 48 entities, including OpenAI, Anthropic, and many AI safety groups.
The reporting requirements are probably one of the most important things happening in US AI policy-- I'd encourage folks here to find time to skim some of the comments.
@ryan_greenblatt can you say more about what you mean by this one?
- We can't find countermeasures such that our control evaluations indicate any real safety.
I appreciate this distinction between the different types of outcomes-- nice.
I'm pretty interested in people's forecasts for each. So the revised questions would be:
- Evals are trustworthy but say that the model is no longer safe (or rather something like "there is a non-trivial probability that the model may be dangerous/scheming."). Using your best judgment, at what point do you expect sabotage evaluations will reveal that a model might not be safe (we're not confident that it's not scheming)? At what capability threshold do you expect we would no longer be able to say "yup, we are confident that the model is not scheming"? (Feel free to use heuristics like the OpenAI PF capability thresholds or ASL levels).
- Evals are no longer trustworthy and we notice. Using your best judgment, at what point do you expect sabotage evaluations will no longer be useful evals for determining whether or not a model is scheming? At what capability threshold do you expect we would no longer be able to say "yup, we are confident that these evals are trustworthy/useful"?
- Evals are no longer trustworthy and we do not notice. Using your best judgment, how likely is it that sabotage evaluations are undermined such that humans think the model can be deployed safely, but actually the humans were misled? At what capability threshold would you worry about this form of undermining?
@David Duvenaud would be curious about your takes, even if they're just speculative guesses/predictions.
(And @ryan_greenblatt would be curious for your takes but replace "sabotage evaluations" with "control evaluations." In the event that you've written this up elsewhere, feel free to link it– helpful for me if you quote the particular section.)
I think I agree with this idea in principle, but I also feel like it misses some things in practice (or something). Some considerations:
- I think my bar for "how much I trust a lab such that I'm OK with them not making transparency commitments" is fairly high. I don't think any existing lab meets that bar.
- I feel like a lot of forms of helpful transparency are not that costly.
[EDIT: I think one of the most noteworthy costs is something like "maybe the government will end up regulating the sector if/when it understands how dangerous industry people expect AI systems to be and how many safety/security concerns they have".] But I think things like "report dangerous stuff to the govt", "have a whistleblower mechanism", and even "make it clear that you're willing to have govt people come and ask about safety/security concerns" don't seem very costly from an immediate time/effort perspective.
- If a Responsible Company implemented transparency stuff unilaterally, it would make it easier for the government to have proof-of-concept and implement the same requirements for other companies. In a lot of cases, showing that a concept works for company X (and that company X actually thinks it's a good thing) can reduce a lot of friction in getting things applied to companies Y and Z.
I do agree that some of this depends on the type of transparency commitment and there might be specific types of transparency commitments that don't make sense to pursue unilaterally. Off the top of my head, I can't think of any transparency requirements that I wouldn't want to see implemented unilaterally, and I can think of several that I would want to see (e.g., dangerous capability reports, capability forecasts, whistleblower mechanisms, sharing if-then plans with govt, sharing shutdown plans with govt, setting up interview program with govt, engaging publicly with threat models, having clear OpenAI-style tables that spell out which dangerous capabilities you're tracking/expecting).
Some ideas relating to comms/policy:
- Communicate your models of AI risk to policymakers
- Help policymakers understand emergency scenarios (especially misalignment scenarios) and how to prepare for them
- Use your lobbying/policy teams primarily to raise awareness about AGI and help policymakers prepare for potential AGI-related global security risks.
- Develop simple/clear frameworks that describe which dangerous capabilities you are tracking (I think OpenAI's preparedness framework is a good example, particularly RE simplicity/clarity/readability.)
- Advocate for increased transparency into frontier AI development through measures like stronger reporting requirements, whistleblower mechanisms, embedded auditors/resident inspectors, etc.
- Publicly discuss threat models (kudos to DeepMind)
- Engage in public discussions/debates with people like Hinton, Bengio, Hendrycks, Kokotajlo, etc.
- Encourage employees to engage in such discussions/debates, share their threat models, etc.
- Make capability forecasts public (predictions for when models would have XYZ capabilities)
- Communicate under what circumstances you think major government involvement would be necessary (e.g., nationalization, "CERN for AI" setups).
One thing I appreciate about Buck/Ryan's comms around AI control is that they explicitly acknowledge that they believe control will fail for sufficiently intelligent systems. And they try to describe the capability threshold at which they suspect control will stop working (e.g., here).
For those working on sabotage evaluations: At what capability threshold do you think the sabotage/sandbagging evaluations will no longer work? (Or do you think that these sabotage evaluations + modified versions of them will scale to arbitrarily-capable systems?)
@Max Tegmark my impression is that you believe that some amount of cooperation between the US and China is possible. If the US takes steps that show that it is willing to avoid an AGI race, then there's some substantial probability that China will also want to avoid an AGI race. (And perhaps there could be some verification methods that support a "trust but verify" approach to international agreements.)
My main question: Are there circumstances under which you would no longer believe that cooperation is possible & you would instead find yourself advocating for an entente strategy?
When I look at the world right now, it seems to me like there's so much uncertainty around how governments will react to AGI that I think it's silly to throw out the idea of international coordination. As you mention in the post, there are also some signs that Chinese leaders and experts are concerned about AI risks.
It seems plausible to me that governments could– if they were sufficiently concerned about misalignment risks and believed in the assumptions behind calling an AGI race a "suicide race"– end up reaching cooperative agreements and pursuing some alternative to the "suicide race".
But suppose for sake of argument that there was compelling evidence that China was not willing to cooperate with the US. I don't mean the kind of evidence we have now, and I think we both probably agree that many actors will have incentives to say "there's no way China will cooperate with us" even in the absence of strong evidence. But if such evidence emerged, what do you think the best strategy would be from there? If hypothetically it became clear that China's leadership were essentially taking an e/acc approach and were really truly interested in getting AGI ~as quickly as possible, what do you think should be done?
I ask partially because I'm trying to think more clearly about these topics myself. I think my current viewpoint is something like:
- In general, the US should avoid taking actions that make a race with China more likely or inevitable.
- The primary plan should be for the US to engage in good-faith efforts to pursue international coordination, aiming toward a world where there are verifiable ways to avoid the premature development of AGI.
- We could end up in a scenario in which the prospect of international coordination has fallen apart. (e.g., China or some other major US adversary adopts a very "e/acc mindset" and seems to be gunning toward AGI with safety plans that are considerably worse than those proposed by the US.) At this point, it seems to me like the US would either have to (a) try to get to AGI before the adversary [essentially the Entente plan] or (b) give up and just kinda hope that the adversary ends up changing course as they get closer to AGI. Let's call this "world #3"[1].
Again, I think a lot of folks will have strong incentives to try to paint us as being in world #3, and I personally don't think we have enough evidence to say "yup, we're so confident we're in world #3 that we should go with an entente strategy." But I'm curious if you've thought about the conditions under which you'd conclude that we are quite confidently in world #3 and what you think we should do from there.
[1]
I sometimes think about the following situations:
World #1: Status quo; governments are not sufficiently concerned; corporations race to develop AGI
World #2: Governments become quite concerned about AGI and pursue international coordination
World #3: Governments become quite concerned about AGI but there is strong evidence that at least one major world power is refusing to cooperate//gunning toward AGI.
I largely agree with this take & also think that people often aren't aware of some of GDM's bright spots from a safety perspective. My guess is that most people overestimate the degree to which ANT>GDM from a safety perspective.
For example, I think GDM has been thinking more about international coordination than ANT. Demis has said that he supports a "CERN for AI" model, and GDM's governance team (led by Allan Dafoe) has written a few pieces about international coordination proposals.
ANT has said very little about international coordination. It's much harder to get a sense of where ANT's policy team is at. My guess is that they are less enthusiastic about international coordination relative to GDM and more enthusiastic about things like RSPs, safety cases, and letting scaling labs continue unless/until there is clearer empirical evidence of loss of control risks.
I also think GDM deserves some praise for engaging publicly with arguments about AGI ruin and threat models.
(On the other hand, GDM is ultimately controlled by Google, which makes it unclear how important Demis's opinions or Allan's work will be. Also, my impression is that Google was neutral or against SB1047, whereas ANT eventually said that the benefits outweighed the costs.)
What do you think was underrated about it? I think when I read it I have some sort of "yeah, this makes sense" reaction but am not "wow'd" by it.
It seems like the deeper challenge is figuring out how to align incentives. Can we find a structure where labs want to EG give white-box access to a bunch of external researchers and give them a long time to red-team models while somehow also maintaining the independence of the white-box auditors? How do you avoid industry capture?
The same kinds of challenges come up with safety research– how do you give labs the incentive to publish safety research that makes their product or their approach look bad? How do you avoid publication bias and p-hacking-type concerns?
I don't think your post is obligated to get into those concerns, but perhaps a post that grappled with those concerns would be something I'd be "wow'd" by, if that makes sense.
It's quite hard to summarize AI governance in a few readings. With that in mind, here are some AI governance ideas/concepts/frames that I would add:
- Emergency Preparedness (Wasil et al; exec summary + policy proposals - 3 mins)
Governments should invest in strategies that can help them detect and prepare for time-sensitive AI risks. Governments should have ways to detect threats that would require immediate intervention & have preparedness plans for how they can effectively respond to various acute risk scenarios.
- Safety cases (Irving - 3 mins; see also Clymer et al)
Labs should present arguments that AI systems are safe within a particular training or deployment context.
(Others that I don't have time to summarize but still want to include:)
- Policy ideas for mitigating AI risk (Larsen)
- Hardware-enabled governance mechanisms (Kulp et al)
- Verification methods for international AI agreements (Wasil et al)
- Situational awareness (Aschenbrenner)
- A Narrow Path (Miotti et al)
@Buck do you or Ryan have a writeup that includes: (a) a description of the capabilities of a system that you think would be able to do something useful for the sorts of objectives that Habryka talks about, and (b) what that something useful is?
Bonus points if it has (c) the likelihood that you think such a system will be controllable by 20XX and (d) what kind of control setup you think would be required to control it.
Adding my two cents as someone who has a pretty different lens from Habryka but has still been fairly disappointed with OpenPhil, especially in the policy domain.
Relative to Habryka, I am generally more OK with people "playing politics". I think it's probably good for AI safety folks to exhibit socially-common levels of "playing the game"– networking, finding common ground, avoiding offending other people, etc. I think some people in the rationalist sphere have a very strong aversion to some things in this genre, and labels like "power-seeking" and "deceptive" get thrown around too liberally. I also think I'm pretty OK with OpenPhil deciding it doesn't want to fund certain parts of the rationalist ecosystem (and probably less bothered than Habryka about how their comms around this wasn't direct/clear).
In that sense, I don't penalize OP much for trying to "play politics" or for breaking deontological norms. Nonetheless, I still feel pretty disappointed with them, particularly for their impact on comms/policy. Some thoughts here:
- I agree with Habryka that it is quite bad that OP is not willing to fund right-coded things. Even many of the "bipartisan" things funded by OP are quite left-coded. (As a useful heuristic, whenever you hear of someone launching a bipartisan initiative, I think one should ask "what % of the staff of this organization is Republican?" Obviously just a heuristic– there are some cases in which a 90%-Dem staff can actually truly engage in "real" bipartisan efforts. But in some cases, you will have a 90%-Dem staff claiming to be interested in bipartisan work without any real interest in Republican ideas, few if any Republican contacts, and only a cursory understanding of Republican stances.)
- I also agree with Habryka that OP seems overly focused on PR risks and not doing things that are weird/controversial. "To be a longtermist grantee these days you have to be the kind of person that OP thinks is not and will not be a PR risk, IE will not say weird or controversial stuff" sounds pretty accurate to me. OP cannot publicly admit this because this would be bad for its reputation– instead, it operates more subtly.
- Separately, I have seen OpenPhil attempt to block or restrain multiple efforts in which people were trying to straightforwardly explain AI risks to policymakers. My understanding is that OpenPhil would say that they believed the messengers weren't the right people (e.g., too inexperienced), and they thought the downside risks were too high. In practice, there are some real tradeoffs here: there are often people who seem to have strong models of AGI risk but little/no policy experience, and sometimes people who have extensive policy experience but only recently started engaging with AI/AGI issues. With that in mind, I think OpenPhil has systematically made poor tradeoffs here and failed to invest in (or in some cases, actively blocked) people who were willing to be explicit about AGI risks, loss of control risks, capability progress, and the need for regulation. (I also think the "actively blocking" thing has gotten less severe over time, perhaps in part because OpenPhil changed its mind a bit on the value of direct advocacy or perhaps because OpenPhil just decided to focus its efforts on things like research, and advocacy projects found funding elsewhere.)
- I think OpenPhil has an intellectual monoculture and puts explicit/implicit cultural pressure on people in the OP orbit to "stay in line." There is a lot of talk about valuing people who can think for themselves, but I think the groupthink problems are pretty real. There is a strong emphasis on "checking-in" with people before saying/doing things, and the OP bubble is generally much more willing to criticize action than inaction. I suspect that something like the CAIS statement or even a lot of the early Bengio comms would not have occurred if Dan Hendrycks or Yoshua were deeply ingrained in the OP orbit. It is both the case that they would've been forced to write 10+ page Google Docs defending their theories of change and the case that the intellectual culture simply wouldn't have fostered this kind of thinking.
- I think the focus on evals/RSPs can largely be explained by a bias toward trusting labs. OpenPhil steered a lot of talent toward the evals/RSPs theory of change (specifically, if I recall correctly, OpenPhil leadership on AI was especially influential in steering a lot of the ecosystem to support and invest in the evals/RSPs theory of change.) I expect that when we look back in a few years, there will be a pretty strong feeling that this was the wrong call & that this should've been more apparent even without the benefit of hindsight.
- I would be more sympathetic to OpenPhil in a world where their aversion to weirdness/PR risks resulted in them having a strong reputation, a lot of political capital, and real-world influence that matched the financial resources they possess. Sadly, I think we're in a "lose-lose" world: OpenPhil's reputation tends to be poor in many policy/journalism circles even while OpenPhil pursues a strategy that seems to be largely focused on avoiding PR risks. I think some of this is unjustified (e.g., a result of propaganda campaigns designed to paint anyone who cares about AI risk as awful). But then some of it actually is kind of reasonable (e.g., impartial observers viewing OpenPhil as kind of shady, not direct in its communications, not very willing to engage directly or openly with policymakers or journalists, having lots of conflicts of interests, trying to underplay the extent to which its funding priorities are influenced/constrained by a single Billionaire, being pretty left-coded, etc.)
To defend OpenPhil a bit, I do think it's quite hard to navigate trade-offs and I think sometimes people don't seem to recognize these tradeoffs. In AI policy, I think the biggest tradeoff is something like "lots of people who have engaged with technical AGI arguments and AGI threat models don't have policy experience, and lots of people who have policy experience don't have technical expertise or experience engaging with AGI threat models" (this is a bit of an oversimplification– there are some shining stars who have both.)
I also think OpenPhil folks probably tend to have a different probability distribution over threat models (compared to me and probably also Habryka). For instance, it seems likely to me that OpenPhil employees operate in more of a "there are a lot of ways AGI could play out and a lot of uncertainty– we just need smart people thinking seriously about the problem. And who really know how hard alignment will be, maybe Anthropic will just figure it out" lens and less of a "ASI is coming and our priority needs to be making sure humanity understands the dangers associated with a reckless race toward ASI, and there's a substantial chance that we are seriously not on track to solve the necessary safety and security challenges unless we fundamentally reorient our whole approach" lens.
And finally, I think despite these criticisms, OpenPhil is also responsible for some important wins (e.g., building the field, raising awareness about AGI risk on university campuses, funding some people early on before AI safety was a "big deal", jumpstarting the careers of some leaders in the policy space [especially in the UK]. It's also plausible to me that there are some cases in which OpenPhil gatekeeping was actually quite useful in preventing people from causing harm, even though I probably disagree with OpenPhil about the # and magnitude of these cases).
Today we have enough bits to have a pretty good guess when where and how superintelligence will happen.
@Alexander Gietelink Oldenziel can you expand on this? What's your current model of when/where/how superintelligence will happen?
Recent Senate hearing includes testimony from Helen Toner and William Saunders.
- Both statements are explicit about AGI risks & emphasize the importance of transparency & whistleblower mechanisms.
- William's statement acknowledges that he and others doubt that OpenAI's safety work will be sufficient.
- "OpenAI will say that they are improving. I and other employees who resigned doubt they will be ready in time. This is true not just with OpenAI; the incentives to prioritize rapid development apply to the entire industry. This is why a policy response is needed."
- Helen's statement provides an interesting paragraph about China at the end.
- "A closing note on China: The specter of ceding U.S. technological leadership to China is often treated as a knock-down argument against implementing regulations of any kind. Based on my research on the Chinese AI ecosystem and U.S.-China technology competition more broadly, I think this argument is not nearly as strong as it seems at first glance. We should certainly be mindful of how regulation can affect the pace of innovation at home, and keep a close eye on how our competitors and adversaries are developing and using AI. But looking in depth at Chinese AI development, the AI regulations they are already imposing, and the macro headwinds they face leaves me with the conclusion that they are far from being poised to overtake the United States.6 The fact that targeted, adaptive regulation does not have to slow down U.S. innovation—and in fact can actively support it—only strengthens this point."
Full hearing here (I haven't watched it yet.)
Hm, good question. I think it should be proportional to the amount of time it would take to investigate the concern(s).
For this, I think 1-2 weeks seems reasonable, at least for an initial response.