I think you might first reach wildly superhuman AI via scaling up some sort of machine learning (and most of that is something well described as deep learning). Note that I said "needed". So, I would also count it as acceptable to build the AI with deep learning to allow for current tools to be applied even if something else would be more competitive.
(Note that I was responding to "between now and superintelligence", not claiming that this would generalize to all superintelligences built in the future.)
I agree that literal jupiter brains will very likely be built using something totally different than machine learning.
"fully handing over all technical and strategic work to AIs which are capable enough to obsolete humans at all cognitive tasks"
Suppose we replace "AIs" with "aliens" (or even, some other group of humans). Do you agree that doesn't (necessarily) kill you due to slop if you don't have a full solution to the superintelligence alignment problem?
I would say it is basically-always true, but there are some fields (including deep learning today, for purposes of your comment) where the big hard central problems have already been solved, and therefore the many small pieces of progress on subproblems are all of what remains.
Maybe, but it is interesting to note that:
- A majority of productive work is occurring on small subproblems even if some previous paradigm change was required for this.
- For many fields (e.g., deep learning), many people didn't recognize (and potentially still don't recognize!) that the big hard central problem was already solved. This implies it might be non-obvious whether the central problem has been solved, and that betting on an existing paradigm which doesn't obviously suffice can be reasonable.
Things feel more continuous to me than your model suggests.
And insofar as there remains some problem which is simply not solvable within a certain paradigm, that is a "big hard central problem", and progress on the smaller subproblems of the current paradigm is unlikely by-default to generalize to whatever new paradigm solves that big hard central problem.
It doesn't seem clear to me this is true in AI safety at all, at least for non-worst-case AI safety.
I claim it is extremely obvious and very overdetermined that this will occur in AI safety sometime between now and superintelligence.
Yes, I added "prior to human obsolescence" (which is what I meant).
Depending on what you mean by "superintelligence", this isn't at all obvious to me. It's not clear to me we'll have (or will "need") new paradigms before fully handing over all technical and strategic work to AIs which are capable enough to obsolete humans at all cognitive tasks. Doing this handover doesn't directly require understanding whether the AI is specifically producing alignment work that generalizes. For instance, the system might pursue routes other than alignment work, and we might determine its judgement/taste/epistemics/etc. are good enough based on examining things other than alignment research intended to generalize beyond the current paradigm.
If by superintelligence, you mean wildly superhuman AI, it remains non-obvious to me that new paradigms are needed (though I agree they will pretty likely arise prior to this point due to AIs doing vast quantities of research if nothing else). I think thoughtful and laborious implementation of current-paradigm strategies (including substantial experimentation) could directly reduce risk from handing off to superintelligence down to perhaps 25%, and I could imagine being argued considerably lower.
This post seems to assume that research fields have big, hard central problems that are solved with some specific technique or paradigm.
This isn't always true. Many fields have the property that most of the work is on making small components work slightly better in ways that are very interoperable and don't have complex interactions. For instance, consider the case of making AIs more capable in the current paradigm. There are many different subcomponents which are mostly independent and interact mostly multiplicatively (see the toy illustration after this list):
- Better training data: This is extremely independent: finding some source of better data or better data filtering can be basically arbitrarily combined with other work on constructing better training data. That's not to say this parallelizes perfectly (given that work on filtering or curation might obsolete some prior piece of work), but just to say that marginal work can often just myopically improve performance.
- Better architectures: This breaks down into a large number of mostly independent categories that typically don't interact non-trivially:
- All of attention, MLPs, and positional embeddings can be worked on independently.
- A bunch of hyperparameters can be better understood in parallel
- Better optimizers and regularization (often insights within a given optimizer like AdamW can be mixed into other optimizers)
- Often larger scale changes (e.g., mamba) can incorporate many or most components from prior architectures.
- Better optimized kernels / code
- Better hardware
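To illustrate the "mostly independent, mostly multiplicative" point above, here is a toy illustration only; the specific multipliers below are made up, not estimates:

```python
import math

# Hypothetical effective-compute multipliers from independent lines of work:
improvements = {
    "better data filtering": 1.3,
    "architecture tweak": 1.15,
    "better optimizer": 1.2,
    "faster kernels": 1.25,
}

# Because the components interact roughly multiplicatively, marginal progress on
# any one of them compounds with all the others rather than being invalidated:
total = math.prod(improvements.values())
print(f"combined effective-compute multiplier: ~{total:.2f}x")  # ~2.24x
```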
Other examples of fields like this include: medicine, mechanical engineering, education, SAT solving, and computer chess.
I agree that paradigm shifts can invalidate large amounts of prior work (and this has occurred at some point in each of the fields I list above), but it isn't obvious whether this will occur in AI safety prior to human obsolescence. In many fields, this doesn't occur very often.
Ok, I think what is going on here is maybe that the constant you're discussing here is different from the constant I was discussing. I was trying to discuss the question of how much worse serial labor is than parallel labor, but I think the lambda you're talking about takes into account compute bottlenecks and similar?
Not totally sure.
Lower lambda. I'd now use more like lambda = 0.4 as my median. There's really not much evidence pinning this down; I think Tamay Besiroglu thinks there's some evidence for values as low as 0.2.
Isn't this really implausible? This implies that if you had 1000 researchers/engineers of average skill at OpenAI doing AI R&D, this would be as good as having one average-skill researcher running at 16x speed (1000^0.4 ≈ 16). It does seem very slightly plausible that having someone as good as the best researcher/engineer at OpenAI run at 16x speed would be competitive with OpenAI, but that isn't what this term is computing. 0.2 is even more crazy, implying that 1000 researchers/engineers is as good as one researcher/engineer running at 4x speed!
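(Spelling out the arithmetic behind these numbers, under the reading that N parallel workers aggregate to roughly N^lambda serial-equivalent workers:)

```python
# If N parallel workers are worth ~N**lam workers in series, then 1000 average
# researchers correspond to one average researcher sped up by:
for lam in (0.4, 0.2):
    speedup = 1000 ** lam
    print(f"lambda = {lam}: 1000 workers ~= one worker at {speedup:.1f}x speed")
# lambda = 0.4: 1000 workers ~= one worker at 15.8x speed
# lambda = 0.2: 1000 workers ~= one worker at 4.0x speed
```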
I agree, but it is important to note that the authors of the paper disagree here.
(It's somewhat hard for me to tell if the crux is more that they don't expect that everyone would get AI aligned to them (at least as representatives) even if this was technically feasible with zero alignment tax, or if the crux is that even if everyone had single-single aligned corrigible AIs representing their interests and with control over their assets and power, that would still result in disempowerment. I think it is more like the second thing here.)
So Zvi is accurately representing the perspective of the authors, I just disagree with them.
Yes, but random people can't comment or post on the alignment forum and in practice I find that lots of AI relevant stuff doesn't make it there (and the frontpage is generally worse than my lesswrong frontpage after misc tweaking).
TBC, I'm not really trying to make a case that something should happen here, just trying to quickly articulate why I don't think the alignment forum fully addresses what I want.
[Mildly off topic]
I think it would be better if the default was that LW is a site about AI and longtermist cause areas and other stuff was hidden by default. (Or other stuff is shown by default on rat.lesswrong.com or whatever.)
Correspondingly, I wouldn't like penalizing multiple of the same tag.
I think the non-AI stuff is less good for the world than the AI stuff and there are downsides in having LW feature non-AI stuff from my perspective (e.g., it's more weird and random from the perspective of relevant people).
I don't disagree. I assumed Raemon intended something more elaborate than just a salient button with this effect.
Fair. For reference, here are my selections, which I think are a good default strategy for people who just come to LW for AI/AI safety reasons:
(Why "-" a bunch of stuff rather than "+" AI? Well, because "+" adds some fixed karma while "-" multiplies by a constant (less than 1), and I don't think adding karma is a good strategy (as it shows you really random posts often). I do like minus-ing these specific things also. If I could, I'd probably do AI * 1.5 and then reduce the negative weight on these things a bit.)
So this might be a good suggestion for an "AI focus" preset.
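(For concreteness, here's a toy sketch of the filter math as I described it above; this is my reading of the behavior, not LessWrong's actual implementation.)

```python
def adjusted_karma(karma, plus_boost=0.0, minus_multipliers=()):
    """'+' on a tag adds a fixed karma bonus; each '-' multiplies by a constant < 1."""
    score = karma + plus_boost
    for m in minus_multipliers:
        score *= m
    return score

# A high-karma post hit by two "-" filters vs. a modest AI post with no filters:
print(adjusted_karma(200, minus_multipliers=(0.5, 0.5)))  # 50.0
print(adjusted_karma(80))                                  # 80.0
```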
I don't understand. Can't people just hide posts tagged as AI?
Another way to put this is that strategy stealing might not work due to technical alignment difficulties or for other reasons and I'm not sold the other reasons I've heard so far are very lethal. I do think the situation might really suck though with e.g. tons of people dying of bioweapons and with some groups that aren't sufficiently ruthless or which don't defer enough to AIs getting disempowered.
Yes, it is relevant for untrusted monitoring. I think the basic version of this is pretty resolvable via:
- Paraphrasing (such that exact "I generated this text" matching doesn't work). Note that we just need to paraphrase stuff sent to the monitor, not the actual action / reasoning.
- Using a different fine-tune for the untrusted monitor so it doesn't have the exact same probability distribution.
We discuss this in more detail here.
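As a rough illustration of the paraphrasing idea (a minimal sketch; `paraphrase` and `untrusted_monitor` are hypothetical stand-ins, e.g., a trusted weaker model and the untrusted model being used as a monitor):

```python
def check_action(action_text, paraphrase, untrusted_monitor):
    """Score a proposed action for suspiciousness using an untrusted monitor.

    Only the text shown to the monitor is paraphrased (the action actually
    executed is unchanged), so exact-string signals like "I generated this
    text" are destroyed while the semantic content needed for monitoring
    is preserved.
    """
    paraphrased = paraphrase(action_text)    # e.g., done by a trusted weaker model
    return untrusted_monitor(paraphrased)    # suspiciousness score, e.g., in [0, 1]
```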
I predict that group A survives, but the humans are no longer in power. I think this illustrates the basic dynamic. ETA: Do you understand what I'm getting at? Can you explain what you think is wrong with thinking of it this way?
I think something like this is a reasonable model but I have a few things I'd change.
Whichever group has more power at the end of the week survives.
Why can't both groups survive? Why is it winner takes all? Can we just talk about the relative change in power over the week? (As in, how much does the power of B reduce relative to A, and is this going to be an ongoing trend or is it a one-time reduction?)
Probably I'd prefer talking about 2 groups at the start of the singularity. As in, suppose there are two AI companies "A" and "B" where "A" just wants AI systems descended from them to have power and "B" wants to maximize the expected resources under control of humans in B. We'll suppose that the government and other actors do nothing for simplicity. If they start in the same spot, does "B" end up with substantially less expected power? To make this more realistic (as might be important), we'll say that "B" has a random lead/disadvantage uniformly distributed between (e.g.) -3 and 3 months so that winner-takes-all dynamics aren't a crux.
The humans in B ask their AI to make B as powerful as possible at the end of the week, subject to the constraint that the humans in B are sure to stay in charge.
What about if the humans in group B ask their AI to make them (the humans) as powerful as possible in expectation?
Supposing you're fine with these changes, then my claim would be:
- If alignment is solved, then the AI representing B can powerseek in exactly the same way as the AI representing A does while still deferring to the humans on long-run resource usage and still devoting a tiny fraction of resources toward physically keeping the humans alive (which is very cheap, at least once AIs are very powerful). Thus, the cost for B is negligible and B barely loses any power relative to its initial position. If it is winner takes all, B has almost a 50% chance of winning.
- If alignment isn't solved, the strategy for B will involve spending a subset of resources on trying to solve alignment. I think alignment is reasonably likely to be practically feasible such that spending a month of delay to work specifically on safety/alignment (over what A does for commercial reasons) might get B a 50% chance of solving alignment or ending up in a (successful) basin where AIs are trying to actively retain human power / align themselves better. (A substantial fraction of this is via deferring to some AI system of dubious trustworthiness because you're in a huge rush. Yes, the AI systems might fail to align their successors, but this still seems like a one-time haircut from my perspective.) So, if it is winner takes all, (naively) B wins in 2 / 6 * 1 / 2 = 1 / 6 of worlds, which is 3x worse than the original 50% baseline (see the toy calculation below). (2 / 6 is because they delay for a month.) But, the issue I'm imagining here wasn't gradual disempowerment! The issue was that B failed to align their AIs and people at A didn't care at all about retaining control. (If people at A did care, then coordination is in principle possible, but might not work.)
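To make the arithmetic in the second bullet explicit (a toy calculation under the stated assumptions: B's lead is uniform on [-3, 3] months, B burns one month on safety, and that month gives a 50% chance of solving alignment):

```python
# Without any delay, B wins (winner-takes-all) iff its lead > 0: probability 1/2.
# If B spends 1 month on safety, B wins iff its lead > 1 month: (3 - 1) / 6 = 2/6.
p_win_given_delay = (3 - 1) / 6      # = 2/6
p_alignment_solved = 0.5
p_good_outcome = p_win_given_delay * p_alignment_solved
print(p_good_outcome)                # 1/6 ~= 0.17, vs. the 1/2 no-delay baseline
```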
I think a crux is that you think there is a perpetual alignment tax while I think a one time tax gets you somewhere.
At a more basic level, when I think about what goes wrong in these worlds, it doesn't seem very likely to be well described as gradual disempowerment? (In the sense described in the paper.) The existence of an alignment tax doesn't imply gradual disempowerment. A scenario I find more plausible is that you get value drift (unless you pay a long-lasting alignment tax that is substantial), but I don't think the actual problem will be well described as gradual disempowerment in the sense described in the paper.
(I don't think I should engage more on gradual disempowerment for the time being unless someone wants to bid for this or trade favors for this or similar. Sorry.)
I view the world today as highly dysfunctional in many ways: corruption, coordination failures, preference falsification, coercion, inequality, etc. are rampant. This state of affairs both causes many bad outcomes and many aspects are self-reinforcing. I don't expect AI to fix these problems; I expect it to exacerbate them.
Sure, but these things don't result in non-human entities obtaining power, right? Like, usually these are somewhat negative sum, but mostly just involve inefficient transfer of power. I don't see why these mechanisms would on net transfer power from human control of resources to some other control of resources in the long run. To consider the most extreme case, why would these mechanisms result in humans or human-appointed successors not having control of what compute is doing in the long run?
- Asking AI for advice instead of letting it take decisions directly seems unrealistically uncompetitive. When we can plausibly simulate human meetings in seconds it will be organizational suicide to take hours-to-weeks to let the humans make an informed and thoughtful decision.
I wasn't saying people would ask for advice instead of letting AIs run organizations, I was saying they would ask for advice at all. (In fact, if the AI is single-single aligned to them in a real sense and very capable, it's even better to let that AI make all decisions on your behalf than to get advice. I was saying that even if no one bothers to have a single-single aligned AI representative, they could still ask AIs for advice and unless these AIs are straightforwardly misaligned in this context (e.g., they intentionally give bad advice or don't try at all without making this clear) they'd get useful advice for their own empowerment.)
- The idea that decision-makers who "think a governance structure will yield total human disempowerment" will "do something else" also seems quite implausible. Such decision-makers will likely struggle to retain power. Decision-makers who prioritize their own "power" (and feel empowered even as they hand off increasing decision-making to AI) and their immediate political survival above all else will be empowered.
I'm claiming that it will selfishly (in terms of personal power) be in their interests to not have such a governance structure and instead have a governance structure which actually increases or retains their personal power. My argument here isn't about coordination. It's that I expect individual powerseeking to suffice for individuals not losing their power.
I think this is the disagreement: I expect that selfish/individual powerseeking without any coordination will still result in (some) humans having most power in the absence of technical misalignment problems. Presumably your view is that the marginal amount of power anyone gets via powerseeking is negligible (in the absence of coordination). But, I don't see why this would be the case. Like, all shareholders/board members/etc. want to retain their power and thus will vote accordingly, which naively will retain their power unless they make a huge error from their own powerseeking perspective. Wasting some resources on negative-sum dynamics isn't a crux for this argument unless you can argue this will waste a substantial fraction of all human resources in the long run?
This isn't at all an airtight argument to be clear; you can in principle have an equilibrium where if everyone powerseeks (without coordination) everyone gets negligible resources due to negative externalities (that result in some other non-human entity getting power) even if technical misalignment is solved. I just don't see a very plausible case for this and I don't think the paper makes this case.
Handing off decision making to AIs is fine---the question is who ultimately gets to spend the profits.
If your claim is "insufficient cooperation and coordination will result in racing to build and hand over power to AIs which will yield bad outcomes due to misaligned AI powerseeking, human power grabs, usage of WMDs (e.g., extreme proliferation of bioweapons yielding an equilibrium where bioweapon usage is likely), and extreme environmental negative externalities due to explosive industrialization (e.g., literally boiling earth's oceans)" then all of these seem at least somewhat plausible to me, but these aren't the threat models described in the paper and of this list only misaligned AI powerseeking seems like it would very plausibly result in total human disempowerment.
More minimally, the mitigations discussed in the paper mostly wouldn't help with these threat models IMO.
(I'm skeptical of insufficient coordination by the time industry is literally boiling the oceans on earth. I also don't think usage of bioweapons is likely to cause total human disempowerment except in combination with misaligned AI takeover---why would it kill literally all humans? TBC, I think >50% of people dying during the singularity due to conflict (between humans or with misaligned AIs) is pretty plausible even without misalignment concerns and this is obviously very bad, but it wouldn't yield total human disempowerment.)
I do agree that there are problems other than AI misalignment, including that the default distribution of power might be problematic, people might not carefully contemplate what to do with vast cosmic resources (and thus use them poorly), people might go crazy due to super-persuasion or other cultural forces, society might generally have poor epistemics due to training AIs to have poor epistemics or insufficiently deferring to AIs, and many people might die in conflict due to very rapid tech progress.
The way the isoFLOPs are shaped suggests that 90-95% sparsity might turn out to be optimal here, that is you can only get worse loss with 98+% sparsity with 1e20 FLOPs, however you vary the number of epochs and active params!
I'm currently skeptical and more minimally, I don't understand the argument you're making. Probably not worth getting into.
I do think there will be a limit to how sparse you want to go even in the regime where compute is very high relative to data, for various reasons (computational if nothing else). I don't see how these graphs support 90-95% sparsity, but I had a hard time understanding your argument.
Regardless, I don't think this argues against my claim; I'm not sure if you were trying to argue against the claim I was making or to add context. (Insofar as your argument is true, it does limit the returns from MoE in the regime with little data.)
It did OK at control.
So compute optimal MoEs don't improve data efficiency, don't contribute to mitigating data scarcity.
I agree compute optimal MoEs don't improve data utilization. But, naively you might expect that MoEs can be used to reduce issues with data scarcity at a fixed level of compute by training a much bigger model on a fixed amount of data.
As in, because there are returns to both more data and bigger models, you can use MoE to effectively use a much bigger model at the same compute.
Like, maybe you would have trained llama-3-405B on 15T tokens. You could instead train an 8 trillion parameter model with 400B active params on 15T tokens and a priori this could perform much better on that same amount of data. (In practice an MoE with X active params is more expensive to train than a dense model with X active params, so you might need to reduce active params somewhat.)
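As a rough sketch of the arithmetic (using the standard ~6 * active params * tokens approximation for training FLOPs; the parameter counts are just the illustrative numbers from above, and the overhead MoE adds per active param is ignored):

```python
def train_flops(active_params, tokens):
    # Rough standard approximation: ~6 FLOPs per active parameter per token.
    return 6 * active_params * tokens

tokens = 15e12                          # 15T tokens
dense = train_flops(405e9, tokens)      # dense 405B model
moe = train_flops(400e9, tokens)        # 8T-total-param MoE with 400B active params
print(f"dense: {dense:.2e} FLOPs")      # ~3.6e25
print(f"MoE:   {moe:.2e} FLOPs")        # ~3.6e25, despite ~20x more total params
```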
This is overall output throughput, not latency (which would be output tokens per second for a single context).
a single server with eight H200 GPUs connected using NVLink and NVLink Switch can run the full, 671-billion-parameter DeepSeek-R1 model at up to 3,872 tokens per second. This throughput
This just claims that you can run a bunch of parallel instances of R1.
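To illustrate the distinction (only the 3,872 tokens/second aggregate figure comes from the quote; the per-context decode speed below is a made-up number for illustration):

```python
aggregate_throughput = 3872          # tokens/sec across the whole server (from the quote)
per_context_speed = 30               # hypothetical tokens/sec for a single context
# The aggregate figure is consistent with many slow-ish parallel contexts:
implied_contexts = aggregate_throughput / per_context_speed
print(f"~{implied_contexts:.0f} parallel contexts at {per_context_speed} tok/s each")
# ~129 parallel contexts at 30 tok/s each
```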
Discussion in this other comment is relevant:
Anthropic releasing their RSP was an important change in the AI safety landscape. The RSP was likely a substantial catalyst for policies like RSPs—which contain if-then commitments and more generally describe safety procedures—becoming more prominent. In particular, OpenAI now has a beta Preparedness Framework, Google DeepMind has a Frontier Safety Framework but there aren't any concrete publicly-known policies yet, many companies agreed to the Seoul commitments which require making a similar policy, and SB-1047 required safety and security protocols.
However, I think the way Anthropic presented their RSP was misleading in practice (at least misleading to the AI safety community) in that it neither strictly requires pausing nor do I expect Anthropic to pause until they have sufficient safeguards in practice. I discuss why I think pausing until sufficient safeguards are in place is unlikely, at least in timelines as short as Dario's (Dario Amodei is the CEO of Anthropic), in my earlier post.
...
importantly, in the Dictator's Handbook case, some humans do actually get the power.
Yes, that is an accurate understanding.
(Not sure what you mean by "This would explain Sonnet's reaction below!")
- I think it's much easier to RL on huge numbers of math problems, including because they are easier to verify and because you can more easily get many problems. Also, for random reasons, doing single-turn RL is substantially less complex and maybe faster than multi-turn RL on agency (due to the variable number of steps and variable delays from environments).
- OpenAI probably hasn't gotten around to doing as much computer use RL partially due to prioritization.
Agreed, downvoted my comment. (You can't strong downvote your own comment, or I would have done that.)
I was mostly just trying to point to prior arguments against similar arguments while expressing my view.
(Note that I don't work for Anthropic.)
This is now out.
I don't think it's worth adjudicating the question of how relevant Vanessa's response is (though I do think Vanessa's response is directly relevant).
if the AIs are aligned to these structures, human disempowerment is likely because these structures are aligned to humans way less than they seem
My claim would be that if single-single alignment is solved, this problem won't be existential. I agree that if you literally aligned all AIs to (e.g.) the mission of a non-profit as well as you can, you're in trouble. However, if you have single-single alignment:
- At the most basic level, I expect we'll train AIs to give advice and ask them what they think will happen with various possible governance and alignment structures. If they think a governance structure will yield total human disempowerment, we'll do something else. This is a basic reason not to expect large classes of problems so long as we have single-single aligned AIs which are wise. (Though problems that require coordination to resolve might not be like this.) I'm very skeptical of a world where single-single alignment is well described as being solved and people don't ask for advice (or consider this advice seriously) because they never get around to asking AIs or there are no AIs aligned in such a way that they would try to give good advice.
- I expect organizations will be explicitly controlled by people and (some of) those people will have AI representatives to represent their interests as I discuss here. If you think getting good AI representation is unlikely, that would be a crux, but this would be my proposed solution at least.
- The explicit mission of for-profit companies is to empower the shareholders. It clearly doesn't serve the interests of the shareholders to end up dead or disempowered.
- Democratic governments have similar properties.
- At a more basic level, I think people running organizations won't decide "oh, we should put the AI in charge of running this organization aligned to some mission rather than to the preferences of the people (like me) who currently have de facto or de jure power over this organization". This is a crazily disempowering move that I expect people will by default be too savvy to make in almost all cases. (Both for people with substantial de facto and with de jure power.)
- Even independent of the advice consideration, people will probably want AIs running organizations to be honest to at least the people controlling the organization. Given that I expect explicit control by people in almost all cases, if things are going in an existential direction, people can vote to change them in almost all cases.
- I don't buy that there will be some sort of existential multi-polar trap even without coordination (though I also expect coordination) due to things like the strategy-stealing assumption, as I also discuss in that comment.
- If a subset of organizations diverge from a reasonable interpretation of what they were supposed to do (but are still basically obeying the law and some interpretation of what they were intentionally aligned to) and this is clear to the rest of the world (as I expect would be the case given some type of AI advisors), then the rest of the world can avoid problems from this subset via the court system or other mechanisms. Even if this subset of organizations run by effectively rogue AIs just runs away with resources successfully, this is probably only a subset of resources.
I think your response to a lot of this will be something like:
- People won't have or won't listen to AI advisors.
- Institutions will intentionally delude relevant people to acquire more power.
But, the key thing is that I expect at least some people will keep power, even if large subsets are deluded. E.g., I expect that corporate shareholders, board members, or governments will be very interested in the question of whether they will be disempowered by changes in structure. It does seem plausible (or even likely) to me that some people will engage in power grabs via ensuring AIs are aligned to them, deluding the world about what's going on using a variety of mechanisms (including, e.g., denying or manipulating access to AI representation/advisors), and expanding their (hard) power over time. The thing I don't buy is a case where very powerful people don't ask for advice at all prior to having been deluded by the organizations that they themselves run!
I think human power grabs like this are concerning and there are a variety of plausible solutions which seem somewhat reasonable.
Maybe your response is that the solutions that will be implemented in practice given concerns about human power grabs will involve aligning AIs to institutions in ways that yield the dynamics you describe? I'm skeptical given the dynamics discussed above about asking AIs for advice.
I've updated over time to thinking that trusting AI systems and mostly handing things off to AI systems is an important dynamic to be tracking. As in, the notion of human researcher obsolescence discussed here. In particular, it seems important to understand what is required for this and when we should make this transition.
I previously thought that this handoff should be mostly beyond our planning horizon (for more prosaic researchers) because by the time of this handoff we will have already spent many years being radically (e.g., >10x) accelerated by AI systems, but I've updated against this due to updating toward shorter timelines, faster takeoff, and lower effort on safety (indicating less pausing). I also am just more into end-to-end planning with more of the later details figured out.
As I previously discussed here, I still think that you can do this with Paul-corrigible AGIs, including AGIs that you wouldn't be happy trusting with building a utopia (at least not without letting you reflect and then consulting you about this). However, these AIs will be nearly totally uncontrolled in that you'll be deferring to them nearly entirely. But, the aim is that they'll ultimately give you back the power they acquire, etc. I also think pretty straightforward training strategies could very plausibly work (at least if you develop countermeasures to relatively specific threat models), like studying and depending on generalization.
(I've also updated some toward various of the arguments in this post about vulnerability and first-mover advantage, but I still disagree with the bottom line overall.)
Yeah, people at labs are generally not thoughtful about AI futurism IMO, though of course most people aren't thoughtful about AI futurism. And labs don't really have plans IMO. (TBC, I think careful futurism is hard, hard to check, and not clearly that useful given realistic levels of uncertainty.)
If you have any more related reading for the main "things might go OK" plan in your eyes, I'm all ears.
I don't have a ready to go list. You might be interested in this post and comments responding to it, though I'd note I disagree substantially with the post.
Build concrete evidence of risk, to increase political will towards reducing misalignment risk
Working on demonstrating evidence of risk at a very irresponsible developer might be worse than you'd hope because they might prevent you from publishing or try to prevent this type of research from happening in the first place (given that it might be bad for their interests).
However, it's also possible they'd be skeptical of research done elsewhere.
I'm not sure how to reconcile these considerations.
I think the objection is a good one that "if the AI was really aligned with one agent, it'd figure out a way to help them avoid multipolar traps".
My reply is that I'm worried that avoiding races-to-the-bottom will continue to be hard, especially since competition operates on so many levels.
Part of the objection is in avoiding multipolar traps, but there is also a more basic story like:
- Humans own capital/influence.
- They use this influence to serve their own interests and have an (aligned) AI system which faithfully represents their interests.
- Given that AIs can make high-quality representation very cheap, the AI representation is very good and granular. Thus, something like the strategy-stealing assumption can hold and we might expect that humans end up with the same expected fraction of capital/influence they started with (at least to the extent they are interested in saving rather than consumption).
Even without any coordination, this can potentially work OK. There are objections to the strategy-stealing assumption, but none of these seem existential if we get to a point where everyone has wildly superintelligent and aligned AI representatives and we've ensured humans are physically robust to offense dominant technologies like bioweapons.
(I'm optimistic about being robust to bioweapons within a year or two of having wildly superhuman AIs, though we might run into huge issues during this transitional period... Regardless, bioweapons deployed by terrorists or as part of a power grab in a brief transitional period doesn't seem like the threat model you're describing.)
I expect some issues with races-to-the-bottom / negative sum dynamics / negative externalities like:
- By default, increased industry on earth shortly after the creation of very powerful AI will result in boiling the oceans (via fusion power). If you don't participate in this industry, you might be substantially outcompeted by others[1]. However, I don't think it will be that expensive to protect humans through this period, especially if you're willing to use strategies like converting people into emulated minds. Thus, this doesn't seem at all likely to be literally existential. (I'm also optimistic about coordination here.)
- There might be one-time shifts in power between humans via mechanisms like states becoming more powerful. But, ultimately these states will be controlled by humans or appointed successors of humans if alignment isn't an issue. Mechanisms like competing over the quantity of bribery are zero sum as they just change the distribution of power, and this can be priced in as a one-time shift even without coordination to avoid racing to the bottom on bribes.
But, this still doesn't seem to cause issues with humans retaining control via their AI representatives? Perhaps the distribution of power between humans is problematic and may be extremely unequal and the biosphere will physically be mostly destroyed (though humans will survive), but I thought you were making stronger claims.
Edit in response to your edit: If we align the AI to some arbitrary target which is seriously misaligned with humanity as a whole (due to infighting or other issues), I agree this can cause existential problems.
(I think I should read the paper in more detail before engaging more than this!)
It's unclear if boiling the oceans would result in substantial acceleration. This depends on how quickly you can develop industry in space and dyson sphere style structures. I'd guess the speed up is much less than a year. ↩︎
The paper says:
Christiano (2019) makes the case that sudden disempowerment is unlikely,
This isn't accurate. The post What failure looks like includes a scenario involving sudden disempowerment!
The post does say:
The stereotyped image of AI catastrophe is a powerful, malicious AI system that takes its creators by surprise and quickly achieves a decisive advantage over the rest of humanity.
I think this is probably not what failure will look like,
But, I think it is mostly arguing against threat models involving fast AI capability takeoff (where the level of capabilities takes its creators and others by surprise and fast capabilities progress allows for AIs to suddenly become powerful enough to take over) rather than threat models involving sudden disempowerment from a point where AIs are already well known to be extremely powerful.
I (remain) skeptical that the sort of failure mode described here is plausible if we solve the problem of aligning individual AI systems with their designers' intentions without this alignment requiring any substantial additional costs (that is, we solve single-single alignment with minimal alignment tax).
This has previously been argued by Vanessa here and Paul here in response to a post making a similar claim.
I do worry about human power grabs: some humans obtaining greatly more power as enabled by AI (even if we have no serious alignment issues). However, I don't think this matches the story you describe and the mitigations seem substantially different than what you seem to be imagining.
I'm also somewhat skeptical of the threat model you describe in the case where alignment isn't solved. I think the difference between the story you tell and something more like "You get what you measure" is important.
I worry I'm misunderstanding something because I haven't read the paper in detail.
It sounds like you entirely agree with the logic of the post, but wish that the start of the post mentioned something like what it says at the end:
Humans may live comfortably after the development of AGI, not due to high wages but from other income sources like investments, government welfare, and charity. The latter two sources seem especially promising if AI alignment ensures ongoing support for human well-being.
And perhaps you also wish that the post had an optimistic rather than neutral tone. (And that it more clearly emphasized that AGI would result in large amounts of wealth---at least for someone.)
Given this, it seems your claim is just that you're quite optimistic about some mixture of "investments, government welfare, and charity" coming through to keep people alive and happy. (Given the assumption that (some) humans maintain control, there isn't some other similar catastrophe, and the world is basically as it seems e.g. we're not in a simulation.)
Correspondingly, it seems inaccurate to say:
Jobs and technology have a purpose: producing the goods and services we need to live and thrive. If your model of the world includes the possibility that we would create the most advanced technology the world has ever seen, and the result would be mass starvation, then I think your model is fundamentally flawed.
and
it does so based on a combination of flawed logic and
If someone didn't have wealth (perhaps due to expropriation) and no one gave them anything out of generosity, then because their labor has no value they could starve, and I think you agree? I don't think the statement "Jobs and technology have a purpose" means anything.
Historically, even if no one who owned capital or had power cared at all about the welfare of someone without capital, those with capital would often still be selfishly interested in employing that person. So, even if someone had all their assets taken from them, it would still selfishly make sense for a variety of people to trade with them such that they can survive. In cases where this is less true, human welfare often seems to suffer (see also the resource curse). As you seemingly agree, with sufficient technological development, labor may become unimportant, and if a person doesn't have capital, no one with capital would selfishly be interested in trading with them such that they can remain alive.
In practice, extremely minimal amounts of charity could suffice (e.g., one person could pay to feed the entire world), so I'm personally optimistic about avoiding literal starvation. However, I do worry about the balance of power in a world where human labor is unimportant: democracy and other institutions seem potentially less stable in such a world.
(As far as concerns with institutions and power, see discussion in this recent paper about gradual disempowerment, though note I think I mostly disagree with this paper.)
It doesn't seem very useful for them IMO.
My bet is that this isn't an attempt to erode alignment, but is instead based on thinking that lumping together intentional bad actions with mistakes is a reasonable starting point for building safeguards. Then the report doesn't distinguish between these due to a communication failure. It could also just be a more generic communication failure.
(I don't know if I agree that lumping these together is a good starting point to start experimenting with safeguards, but it doesn't seem crazy.)
15x compute multiplier relative to what? See also here.
I think Zac is trying to say they left not to protest, but instead because they didn't think staying was viable (for whatever research and/or implementation they wanted to do).
On my views (not Zac's), "staying wouldn't be viable for someone who was willing to work in a potentially pretty unpleasant work environment and focus on implementation (and currently prepping for this implementation)" doesn't seem like an accurate description of the situation. (See also Buck's comment here.)
Relatedly, I think Buck far overestimates the influence and resources of safety-concerned staff in a 'rushed unreasonable developer'.
As in, you don't expect they'll be able to implement stuff even if it doesn't make anyone's workflow harder or you don't expect they'll be able to get that much compute?
Naively, we might expect ~1% of compute: there might be around 1000 researchers total, with roughly 10 of them safety-concerned, and 10/1000 is 1%. Buck said 3% because I argued for increasing this number. My case would be that there will be a bunch of cases where the thing they want to do is obviously reasonable and potentially justifiable from multiple perspectives (do some monitoring of internal usage, fine-tune a model for forecasting/advice, use models to do safety research) such that they can pull somewhat more compute than just the head count would suggest.
Yes. But also, I'm afraid that Anthropic might solve this problem by just making fewer statements (which seems bad).
Making more statements would also be fine! I wouldn't mind if there were just clarifying statements even if the original statement had some problems.
(To try to reduce the incentive to make fewer statements, I criticized other labs for not having policies at all.)
I think I roughly stand behind my perspective in this dialogue. I feel somewhat more cynical than I did at the time I did this dialogue, perhaps partially due to actual updates from the world and partially because I was trying to argue for the optimistic case here which put me in a somewhat different frame.
Here are some ways my perspective differs now:
- I wish I said something like: "AI companies probably won't actually pause unilaterally, so the hope for voluntary RSPs has to be building consensus or helping to motivate developing countermeasures". I don't think I would have disagreed with this statement in the past, or at least I wouldn't have fully disagreed with it and it seems like important context.
- I think in practice, we're unlikely to end up with specific tests that are defined in advance and aren't goodhartable or cheatable. I do think that control could in principle be defined in advance and be hard to goodhart using external evaluation, but I don't expect companies to commit to specific tests which are hard to goodhart/cheat. They could make procedural commitments for third-party review which are hard to cheat. Something like "this third party will review the available evidence (including our safety report and all applicable internal knowledge) and then make a public statement about the level of risk and whether there is important information which should be disclosed to the public". (I could outline this proposal in more detail; it's mostly not my original idea.)
- I'm somewhat more interested in companies focusing on things other than safety cases and commitments. Either trying to get evidence of risk that might be convincing to others (in worlds where these risks are large) or working on at-the-margin safety interventions from a cost benefit perspective.
I think the post could directly say "voluntary RSPs seem unlikely to suffice (and wouldn't be pauses done right), but ...".
I agree it does emphasize the importance of regulation pretty strongly.
Part of my perspective is that the title implies a conclusion which isn't quite right and so it would have been good (at least with the benefit of hindsight) to clarify this explicitly. At least to the extent you agree with me.
This post seems mostly reasonable in retrospect, except that it doesn't specifically note that it seems unlikely that voluntary RSP commitments would result in AI companies unilaterally pausing until they were able to achieve broadly reasonable levels of safety. I wish the post more strongly emphasized that regulation was a key part of the picture---my view is that "voluntary RSPs are pauses done right" is wrong, but "RSPs via (international) regulation are pauses done right" seems like it could be roughly right. That said, I do think that purely voluntary RSPs are pretty reasonable and useful, at least if the relevant company is transparent about when they would proceed despite being unable to achieve a reasonable level of safety.
As of now at the start of 2025, I think we know more information that makes this plan look worse.[1] I don't see a likely path to ensuring 80% of companies have a reasonable RSP in short timelines. (For instance, not even Anthropic has expanded their RSP to include ASL-4 requirements about 1.5 years after the RSP came out.) And, beyond this, I think the current regulatory climate is such that we might not get RSPs enforced in durable regulation[2] applying to at least US companies in short timelines even if 80% of companies had good RSPs.
I edited to add the first sentence of this paragraph for clarity. ↩︎
The EU AI act is the closest thing at the moment, but it might not be very durable as the EU doesn't have that much leverage over tech companies. Also, it wouldn't be very surprising if components of this end up being very unreasonable such that companies are basically forced to ignore parts of it or exit the EU market. ↩︎
Anthropic releasing their RSP was an important change in the AI safety landscape. The RSP was likely a substantial catalyst for policies like RSPs—which contain if-then commitments and more generally describe safety procedures—becoming more prominent. In particular, OpenAI now has a beta Preparedness Framework, Google DeepMind has a Frontier Safety Framework but there aren't any concrete publicly-known policies yet, many companies agreed to the Seoul commitments which require making a similar policy, and SB-1047 required safety and security protocols.
However, I think the way Anthropic presented their RSP was misleading in practice (at least misleading to the AI safety community) in that it neither strictly requires pausing nor do I expect Anthropic to pause until they have sufficient safeguards in practice. I discuss why I think pausing until sufficient safeguards are in place is unlikely, at least in timelines as short as Dario's (Dario Amodei is the CEO of Anthropic), in my earlier post.
I also have serious doubts about whether the LTBT will serve as a meaningful check to ensure Anthropic serves the interests of the public. The LTBT has seemingly done very little thus far—appointing only 1 board member despite being able to appoint 3/5 of the board members (a majority)—and the LTBT is down to only 3 members. And none of its members have technical expertise related to AI. (The LTBT trustees seem altruistically motivated and seem like they would be thoughtful about questions about how to widely distribute the benefits of AI, but this is different from being able to evaluate whether Anthropic is making good decisions with respect to AI safety.)
Additionally, in this article, Anthropic's general counsel Brian Israel seemingly claims that the board probably couldn't fire the CEO (currently Dario) in a case where the board believed doing so would greatly reduce profits to shareholders[1]. Almost all of a board's hard power comes from being able to fire the CEO, so if this claim were true, it would greatly undermine the ability of the board (and the LTBT which appoints the board) to ensure Anthropic, a public benefit corporation, serves the interests of the public in cases where this conflicts with shareholder interests. In practice, I think this claim which was seemingly made by the general counsel of Anthropic is likely false and, because Anthropic is a public benefit corporation, the board could fire the CEO and win in court even if they openly thought this would massively reduce shareholder value (so long as the board could show they used a reasonable process to consider shareholder interests and decided that the public interest outweighed them in this case). Regardless, this article is evidence that the LTBT won't provide a meaningful check on Anthropic in practice.
Misleading communication about the RSP
On the RSP, this post says:
On the one hand, the ASL system implicitly requires us to temporarily pause training of more powerful models if our AI scaling outstrips our ability to comply with the necessary safety procedures.
While I think this exact statement might be technically true, people have sometimes interpreted this quote and similar statements as a claim that Anthropic would pause until their safety measures sufficed for more powerful models. I think Anthropic isn't likely to do this; in particular:
- The RSP leaves open the option of revising it to reduce required countermeasures (so pausing is only required until the policy is changed).
- This implies that the required countermeasures would suffice for ensuring a reasonable level of safety, but given that commitments still haven't been made for ASL-4 (the level at which existential or near-existential risks become plausible) and there aren't clear procedural reasons to expect countermeasures to suffice to ensure a reasonable level of safety, I don't think we should assume this will be the case.
- Protections for ASL-3 are defined vaguely rather than having some sort of credible and independent risk analysis process (in addition to best guess countermeasures) and a requirement to ensure risk is sufficiently low with respect to this process. Perhaps ASL-4 requirements will differ; something more procedural seems particularly plausible as I don't see a route to outlining specific tests in advance for ASL-4.
- As mentioned earlier, I expect that if Anthropic ends up being able to build transformatively capable AI systems (as in, AI systems capable of obsoleting all human cognitive labor), they'll fail to provide assurance of a reasonable level of safety. That said, it's worth noting that insofar as Anthropic is actually a more responsible actor (as I currently tentatively think is the case), then from my perspective this choice is probably overall good—though I wish their communication was less misleading.
Anthropic and Anthropic employees often use similar language to this quote when describing the RSP, potentially contributing to a poor sense of what will happen. My impression is that lots of Anthropic employees just haven't thought about this, and believe that Anthropic will behave much more cautiously than I think is plausible (and more cautiously than I think is prudent given other actors).
Other companies have worse policies and governance
While I focus on Anthropic in this comment, it is worth emphasizing that the policies and governance of other AI companies seem substantially worse. xAI, Meta, and DeepSeek have no public safety policies at all, though they have said they will make a policy like this. Google DeepMind has published that they are working on making a frontier safety framework with commitments, but thus far they have just listed potential threat models corresponding to model capabilities and security levels without committing to security for specific capability levels. OpenAI has the beta Preparedness Framework, but the current security requirements seem inadequate and the required mitigations and assessment process are unspecified other than saying that the post-mitigation risk must be medium or below prior to deployment and high or below prior to continued development. I don't expect OpenAI to keep the spirit of this commitment in short timelines. OpenAI, Google DeepMind, xAI, Meta, and DeepSeek all have clearly much worse governance than Anthropic.
What could Anthropic do to address my concerns?
Given these concerns about the RSP and the LTBT, what do I think should happen? First, I'll outline some lower cost measures that seem relatively robust and then I'll outline more expensive measures that don't seem obviously good (at least not obviously good to strongly prioritize) but would be needed to make the situation no longer be problematic.
Lower cost measures:
- Have the leadership clarify its views to Anthropic employees (at least alignment science employees) in terms of questions like: "How likely is Anthropic to achieve an absolutely low (e.g., 0.25%) lifetime level of risk (according to various third-party safety experts) if AIs that obsolete top human experts are created in the next 4 years?", "Will Anthropic aim to have an RSP that would be the policy that a responsible developer would follow in a world with reasonable international safety practices?", "How likely is Anthropic to exit from its RSP commitments if this is needed to be a competitive frontier model developer?".
- Clearly communicate to Anthropic employees (or at least a relevant subset of Anthropic employees) about in what circumstances the board could (and should) fire the CEO due to safety/public interest concerns. Additionally, explain the leadership's policies with respect to cases where the board does fire the CEO—does the leadership of Anthropic commit to not fighting such an action?
- Have an employee liaison to the LTBT who provides the LTBT with more information that isn't filtered through the CEO or current board members. Ensure this employee is quite independent-minded, has expertise on AI safety (and ideally security), and ideally is employed by the LTBT rather than Anthropic.
Unfortunately, these measures aren't straightforwardly independently verifiable based on public knowledge. As far as I know, some of these measures could already be in place.
More expensive measures:
- In the above list, I explain various types of information that should be communicated to employees. Ensure that this information is communicated publicly including in relevant places like in the RSP.
- Ensure the LTBT has 2 additional members with technical expertise in AI safety or minimally in security.
- Ensure the LTBT appoints the board members it can currently appoint and that these board members are independent from the company and have their own well-formed views on AI safety.
- Ensure the LTBT has an independent staff including technical safety experts, security experts, and independent lawyers.
Likely objections and my responses
Here are some relevant objections to my points and my responses:
- Objection: "Sure, but from the perspective of most people, AI is unlikely to be existentially risky soon, so from this perspective it isn't that misleading to think of deviating from safe practices as an edge case." Response: To the extent Anthropic has views, Anthropic has the view that existentially risky AI is reasonably likely to be soon and Dario espouses this view. Further, I think this could be clarified in the text: the RSP could note that these commitments are what a responsible developer would do if we were in a world where being a responsible developer was possible while still being competitive (perhaps due to all relevant companies adopting such policies or due to regulation).
- Objection: "Sure, but if other companies followed a similar policy then the RSP commitments would hold in a relatively straightforward way. It's hardly Anthropic's fault if other companies force it to be more reckless than it would like." Response: This may be true, but it doesn't mean that Anthropic isn't being potentially misleading in their description of the situation. They could instead directly describe the situation in less misleading ways.
- Objection: "Sure, but obviously Anthropic can't accurately represent the situation publicly. That would result in bad PR and substantially undermine their business in other ways. To the extent you think Anthropic is a good actor, you shouldn't be pressuring good actors like them to take actions that will make them differentially less competitive than worse actors." Response: This is pretty fair, but I still think Anthropic could at least avoid making substantially misleading statements and ensure employees are well informed (at least for employees for whom this information is very relevant to their job decision-making). I think it is a good policy to correct misleading statements that result in differentially positive impressions and result in the safety community taking worse actions, because not having such a policy in general would result in more exploitation of the safety community.
The article says: "However, even the board members who are selected by the LTBT owe fiduciary obligations to Anthropic's stockholders, Israel says. This nuance means that the board members appointed by the LTBT could probably not pull off an action as drastic as the one taken by OpenAI's board members last November. It's one of the reasons Israel was so confidently able to say, when asked last Thanksgiving, that what happened at OpenAI could never happen at Anthropic. But it also means that the LTBT ultimately has a limited influence on the company: while it will eventually have the power to select and remove a majority of board members, those members will in practice face similar incentives to the rest of the board." This indicates that the board couldn't fire the CEO if they thought this would greatly reduce profits to shareholders though it is somewhat unclear. ↩︎
I think this is very different from RSPs: RSPs are more like "if everyone is racing ahead (and so we feel we must also race), there is some point where we'll still choose to unilaterally stop racing"
In practice, I don't think any currently existing RSP-like policy will result in a company doing this as I discuss here.
DeepSeek's success isn't much of an update on a smaller US-China gap in short timelines because security was already a limiting factor
Some people seem to have updated towards a narrower US-China gap around the time of transformative AI if transformative AI is soon, due to recent releases from DeepSeek. However, since I expect frontier AI companies in the US will have inadequate security in short timelines and China will likely steal their models and algorithmic secrets, I don't consider the current success of China's domestic AI industry to be that much of an update. Furthermore, if DeepSeek or other Chinese companies were in the lead and didn't open-source their models, I expect the US would steal their models and algorithmic secrets. Consequently, I expect these actors to be roughly equal in short timelines, except in their available compute and potentially in how effectively they can utilize AI systems.
I do think that the Chinese AI industry looking more competitive makes security look somewhat less appealing (and especially less politically viable) and makes it look like their adaptation time to stolen models and/or algorithmic secrets will be shorter. Marginal improvements in security still seem important, and ensuring high levels of security prior to at least ASI (and ideally earlier!) is still very important.
Using the breakdown of capabilities I outlined in this prior post, the rough picture I expect is something like:
- AIs that can 10x accelerate AI R&D labor: Security is quite weak (perhaps <=SL3 as defined in "Securing Model Weights"), so the model is easily stolen if relevant actors want to steal it. Relevant actors are somewhat likely to know AI is a big enough deal that stealing it makes sense, but AI is not necessarily considered a top priority.
- Top-Expert-Dominating AI: Security is somewhat improved (perhaps <=SL4), but still pretty doable to steal. Relevant actors are more aware, and the model probably gets stolen.
- Very superhuman AI: I expect security to be improved by this point (partially via AIs working on security), but effort on stealing the model could also plausibly be unprecedentedly high. I currently expect security implemented before this point to suffice to prevent the model from being stolen.
Given this, I expect that key early models will be stolen, including models that can fully substitute for human experts, and so the important differences between actors will mostly be driven by compute, adaptation time, and utilization. Of these, compute seems most important, particularly given that adaptation and utilization time can be accelerated by the AIs themselves.
This analysis suggests that export controls are particularly important, but they would need to apply to hardware used for inference rather than just attempting to prevent large training runs through memory bandwidth limitations or similar restrictions.
Seems very sensitive to the type of misalignment, right? As an extreme example, suppose literally all AIs have long-run and totally inhuman preferences with linear returns. Such AIs might instrumentally decide to be as useful as possible (at least in domains other than safety research) for a while prior to a treacherous turn.
- Following instructions never to design a chemical weapon with probability at least 99.999% is also a capability.
This requires a capability, but also requires a propensity. For example, smart humans are all capable of avoiding armed robbery with pretty high reliability, but some of them do armed robbery despite being told not to do armed robbery at an earlier point in their life. You could say these robbers didn't have the capability to follow instructions, but this would be an atypical use of these (admittedly fuzzy) words.
FWIW, I think recursive self-improvement via just software (a software-only singularity) is reasonably likely to be feasible (perhaps 55%), but this alone doesn't suffice for takeoff being arbitrarily fast.
Further, even objectively very fast takeoff (von Neumann to superintelligence in 6 months) can be enough time to win a war etc.