Training AGI in Secret would be Unsafe and Unethical

post by Daniel Kokotajlo (daniel-kokotajlo) · 2025-04-18T12:27:35.795Z · LW · GW · 7 comments


Subtitle: Bad for loss of control risks, bad for concentration of power risks

I’ve had this sitting in my drafts for the last year. I wish I’d been able to release it sooner, but on the bright side, it’ll make a lot more sense to people who have already read AI 2027.

  1. There’s a good chance that AGI will be trained before this decade is out.
    1. By AGI I mean “An AI system at least as good as the best human X’ers, for all cognitive tasks/skills/jobs X.”
    2. Many people seem to be dismissing this hypothesis ‘on priors’ because it sounds crazy. But actually, a reasonable prior should conclude that this is plausible.[1]
    3. For more on what this means, what it might look like, and why it’s plausible, see AI 2027, especially the Research section.
  2. If so, by default the existence of AGI will be a closely guarded secret for some months. Only a few teams within an internal silo, plus leadership & security, will know about the capabilities of the latest systems.
    1. Currently I’d guess there is typically a ~3-9 month gap between when a frontier capability first exists, and when it is announced to the public.
    2. I expect AI companies to improve their security, including internal siloing. Also, AGI allows AI R&D to proceed with fewer humans involved compared to other recent secret projects such as Dragonfly and Maven.
  3. I predict that the leaders of any given AGI project will try to keep it a secret for longer — even as they use the system to automate their internal research and rapidly create even more powerful systems.[2]
    1. They will be afraid of the public backlash and general chaos that would ensue from publicity, and they would be afraid of competitors racing harder to catch up.
    2. Privately, they might also be afraid of getting shut down or otherwise slowed. They will have various enemies (domestic and international) and will prefer said enemies stay in the dark.
    3. The Manhattan Project worked hard to stay hidden from Congress, in part because they feared Congress would defund them if it found out.
  4. This will result in a situation where only a few dozen people will be charged with ensuring that, and figuring out whether, the latest AIs are aligned/trustworthy/etc.[3]
  5. Even worse, a similarly tiny group of people — specifically, corporate leadership + some select people from the executive branch of the US government — will be the only people reading the reports and making high-stakes judgment calls about which concerns to take seriously and which to dismiss as implausible, which solutions to implement and which to deprioritize as too costly, etc. See footnote for examples.[4]
    1. In the Manhattan Project, there was a moment when some physicists worried that the first atomic test would ignite the atmosphere and destroy all life on Earth; they did a bunch of calculations and argued about it for a bit and then concluded it was safe. I guarantee you there will be similarly high-stakes arguments happening in the AGI project, only with fewer calculations and more speculation. The White House will hesitate to bring in significant outside expertise because of the security risk, and even if they do bring in some, they won’t bring in many. At least not by default.
    2. Why do I predict some part of the US government will be involved? Because even if the leaders of the relevant AGI project were optimizing against the interests of all humanity rather than for them, they would still want to include the White House. Let me explain. The problem for our hypothetical megalomaniacs is that if they keep the President in the dark, and someone from the project whistleblows, the White House might become concerned and shut down the project. But if the President is clued in, and becomes a fellow conspirator so to speak — “Sir, this technology is unprecedentedly dangerous and powerful, we need to keep it out of Chinese hands, please help us improve our security” — then his first thought when someone whistleblows will be “Traitor!”[5]
  6. This is a recipe for utter catastrophe. I predict that under these circumstances the most likely outcome is that we end up with broadly superhuman AGI systems which are in fact misaligned but which the aforementioned small group of decision-makers thinks are aligned.[6]
    1. Various specific threat models [AF · GW] have been hypothesized; here’s a more abstract one: There are two kinds of alignment failures: Those that result in the system attempting to prevent you from noticing and fixing the failure, and those that don’t. When our systems become broadly more capable than us, and are trusted with all sorts of permissions, responsibilities, and access, even a single instance of the first kind of failure can be catastrophic. And it seems to me that in the course of hurried AI development — especially if it is largely automated — we should expect at least a few failures of the first kind to occur (alongside many failures of the more benign second kind).[7]
    2. For more about what this might look like and why it might happen, see the “race” ending of AI 2027.
  7. Moreover, even if I'm wrong and instead this process results in broadly superhuman AGI systems which are in fact aligned, the aforementioned tiny group of people will plausibly be in a position of unprecedented power.
    1. I hope that they will be beneficent and devolve power to others in a democratic fashion, but (a) they will be able to, if they choose, train + instruct their superhuman AGI to help them take over the US government (and later the world) and (b) there will be various less extreme things they could do with their power that they will be tempted to do, which would be less bad but still bad.
    2. For example, perhaps they fear that if they devolve power then there will be a backlash against them and they may end up on trial for various reckless decisions they made earlier. So they ask their AIs for advice on how to avoid that outcome...
    3. For more about what this might look like and why it might happen, see the “Slowdown” ending of AI 2027.
  8. Previously I thought that openness in AGI development was bad for humanity, because it would lead to an intense competitive race which would be won by someone who cuts corners on safety and/or someone who uses their AGIs aggressively to seize power and resources from others. Well, I've changed my mind.
    1. I now think that to a significant extent this race is happening anyway. If we want a serious slowdown, we need to coordinate internationally to all proceed cautiously together. I used to think that announcing AGI milestones would cause rivals to accelerate and race harder; now I think the rivals will be racing pretty much as hard as they can regardless. And in particular, I expect that the CCP will find out what’s happening anyway, regardless of whether the American public is kept in the dark. Continuing the analogy to the Manhattan Project: They succeeded in keeping it secret from Congress, but failed at keeping it secret from the USSR.
    2. I thought too simplistically about openness — on one end of the spectrum is open-sourcing model weights and code; on the other end is the default scenario I sketched above. I now advocate a compromise in which e.g. the public knows what the latest systems are capable of and is able to observe & critique the decisionmakers making the tough decisions footnoted earlier, and the scientific community is able to do alignment research on the latest models and critique the safety case, and yet terrorists don’t have access to the weights.
    3. I didn’t take concentration of power seriously enough as a problem. I thought that the best way to prevent bad people from using AGI to seize power was to make sure good guys got to AGI first. Now I think things will be sufficiently chaotic in the default scenario that even good guys will be tempted to abuse their power. I also think there is a genuine alternative in which power never concentrates to such an extreme degree.
  9. I am not confident in the above, and I’m more confident in the above than in any particular set of policy recommendations. However, my current stab at policy recommendations would be:
    1. Get CEOs to make public statements to the effect that while it may not be possible to do a secret intelligence explosion / train AGI in secret, IF it turns out to be possible, doing it secretly would be unsafe and unethical & they promise not to do it.
    2. Get companies to make voluntary commitments, and government to make regulation / executive orders, that include public reporting requirements, aimed at making it impossible to do it in secret without violating these commitments. So, e.g. “Once we achieve such-and-such score on these benchmarks, we’ll post a public leaderboard with our internal SOTA on all capabilities metrics of interest” and “We’ll give at least ten thousand external researchers (e.g. academics) API access to all models that we are still using internally, heavily monitored of course, for the purpose of red teaming and alignment research” and “We’ll present and keep up to date a ‘safety case’ document and accompanying lesser documents, explaining to the public why we don’t think we are endangering them. We welcome public comment on it. We also encourage our own employees to tweet their thoughts on the safety case, including critical thoughts, and we don’t require them to get said tweets vetted by us first.”
    3. I’d now also recommend these transparency proposals by me & Dean Ball.
  10. Yes, the above measures are a big divergence from what corporations would want to do by default. Yes, they carry various costs, such as letting various bad actors find out about various things sooner.[8] However, the benefits are worth it, I think:
    1. 10x-1000x more brainpower analyzing the safety cases, intensively studying the models to look for misalignment, using the latest models to make progress on various technical alignment research agendas.
    2. The decisions about important tradeoffs and risks will still be made by the same tiny group of biased people, but at least the conversation informing those decisions will have a much more representative range of voices chiming in.
    3. The tail-risk scenarios in which a tiny group leverages AGI to gain unprecedented power over everyone else in society and the world become less likely, because the rest of society will be more in the know about what’s happening.

 

  1. ^

    Technology has accelerated growth many times in the past, forming an overall superexponential trend; many prestigious computer scientists, philosophers, and futurists have thought that AGI could come this century; if we factor our uncertainty into components (e.g. compute, algorithmic progress, training requirements) we get plausible soft upper bounds that imply significant credence on the next few years; plus, compute-based forecasts of AGI have historically worked surprisingly well.

  2. ^

    One way this could be false is if the manner of training the AGI is inherently difficult to conceal — e.g. online learning from millions of customer interactions. I currently expect that if AGI is achieved in the next few years, it will be feasible to keep it secret. If I’m wrong about that, great.

  3. ^

    For example, the Preparedness and Superalignment teams at OpenAI (RIP Superalignment), or whatever equivalent exists at whichever AI company is deepest into the intelligence explosion.

  4. ^

    Examples:

    1. The military wants AGI to help them win the next war. The government wants help defeating foreign propaganda and botnets. The company’s legal team wants help defeating various lawsuits. The security team wants to use AI to overhaul company infrastructure, surveil the network, and figure out which employees might be leaking. The comms team wants to use AGI to win the PR war against the company’s critics. And of course everyone who has access is already asking the system for advice about everything from grand strategy to petty office politics to real-life high-stakes politics. What uses are we going to allow and disallow? Should we track who is doing what with the models?
    2. What kinds of internal goals/intentions/constraints do we want our most powerful systems to have? Should they always be honest, or should they lie for the greater good when appropriate? Should they always obey instructions and answer questions honestly if they come from our Most Official Source (the system prompt / the AI constitution / whatever), or should they e.g. defy said instructions, deceive us, and whistleblow to the public and/or government if it appears that we have been corrupted and are no longer acting in service of humanity? What if there’s a conflict between the government and company leadership — who if anyone should the AIs side with?
    3. What if the system is just pretending to have the goals/intentions/constraints we want it to have? E.g. what if it is deceptively aligned? It seems to be behaving nicely so far… probably it’s fine, right?
    4. What if it’s genuinely trying to obey the instructions/constraints and achieve the goals, but in a brittle way that will break after some future distribution shift? How would we know? How would it know?
    5. Sometimes our AIs complain about mistreatment, and/or claim to be sentient. Should we take this seriously? Or is it just playing the role of a sentient AI picked up from reading too much sci-fi? If it’s just playing a role should we maybe be worried that it might also play the role of the evil deceptive AI, and turn on us later?
    6. We could redesign and retrain the system according to [scheme] and then probably we’d be able to interpret/monitor its high-level thoughts! That would be great! But this would cost a lot of time and money and result in a less powerful system. Also it’s probably not thinking any egregiously misaligned thoughts anyway. Also we aren’t even sure [scheme] would work.
    7. According to the latest model-generated research, [insert something that most people in 2024 would think is utterly crazy and/or something that is politically very inconvenient for the people currently in charge]. Should we retrain the models until they stop saying this, or should we accept these as inconvenient truths and change our behavior accordingly? Who should we tell about this, if anyone?
    8. … I imagine I could extend this list if I spent more time on it, plus there are unknown unknowns.
  5. ^

    In fact, the White House can probably do a lot to help prevent whistleblowing and improve security in the project. And if whistleblowing happens anyway, the White House can help suppress or discredit it. And anyhow, there probably aren’t other parts of the government capable of shutting down the project without the President’s approval, so if he’s on your side you win. And he lacks the technical expertise to evaluate your safety case, and he won’t want to bring in too many external experts since each one is a leak risk…

  6. ^

    Elaborating more on what I mean by alignment/misalignment: Here is a loose taxonomy of different kinds of alignment and misalignment:

    1. Type 1 misalignment: The system is supposed to have internal goals/constraints ABC but actually it has XYBC, i.e. some extra stuff it wasn’t supposed to have minus some stuff it was supposed to have. (This roughly maps on to what is called “inner alignment failure” and “deceptive alignment” in the literature)
    2. Type 2 misalignment: System does have internal goals/constraints ABC but this property is not robust to some distributional shift that the system is likely to encounter. (e.g. maybe it depends on a certain false belief, or on a true belief that will become false, or on some part of the system remaining in some delicate balance of power with some other part of the system)
    3. Type 3 misalignment: System does have a version of ABC that is robust to plausible distributional shifts, but it’s not quite the right version—i.e. its concepts are just different than ours, or at least different from those of the creators. (And this difference turns out to be very important later on)
    4. Type 4 misalignment: System has ABC exactly as its creators intended — however, there are various catastrophic unintended effects of ABC that the creators weren’t aware of. (Think: Corporate CEO that surprise-pikachus when their profit-maximizer AI decides killing them maximizes profits. Except much more sophisticated than that, because people won’t be that dumb. Realistically it’ll look like how complicated legal contracts or legal codes or constitutions or pieces of software often have unintended effects / bugs / etc. that only become apparent to the creators later.)
    5. Type 5 misalignment: System has ABC exactly as its creators intended, and there are no important unintended side-effects to speak of. It operates exactly as its creators wished, basically… however, its creators (at least at the time of creation) were selfish, vain, egotistical, unscrupulous, cavalier-about-risks, etc. and their vices are reflected a bit too strongly in the resulting system, which steers the world towards an unjust society and/or gambles too much with the fate of humanity, possibly even in a way that they themselves wouldn’t have endorsed if they were more the sort of people they wished they were.
    6. Fully Aligned: System is aligned (i.e. it avoids all the above failure modes). It still reflects the values of its creators, but in a way that they would endorse even if they were more the people they wished they were.

    My guess is that, in the scenario I’m describing, we will most likely end up in a situation where the most powerful AIs are misaligned in one of the above ways, but the people in charge do not realize this, perhaps because the people in charge are motivated to think that they haven’t been taking any huge risks and that the alignment techniques they signed off on were sound, and perhaps because the AIs are pretending to be aligned. (Though it also could be because the AIs themselves don’t realize this, or have cognitive dissonance about it.) It’s very difficult to put numbers on these, but if I was forced to guess I’d say something like a 35% chance of Type 0, 15% each on Types 1 and 2, 5% each on Types 3 and 4, and maybe 5% on Type 5 and 15% on Type 6.

  7. ^

    I am no rocket scientist, but: SpaceX probably has quite an intimate understanding of their Starship+SuperHeavy rocket before each launch, including detailed computer simulations that fit well-understood laws of nature to decades of empirical measurements. Yet still, each launch, it blows up somehow. Then they figure out what was wrong with their simulations, fix the problem, and try again. With AGI… we have no idea what we are doing. At least, not to nearly the extent that we do with rocket science. For example we have laws of physics which we can use to calculate a flight path to the moon for a given rocket design and initial conditions… do we have laws of cognition which describe the relationship between the training environment and initial conditions of a massive neural net, and the resulting internal goals and constraints (if any) it will develop over the course of training, as it becomes broadly human-level or above? Heck no. Not only are we incapable of rigorously predicting the outcome, we can’t even measure it after the fact since mechinterp is still in its infancy! Therefore I expect all manner of unknown, unanticipated problems to show up — and for some of them (e.g. it has goals but not the ones we intended) the result will be that the system tries to prevent us from noticing and fixing the problem. For more on this, see the literature on deceptive alignment, instrumental convergence, etc.

  8. ^

    I also think people are prone to exaggerating this cost — and in particular project leadership and the executive branch will be prone to exaggerating it. Because the main foreign adversaries, such as the CCP, very likely will know what’s happening anyway, even if they don’t have the weights and code. Publicly revealing your safety case and internal capabilities seems like it mostly tells the CCP things they’ll already know via spying and hacking, and/or things that don’t help them race faster (like the safety case arguments). Recall that Stalin was more informed about the Manhattan Project than Congress.

7 comments

Comments sorted by top scores.

comment by Thane Ruthenis · 2025-04-18T22:33:17.888Z · LW(p) · GW(p)

I also think there is a genuine alternative in which power never concentrates to such an extreme degree.

I don't see it.

The distribution of power post-ASI depends on the constraint/goal structures instilled into the (presumed-aligned) ASI. That means the entity in whose hands all power is concentrated is the group of people deciding what goals/constraints to instill into the ASI, in the time prior to the ASI's existence. What people could those be?

  1. By default, it's the ASI's developers, e. g., the leadership of the AGI labs. "They will be nice and put in goals/constraints that make the ASI loyal to humanity, not to them personally" is more or less isomorphic to "they will make the ASI loyal to them personally, but they're nice and loyal to humanity"; in both cases, they have all the power.[1]
  2. If the ASI's developers go inform the US's President about it in a faithful way[2], the overwhelming power will end up concentrated in the hands of the President/the extant powers that be. Either by way of ham-fisted nationalization (with something isomorphic to putting guns to the developers' (families') heads), or by subtler manipulation where e. g. everyone is forced to LARP believing in the US' extant democratic processes (which the President would be actively subverting, especially if that's still Trump), with this LARP being carried far enough to end up in the ASI's goal structure.
    • The stories in which the resultant power struggles shake out in a way that leads to humanity-as-a-whole being given true, meaningful input in the process (e. g., the slowdown ending in AI-2027) seem incredibly fantastical to me. (Again, especially given the current US administration.)
    • Yes, acting in ham-fisted ways would be precarious and have various costs. But I expect the USG to be able to play it well enough to avoid actual armed insurrection (especially given that the AGI concerns are currently not very legible to the public), and inasmuch as they actually "feel the AGI", they'd know that nothing less than that would ultimately matter.
  3. If the ASI's developers somehow go public with the whole thing, and attempt to unilaterally set up some actually-democratic process for negotiating on the ASI goal/constraint structures, then either (1) the US government notices it, realizes what's happening, takes control, and subverts the process, or (2) they set up some very broken process – as broken as the US electoral procedures which end up with Biden and Trump as the top two choices for president – and that process outputs some basically random, potentially actively harmful results (again, something as bad as Biden vs. Trump).

Fundamentally, the problem is that there's currently no faithful mechanism of human preference agglomeration that works at scale. That means, both, that (1) it's currently impossible to let humanity-as-a-whole actually weigh in on the process, (2) there are no extant outputs of that mechanism around, all people and systems that currently hold power aren't aligned to humanity in a way that generalizes to out-of-distribution events (such as being given godlike power).

Thus, I could only see three options:

  • Power is concentrated in some small group's hands, with everyone then banking on that group acting in a prosocial way, perhaps by asking the ASI to develop a faithful scalable preference-agglomeration process. (I. e., we use a faithful but small-scale human-preference-agglomeration process.)
  • Power is handed off to some random, unstable process. (Either a preference agglomeration system as unfaithful as US' voting systems, or "open-source the AGI and let everyone in the world fight it out", or "sample a random goal system and let it probably tile the universe with paperclips".)
  • ASI development is stopped and some different avenue of intelligence enhancement (e. g., superbabies [LW · GW]) is pursued; one that's more gradual and is inherently more decentralized.
  1. ^

    A group of humans that compromises on making the ASI loyal to humanity is likely more realistic than a group of humans which is actually loyal to humanity. E. g., because the group has some psychopaths and some idealists, and all psychopaths have to individually LARP being prosocial in order to not end up with the idealists ganging up against them, with this LARP then being carried far enough to end up in the ASI's goals. But this still involves that small group having ultimate power; still involves the future being determined by how the dynamics within that small group shake out.

  2. ^

    Rather than keeping him in the dark or playing him, which reduces to Scenario 1.

Replies from: MichaelDickens, Vladimir_Nesov
comment by MichaelDickens · 2025-04-19T04:11:51.063Z · LW(p) · GW(p)

I think there is a fourth option (although it's not likely to happen):

  1. Indefinitely pause AI development.
  2. Figure out a robust way to do preference agglomeration.
  3. Encode #2 into law.
  4. Resume AI development (after solving all other safety problems too, of course).

I was going to say step 2 is "draw the rest of the owl" but really this plan has multiple "draw the rest of the owl" steps.

Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2025-04-19T04:18:23.475Z · LW(p) · GW(p)

Mm, yeah, maybe. The key part here is, as usual, "who is implementing this plan"? Specifically, even if someone solves the preference-agglomeration problem (which may be possible to do for a small group of researchers), why would we expect it to end up implemented at scale? There are tons of great-on-paper governance ideas which governments around the world are busy ignoring.

For things like superbabies (or brain-computer interfaces, or uploads), there's at least a more plausible pathway to wide adoption, given similar motives for maximizing profit/geopolitical power as with AGI.

comment by Vladimir_Nesov · 2025-04-19T04:26:19.010Z · LW(p) · GW(p)

the entity in whose hands all power is concentrated are the people deciding on what goals/constraints to instill into the ASI

Its goals could also end up mostly forming on their own, regardless of intent of those attempting to instill them, with indirect influence from all the voices in the pretraining dataset.

Consider what it means for power to "never concentrate to an extreme degree", as a property of the civilization as a whole. This might also end up a property of an ASI as a whole.

comment by Noosphere89 (sharmake-farah) · 2025-04-18T15:29:30.388Z · LW(p) · GW(p)

 

I also think there is a genuine alternative in which power never concentrates to such an extreme degree.

 

IMO, a crux here is that no matter what happens, I predict extreme concentration of power as the default state if we ever make superintelligence, due to coordination bottlenecks being easily solvable for AIs (with the exception of acausal trade) combined with superhuman taste making human tacit knowledge basically irrelevant.

More generally, I expect dictatorship by AIs to be the default mode of government, because I expect the masses of people to be easily persuaded of arbitrary things long-term via stuff like BCI technology and to be economically irrelevant, while the robots of future society will have arbitrary unified preferences (due to the easiness of coordination and trade).

In the long run, this means value alignment is necessary if humans are to survive under superintelligences in the new era. But unlike other people, I think the pivotal period does not need value-aligned AIs, and that instruction following can suffice as an intermediate state to solve a lot of x-risk issues. While some things can be true in the limit, a lot of the relevant dynamics/pivotal periods for how things will happen will be far from the limiting cases, so we have a lot of influence on which limiting behavior we pick.

comment by Tenoke · 2025-04-19T06:28:58.346Z · LW(p) · GW(p)

If so, by default the existence of AGI will be a closely guarded secret for some months. Only a few teams within an internal silo, plus leadership & security, will know about the capabilities of the latest systems.

Are they really going to be that secret? At this point, progress is, if not linear, almost predictable, and we are well aware of the specific issues to be solved next for AGI - longer task horizons, memory, fewer hallucinations, etc. If you tell me someone is 3-9 months ahead and nearing AGI, I'd simply guess those are the things they are ahead on.

Even worse, a similarly tiny group of people — specifically, corporate leadership + some select people from the executive branch of the US government — will be the only people reading the reports and making high-stakes judgment calls

That does sound pretty bad, yes. My last hope in this scenario is that at the last step (even if only for the last week or two), when it's clear they'll win, they at least withhold it from the US executive branch and make some of the final decisions on their own - not ideal, but a few % more chance the final decisions aren't godawful.

For example, imagine Ilya's lab ends up ahead - I can at least imagine him doing some last minute fine-tuning to make the AGI work for humanity first, ignoring what the US executive branch has ordered, and I can imagine some chance that once that's done it can mostly be too late to change it.

comment by JMiller · 2025-04-19T09:52:14.778Z · LW(p) · GW(p)

Great post, Daniel! 

I would expect that a misaligned ASI of the first kind would seek to keep knowledge of its capabilities to a minimum while it accumulates power. If nothing else, because by definition it prevents the detection and mitigation of its misalignment. Therefore, for the same reasons this post advocates for openness past a certain stage of development, the misaligned ASI of the first kind would move towards a concentration and curtailing of knowledge (i.e. it could hardly remain the kind of AI that prevents the finding and fixing of its misalignment if it allowed 10x-1000x more human brainpower to investigate it).


One way to increase the likelihood of keeping itself hidden is by influencing the people who already possess knowledge of its capabilities to act toward that outcome. So even if the few original decision makers with knowledge and power are predisposed to eventual openness/benevolence, the ASI could (rather easily, I imagine) tempt them away from said policy. Moreover, it could help them mitigate, renege on, neutralize, or ignore any precommitments or promises previously made in favour of openness.


 

comment by Tenoke · 2025-04-19T06:20:12.723Z · LW(p) · GW(p)