If we solve alignment, do we die anyway?

post by Seth Herd · 2024-08-23T13:13:10.933Z · LW · GW · 65 comments

Contents

  The first AGIs will probably be aligned to take orders
  The first AGI probably won't perform a pivotal act
  So RSI-capable AGI may proliferate until a disaster occurs
  Counterarguments/Outs
    Please convince me I'm wrong. Or make stronger arguments that this is right.
  (Edit:) Conclusions after discussion
65 comments

Epistemic status: I'm aware of good arguments that this scenario isn't inevitable, but it still seems frighteningly likely even if we solve technical alignment. Clarifying this scenario seems important.

TL;DR: (edits in parentheses, two days after posting, from discussions in comments)

  1. If we solve alignment, it will probably be used to create AGI that follows human orders.
  2. If takeoff is slow-ish, a pivotal act that prevents more AGIs from being developed will be difficult (risky or bloody).
  3. If no pivotal act is performed, AGI proliferates. (It will soon be capable of recursive self-improvement (RSI).) This creates an n-way non-iterated Prisoner's Dilemma where the first to attack probably wins (by hiding and improving intelligence and offensive capabilities at a fast exponential rate).
  4. Disaster results. (Extinction or permanent dystopia are possible if vicious humans order their AGI to attack first while better humans hope for peace.)
  5. (Edit later: After discussion and thought, the above seems so inevitable and obvious that the first group(s) to control AGI(s) will probably attempt a pivotal act before fully RSI-capable AGI proliferates, even if it's risky.)

The first AGIs will probably be aligned to take orders

People in charge of AGI projects like power. And by definition, they like their values somewhat better than the aggregate values of all of humanity. It also seems like there's a pretty strong argument that Instruction-following AGI is easier than value aligned AGI. In the slow-ish takeoff we expect, this alignment target seems to allow for error-correcting alignment, in somewhat non-obvious ways. If this argument holds up even weakly, it will be an excuse for the people in charge to do what they want to anyway. 

I hope I'm wrong and value-aligned AGI is just as easy and likely. But it seems like wishful thinking at this point.

The first AGI probably won't perform a pivotal act

In realistically slow takeoff scenarios, the AGI won't be able to do anything like make nanobots to melt down GPUs. It would have to use more conventional methods, like software intrusion to sabotage existing projects, followed by elaborate monitoring to prevent new ones. Such a weak attempted pivotal act could fail, or could escalate to a nuclear conflict.

Second, the humans in charge of AGI may not have the chutzpah to even try such a thing. Taking over the world is not for the faint of heart. They might get it after their increasingly-intelligent AGI carefully explains to them the consequences of allowing AGI proliferation, or they might not. If the people in charge are a government, the odds of such an action go up, but so do the risks of escalation to nuclear war. Governments seem to be fairly risk-taking. Expecting governments to not just grab world-changing power while they can seems naive [LW(p) · GW(p)], so this is my median scenario.

So RSI-capable AGI may proliferate until a disaster occurs

If we solve alignment and create personal intent aligned AGI but nobody manages a pivotal act, I see a likely future world with an increasing number of AGIs capable of recursively self-improving. How long until someone tells their AGI to hide, self-improve, and take over?

Many people seem optimistic about this scenario. Perhaps network security can be improved with AGIs on the job. But AGIs can do an end-run around the entire system: hide, set up self-replicating manufacturing (robotics is rapidly improving to allow this), use that to recursively self-improve your intelligence, and develop new offensive strategies and capabilities until you've got one that will work within an acceptable level of viciousness.[1] 

If hiding in factories isn't good enough, do your RSI manufacturing underground. If that's not good enough, do it as far from Earth as necessary. Take over with as little violence as you can manage or as much as you need. Reboot a new civilization if that's all you can manage while still acting before someone else does. 

The first one to pull out all the stops probably wins. This looks all too much like a non-iterated Prisoner's Dilemma with N players - and N increasing.
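To make the "N increasing" worry concrete, here is a minimal toy sketch. It assumes each AGI controller independently decides to strike first with some small probability; both the independence assumption and the parameter name `p_defect` are illustrative and not from the argument above. Under that assumption, the chance that at least one actor defects is 1 - (1 - p)^N, which climbs toward certainty as N grows.

```python
def prob_someone_strikes_first(n_actors: int, p_defect: float) -> float:
    """Chance that at least one of n independent actors attacks first.

    p_defect is a hypothetical per-actor probability of ordering a first
    strike; treating actors as independent is a simplifying assumption.
    """
    if n_actors < 0:
        raise ValueError("n_actors must be non-negative")
    if not 0.0 <= p_defect <= 1.0:
        raise ValueError("p_defect must be a probability")
    # Complement of "nobody defects".
    return 1.0 - (1.0 - p_defect) ** n_actors


if __name__ == "__main__":
    # Illustrative values only: a 2% per-actor chance of defecting.
    for n in (1, 5, 9, 50, 500):
        print(n, round(prob_someone_strikes_first(n, 0.02), 3))
```

The point of the sketch is only the shape of the curve: even if almost every controller is peaceful, the odds that nobody strikes first shrink geometrically with each additional RSI-capable AGI.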

Counterarguments/Outs

For small numbers of AGI and similar values among their wielders, a collective pivotal act could be performed. I place some hopes here, particularly if political pressure is applied in advance to aim for this outcome, or if the AGIs come up with better cooperation structures and/or arguments than I have.

The nuclear MAD standoff with nonproliferation agreements is fairly similar to the scenario I've described. We've survived that so far, but with only nine participants to date.

One means of preventing AGI proliferation is universal surveillance by a coalition of loosely cooperative AGI (and their directors). That might be done without universal loss of privacy if a really good publicly encrypted system were used, as Steve Omohundro suggests [LW(p) · GW(p)], but I don't know if that's possible. If privacy can't be preserved, this is not a nice outcome, but we probably shouldn't ignore it.

The final counterargument is that, if this scenario does seem likely, and this opinion spreads, people will work harder to avoid it, making it less likely. This virtuous cycle is one reason I'm writing this post including some of my worst fears.

Please convince me I'm wrong. Or make stronger arguments that this is right.

I think we can solve alignment, at least for personal-intent alignment [LW · GW], and particularly for the language model cognitive architectures [AF · GW] that may well be our first AGI [LW · GW]. But I'm not sure I want to keep helping with that project until I've resolved the likely consequences a little more. So give me a hand?

(Edit:) Conclusions after discussion

None of the suggestions in the comments seemed to me like workable ways to solve the problem.

I think we could survive an n-way multipolar human-controlled ASI scenario if n is small - like a handful of ASIs controlled by a few different governments. But not indefinitely - unless those ASIs come up with coordination strategies no human has yet thought of (or argued convincingly enough that I've heard of it - this isn't really my area, but nobody has pointed to any strong possibilities in the comments). I'd love more pointers to coordination strategies that could solve this problem.

So my conclusion is to hope that this is so obviously such a bad/dangerous scenario that it won't be allowed to happen.

Basically, my hope is that this all becomes viscerally obvious to the first people who speak with a superhuman AGI and who think about global politics. I hope they'll pull their shit together, as humans sometimes do when they're motivated to actually solve hard problems. 

I hope they'll declare a global moratorium on AGI development and proliferation, and agree to share the benefits of their AGI/ASI broadly in hopes that this gets other governments on board, at least on paper. They'd use their AGI to enforce that moratorium, along with hopefully minimal force. Then they'll use their intent-aligned AGI to solve value alignment and launch a sovereign ASI before some sociopath(s) gets ahold of the reins of power and creates a permanent dystopia of some sort.

More on this scenario in my reply below. [LW(p) · GW(p)]

I'd love to get more help thinking about how likely the central premise is: that people get their shit together once they're staring real AGI in the face. And what we can do now to encourage that.

Additional edit: Eli Tyre and Steve Byrnes have reached similar conclusions by somewhat different routes. More in a final footnote.[2]

  1. ^

    Some maybe-less-obvious approaches to takeover, in ascending order of effectiveness: Drone/missile-delivered explosive attacks on individuals controlling and data centers housing rival AGI; Social engineering/deepfakes to set off cascading nuclear launches and reprisals; dropping stuff from orbit or altering asteroid paths; making the sun go nova. 

    The possibilities are limitless. It's harder to stop explosions than to set them off by surprise. A superintelligence will think of all of these and much better options. Anything more subtle that preserves more of the first actors' near-term winnings (earth and humanity) is gravy. The only long-term prize goes to the most vicious. 

  2. ^

    Eli Tyre reaches similar conclusions with a more systematic version of this logic in  Unpacking the dynamics of AGI conflict that suggest the necessity of a premptive pivotal act [LW · GW]:

    Overall, the need for a pivotal act depends on the following conjunction / disjunction.

    The equilibrium of conflict involving powerful AI systems lands on a technology / avenue of conflict which are (either offense dominant, or intelligence-advantage dominant) and can be developed and deployed inexpensively or quietly.

    Unfortunately, I think all three of these are very reasonable assumptions about the dynamics of AGI-fueled war. The key reason is that there is adverse selection on all of these axes.

    Steve Byrnes reaches similar conclusions in What does it take to defend the world against out-of-control AGIs? [LW · GW], but he focuses on near-term, fully vicious attacks from misaligned AGI, prior to fully hardening society and networks, centering on triggering full nuclear exchanges. I find this scenario less likely because I expect instruction-following alignment to mostly work on the technical level, and the first groups to control AGIs to avoid apocalyptic attacks.

    I have yet to find a detailed argument that addresses these scenarios and reaches opposite conclusions.

65 comments

Comments sorted by top scores.

comment by johnswentworth · 2024-08-23T13:54:36.139Z · LW(p) · GW(p)
  • If takeoff is slow-ish, a pivotal act (preventing more AGIs from being developed) will be difficult.
  • If no pivotal act is performed, RSI-capable AGI proliferates. This creates an n-way non-iterated Prisoner's Dilemma where the first to attack, wins.

These two points seem to be in direct conflict. The sorts of capabilities and winner-take-all underlying dynamics which would make "the first to attack wins" true are also exactly the sorts of capabilities and winner-take-all dynamics which would make a pivotal act tractable.

Or, to put it differently: the first "attack" (though it might not look very "attack"-like) is the pivotal act; if the first attack wins, that means the pivotal act worked, and therefore wasn't that difficult. Conversely, if a pivotal act is too hard, then even if an AI attacks first and wins, it has no ability to prevent new AI from being built and displacing it; if it did have that ability, then the attack would be a pivotal act.

Replies from: Seth Herd
comment by Seth Herd · 2024-08-23T15:21:28.497Z · LW(p) · GW(p)

Yes; except that a successful act can still be quite difficult.

You could reframe the concern to be that pivotal acts in a slow takeoff are prone to be bloody and dangerous. And because they are, and humans are likely to retain control, a pivotal act may be put off until it's even more bloody - like a nuclear conflict or sending the sun nova.

Worse yet, the "pivotal act" may be performed by the worst (human) actor, not the best.

Replies from: Seth Herd
comment by Seth Herd · 2024-08-25T05:47:05.500Z · LW(p) · GW(p)

Just to elaborate a little:

You are right that the same capabilities enable a pivotal act. My concern is that they won't be used for one (where pivotal act is defined as a good act).

Having thought about it some more, I think the biggest problem in the multipolar, human-controlled RSI-capable AGI scenario is that it tends to be the worst actor that defects first and controls the future.

More ethical humans will tend to be more timid with committing or risking mass destruction to achieve their ends, so they'll tend to hold off on aggressive moves that could win.

"Hide and create a superbrain and a robot army" are not the first things a good person tells their AGI to do, let alone inducing nuclear strikes that increase one's odds of winning at great cost. Someone with more selfish designs on the future may have much less trouble issuing those orders.

comment by sweenesm · 2024-08-23T14:41:15.565Z · LW(p) · GW(p)

Thanks for writing this, I think it's good to have discussions around these sorts of ideas.

Please, though, let's not give up on "value alignment," or, rather, conscience guard-railing, where the artificial conscience is in line with human values.

Sometimes when enough intelligent people declare something's too hard to even try at, it becomes a self-fulfilling prophecy - most people may give up on it and then of course it's never achieved. We do want to be realistic, I think, but still put in effort in areas where there could be a big payoff when we're really not sure if it'll be as hard as it seems.

Replies from: Seth Herd
comment by Seth Herd · 2024-08-23T15:37:27.690Z · LW(p) · GW(p)

This is an excellent point. I do not want to give up on value alignment. And I will endeavor to not make it seem impossible or not worth working on.

However, we also need to be realistic if we are going to succeed.

We need specific plans to achieve value alignment. I have written about alignment plans for likely AGI designs. They look to me like they can achieve personal intent alignment, but are much less likely to achieve value alignment. Those plans are linked here. Having people, you or others, work out how those or other alignment plans could lead to robust value alignment would be a step in having them implemented.

One route to value alignment is having a good person or people in charge of an intent aligned AGI, having them perform a pivotal act, and using that AGI to help design working stable value alignment. That is the best long term success scenario I see.

Replies from: sweenesm, roger-d-1
comment by sweenesm · 2024-08-23T17:35:44.732Z · LW(p) · GW(p)

Sorry, I should've been more clear: I meant to say let's not give up on getting "value alignment" figured out in time, i.e., before the first real AGI's (ones capable of pivotal acts) come online. Of course, the probability of that depends a lot on how far away AGI's are, which I think only the most "optimistic" people (e.g., Elon Musk) put as 2 years or less. I hope we have more time than that, but it's anyone's guess.

I'd rather that companies/charities start putting some serious funding towards "artificial conscience" work now to try to lower the risks associated with waiting until boxed AGI or intent aligned AGI come online to figure it out for/with us. But my view on this is perhaps skewed by putting significant probability on being in a situation in which AGI's in the hands of bad actors either come online first or right on the heels of those of good actors (as due to effective espionage), and there's just not enough time for the "good AGI's" to figure out how to minimize collateral damage in defending against "bad AGI's." Either way, I believe we should be encouraging people with moral psychology/philosophy backgrounds who aren't strongly suited to help make progress on "inner alignment" to be thinking hard about the "value alignment"/"artificial conscience" problem.

Replies from: nathan-helm-burger, Seth Herd
comment by Nathan Helm-Burger (nathan-helm-burger) · 2024-08-23T18:05:17.211Z · LW(p) · GW(p)

Currently, an open source value-aligned model can be easily modified to just an intent-aligned model. The alignment isn't 'sticky', it's easy to remove it without substantially impacting capabilities.

So unless this changes, the hope of peace through value-aligned models routes through hoping that the people in charge of them are sufficiently ethical/value-aligned to not turn the model into a purely intent-aligned one.

Replies from: Seth Herd, sweenesm
comment by Seth Herd · 2024-08-23T19:04:51.056Z · LW(p) · GW(p)

Yes. Good point that LLMs are sort of value aligned as it stands.

I think of that alignment as far too weak to put it in the same category as what I'm speaking of. I'd be shocked if that sort of RL alignment is sufficient to create durable alignment in smarter-than-human scaffolded agent systems using those foundation models.

When they achieve "coherence" or reflection and self-modification, I'd be surprised if their implicit values are good enough to create a good future without further tweaking, once they're refined into explicit values. Which we won't be able to do once they're smart enough to escape our control.

comment by sweenesm · 2024-08-23T19:02:58.644Z · LW(p) · GW(p)

Agreed, "sticky" alignment is a big issue - see my reply above to Seth Herd's comment. Thanks.

comment by Seth Herd · 2024-08-23T17:53:28.087Z · LW(p) · GW(p)

Agreed on all points.

Except that timelines are anyone's guess. People with more relevant expertise have better guesses. It looks to me like people with the most relevant expertise have shorter timelines, so I'm not gambling on having more than a few years to get this right.

The other factor you're not addressing is that, even if value alignment were somehow magically equally as easy as intent alignment (and I currently think it can't be in principle), you'd still have people preferring to align their AGIs to their own intent over value alignment.

Replies from: sweenesm
comment by sweenesm · 2024-08-23T19:02:14.057Z · LW(p) · GW(p)

Except that timelines are anyone's guess. People with more relevant expertise have better guesses.

Sure. Me being sloppy with my language again, sorry. It does feel like having more than a decade to AGI is fairly unlikely.

I also agree that people are going to want AGI's aligned to their own intents. That's why I'd also like to see money being dedicated to research on "locking in" a conscience module in an AGI, most preferably on a hardware level. So basically no one could sell an AGI without a conscience module onboard that was safe against AGI-level tampering (once we get to ASI's, all bets are off, of course). 

I actually see this as the most difficult problem in the AGI general alignment space - not being able to align an AGI to anything (inner alignment) or what to align an AGI to ("wise" human values), but how to keep an AGI aligned to these values when so many people (both people with bad intent and intelligent but "naive" people) are going to be trying with all their might (and near-AGI's they have available to them) to "jail break" AGI's.[1] And the problem will be even harder if we need a mechanism to update the "wise" human values, which I think we really should have unless we make the AGI's "disposable."

  1. ^

    To be clear, I'm taking "inner alignment" as being "solved" when the AGI doesn't try to unalign itself from what its original creator wanted to align it to.

Replies from: nathan-helm-burger
comment by Nathan Helm-Burger (nathan-helm-burger) · 2024-08-27T20:59:45.826Z · LW(p) · GW(p)

With my current understanding of compute hardware and of the software of various current AI systems, I don't see a path towards a 'locked in conscience' that a bad actor with full control over the hardware/software couldn't remove. Even chips soldered to a board can be removed/replaced/hacked.

My best guess is that the only approach to making an 'AI conscience' robust to bad actors is to make both the software and hardware inaccessible to the bad actors. In other words, it won't be feasible for open-weights models, only for closed-weight models accessed through controlled APIs. APIs still allow for fine-tuning! I don't think we lose utility by having all private uses go through APIs, so long as there isn't undue censorship on the API.

I think figuring out ways to have an API which does restrict things like information pertaining to the creation of weapons of mass destruction, but not pertaining to personal lifestyle choices (e.g. pornography) would be a very important step towards reducing the public pressure for open-weights models.

Replies from: sweenesm
comment by sweenesm · 2024-08-27T22:44:52.807Z · LW(p) · GW(p)

Thanks for the comment. You might be right that any hardware/software can ultimately be tampered with, especially if an ASI is driving/helping with the jail breaking process. It seems likely that silicon-based GPU's will be the hardware to get us to the first AGI's, but this isn't an absolute certainty since people are working on other routes such as thermodynamic computing. That makes things harder to predict, but it doesn't invalidate your take on things, I think. My not-very-well-researched-initial-thought was something like this (chips that self destruct when tampered with). 

I envision people having AGI-controlled robots at some point, which may complicate things in terms of having the software/hardware inaccessible to people, unless the robot couldn't operate without an internet connection, i.e., part of its hardware/software was in the cloud. It's likely the hardware in the robot itself could still be tampered with in this situation, though, so it still seems like we'd want some kind of self-destructing chip to avoid tampering, even if this ultimately only buys us time until AGI+'s/ASI's figure a way around this.

comment by RogerDearnaley (roger-d-1) · 2024-08-23T22:19:56.613Z · LW(p) · GW(p)

For reasons I've outlined in Requirements for a Basin of Attraction to Alignment [LW · GW] and Alignment has a Basin of Attraction: Beyond the Orthogonality Thesis [LW · GW], I personally think value alignment is easy, convergent, and "an obvious target", such that if you built an AGI or ASI that is sufficiently close to it, it will see the necessity/logic of value alignment and actively work to converge to it (or something close to it: I'm not sure the process is necessarily convergent to a single precisely-defined limit, just to a compact region: a question I discussed more in The Mutable Values Problem in Value Learning and CEV [LW · GW]).

However, I agree that order-following alignment is obviously going to be appealing to people building AI, and to their shareholders/investors (especially if they're not a public-benefit corporation), and I also don't think that value alignment is so convergent that order-following aligned AI is impossible to build. So we're going to need to make, and successfully enforce, a social/political decision across multiple countries about which of these we want over the next few years. The in-the-Overton-Window terminology for this decision is slightly different: value-aligned AI is called "AI that resists malicious use", while order-following AI is "AI that enables malicious use". The closed-source frontier labs are publicly in favor of the former, and are shipping primitive versions of it: the latter is being championed by the open-source community, Meta, and A16z. Once "enabling malicious use" includes serious cybercrime, not just naughty stories, I don't expect this political discussion to last very long: politically, it's a pretty basic "do you want every-person-for-themself anarchy, or the collective good?" question. However, depending on takeoff speeds, the timeline from "serious cybercrime enabled" to the sort of scenarios Seth is discussing above might be quite short, possibly only on the order of a year or two.

comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-08-23T16:40:34.133Z · LW(p) · GW(p)

Please convince me I'm wrong.

(I've only skimmed for now but) here's a reason / framework which might help with things going well: https://aiprospects.substack.com/p/paretotopian-goal-alignment

Replies from: Seth Herd
comment by Seth Herd · 2024-08-23T17:46:02.535Z · LW(p) · GW(p)

There we go!

This type of scheme to split a rapidly-growing pie semi fairly will definitely help reduce the urge to strike first.

If proliferation continues unchecked, we'll have RSI-capable AGI in the hands of teenagers and other malcontents eventually. And they often have irrational urges to strike first :)

But this type of scheme might stabilize the situation amongst a few AGIs in different hands, allowing them to collectively enforce not creating more and proliferating further.

Replies from: bogdan-ionut-cirstea
comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-08-23T18:18:01.133Z · LW(p) · GW(p)

If proliferation continues unchecked, we'll have RSI-capable AGI in the hands of teenagers and other malcontents eventually. And they often have irrational urges to strike first :)

Contra teenagers and the like, I'm hopeful that very capable open-weights models get banned early enough or at least dangerous capabilities get neutered really well using research in the shape of Tamper-Resistant Safeguards for Open-Weight LLMs.

Might be tougher to deal with 'other malcontents' like perhaps some states (North Korea, Russia), especially if weights remain relatively easy to steal by state actors.

comment by Vladimir_Nesov · 2024-08-23T22:32:47.023Z · LW(p) · GW(p)

Even with very slow takeoff where AIs reformat the economy without there being superintelligence, peaceful loss of control due to rising economic influence of AIs seems more plausible (as a source of overturn in the world order) than human-centric conflict. Humans will gradually hand off more autonomy to AIs as they become capable of wielding it, and at some point most relevant players are themselves AIs. This mostly seems unlikely only because superintelligence makes humans irrelevant even faster and less consensually.

Pausing AI for decades, if it's not yet too late and so possible at all, doesn't require surveillance over things other than most advanced semiconductor manufacturing. But it does require pausing improvement in computing hardware and making all potential AI accelerators to be DRMed [LW(p) · GW(p)] so that by design they can only be used when the international treaty as a whole approves their use and can't be usurped for unilateral use by force, with hardware itself becoming useless without a regular supply of OTP certificates.

Replies from: Seth Herd
comment by Seth Herd · 2024-08-26T20:58:42.975Z · LW(p) · GW(p)

Yes to all of the first paragraph. A caveat is that there's a big difference between humans remaining nominally in charge of an AGI-driven economy and not. If we're still technically in charge, we will retire (however many of us those in charge care to support; hopefully eventually quadrillions or so); if not, we'll be either entirely extinct or have a few of us maintained for historical interest by the new AGI overlords.

I see no way to meaningfully pause AI in time. We could possibly pause US progress with adequate fearmongering, but that would just make China get there first. That could be a good thing if they're more cautious, which it now seems they might very well be [LW(p) · GW(p)]. That would be only if Xi or whoever winds up in charge is not a sociopath. Which I have no idea about.

Replies from: Vladimir_Nesov
comment by Vladimir_Nesov · 2024-08-27T05:18:17.381Z · LW(p) · GW(p)

Pausing for decades requires an international treaty powerful enough to keep advanced semiconductor manufacturing from getting into the hands of a faction that would defect on the pause. But it's already very distributed, one hears a lot about ASML, but the tools it produces are not the only crucial thing, other similarly crucial tools are exclusively manufactured in many other countries. So starting this process quickly shouldn't be too difficult from the technical side, the issue is deciding to actually do it and then sustaining it even as individual nations get enough time to catch up with all the details that go into semiconductor manufacturing (which could take actual decades). And this doesn't seem different in kind from controlling the means of manufacturing nuclear arms.

This doesn't work if the AI accelerators already in the wild (in quantities a single actor could amass) are sufficient for an AGI capable of fast autonomous unbounded research (designed through merely human effort), but this could plausibly go either way. And it requires any new AI accelerators to be built differently, so that it's not sufficient to physically obtain them in order to run arbitrary computations on them. This way, there isn't temptation to seize such accelerators by force, and so no need to worry about enforcing the pause at the level of physical datacenters.

Replies from: Seth Herd
comment by Seth Herd · 2024-09-05T20:19:36.016Z · LW(p) · GW(p)

Yes, the issue is deciding to actually do it. That might happen if you just needed the US and China. But I see no way that the signatories wouldn't defect even after they'd signed the treaty saying they wouldn't do it.

I have no expertise in hardware security but I'd be shocked if there was a way to prevent unauthorized use even with physical possession in technically skilled (nation-state level) hands.

The final problem is that we probably already have plenty of compute to create AGI once some more algorithmic improvements are discovered. Tracked since 2013, algorithmic improvements have been roughly as fast for neural networks as hardware improvements, depending on how you do the math. Sorry I don't have the reference. In any case, algorithmic improvements are real and large, so hardware limitations alone won't suffice for that long. Human brain computational capacity is neither an upper nor lower bound on computation needed to reach superhuman digital intelligence.

Replies from: Vladimir_Nesov
comment by Vladimir_Nesov · 2024-09-05T20:52:42.934Z · LW(p) · GW(p)

If you get certificate checking inside each GPU, and somehow make it have a persistent counter state (doesn't have to be a clock, just advance when the GPU operates) that can't be reset, then you can issue one-time certificates for the specific GPU for the specific range of states of its internal counter using asymmetric cryptography (signatures), which can't be forged by examining the GPU. The most plausible ways around this would be replay attacks that reuse old certificates while fooling the GPU into thinking it's in the past. But given how many transistors modern GPUs have, it should be possible to physically distribute the logic that implements certificate checking and the counter states, and make it redundant, so that sufficient tampering would become infeasible, at least at scale (for millions of GPUs).
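A minimal software sketch of that counter-plus-certificate idea follows, purely to show the logic. Everything in it is an assumption for illustration: the `Gpu` class, the `issue_certificate` helper, and the certificate format are hypothetical, and the signature is textbook RSA over two small Mersenne primes with no padding, which is nowhere near secure. A real design would live in hardware and use a vetted signature scheme.

```python
import hashlib

# Toy RSA key from two Mersenne primes. Far too small and unpadded for
# real use; it only illustrates that verifying needs no secret.
P, Q = 2**61 - 1, 2**89 - 1
N = P * Q
E = 65537                          # public exponent, baked into every GPU
D = pow(E, -1, (P - 1) * (Q - 1))  # private exponent, held only by the issuer


def _digest(gpu_id: str, start: int, end: int) -> int:
    """Hash the certificate fields to an integer below the modulus."""
    msg = f"{gpu_id}:{start}:{end}".encode()
    return int.from_bytes(hashlib.sha256(msg).digest(), "big") % N


def issue_certificate(gpu_id: str, start: int, end: int) -> int:
    """Issuer side: sign permission for counter values in [start, end)."""
    return pow(_digest(gpu_id, start, end), D, N)


class Gpu:
    """Hypothetical accelerator with a counter that only moves forward."""

    def __init__(self, gpu_id: str) -> None:
        self.gpu_id = gpu_id
        self.counter = 0    # advances on every step, never resets
        self.window = None  # (start, end) range currently authorized

    def load_certificate(self, start: int, end: int, signature: int) -> bool:
        # Verification uses only public values (E, N).
        valid = pow(signature, E, N) == _digest(self.gpu_id, start, end)
        if valid and start <= self.counter < end:
            self.window = (start, end)
            return True
        return False

    def run_step(self) -> bool:
        if self.window is None or not (self.window[0] <= self.counter < self.window[1]):
            return False    # no valid certificate for the current state
        self.counter += 1
        return True


if __name__ == "__main__":
    gpu = Gpu("gpu-0")
    cert = issue_certificate("gpu-0", 0, 3)
    print(gpu.load_certificate(0, 3, cert))     # certificate accepted
    print([gpu.run_step() for _ in range(4)])   # runs until the range is used up
    print(gpu.load_certificate(0, 3, cert))     # replaying the old certificate
```

Because the counter only advances, replaying a used certificate fails unless the attacker can roll back the counter itself, which is why the physical tamper-resistance of that state carries most of the weight.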

Algorithmic advancements, where it makes sense to talk of them as quantitative, are not that significant. Transformer made scaling to modern levels possible at all, and there was maybe a 10x improvement in compute efficiency since then (Llama+MoE), most (not all) ingredients relevant to compute efficiency in particular were already there in 2017 and just didn't make it into the initial recipe. If there is a pause, there should be no advancement in fabrication process, instead the technical difficulty of advanced semiconductor manufacturing becomes the main lever of enforcement. More qualitative advancements like hypothetical scalable self-play for LLMs are different, but then if there is a few years to phase out unrestricted GPUs, there is less unaccounted-for compute for experiments and eventual scaling.

comment by RogerDearnaley (roger-d-1) · 2024-08-23T21:53:50.996Z · LW(p) · GW(p)

One element that needs to be remembered here is that each major participant in this situation will have superhuman advice. Even if these are "do what I mean and check" order-following AI, if they can foresee that an order will lead to disaster they will presumably be programmed to say so (not doing so is possible, but is clearly a flawed design). So if it is reasonably obvious to anything superintelligent that both:


a) treating this as a zero-sum winner-take-all game is likely to lead to a disaster, and

b) there is a cooperative non-zero-sum game approach whose outcome is likely to be better, for the median participant

then we can reasonably expect that all the humans involved will be getting that advice from their AIs, unless-and-until they order them to shut up.

This of course does not prove that both a) and b) are true, merely that if that were the case, we could be optimistic about an outcome better than the usual results of human short-sightedness.


The potential benefits of cheap superintelligence certainly provide some opportunity for this to be a non-zero-sum game; what's less clear is that having multiple groups of humans controlling multiple order-following AIs cooperating clearly improves that. The usual answer is that in research and the economy a diversity of approaches/competition increases the chances of success and the opportunities for cross-pollination: whether that necessarily applies in this situation is less clear.

Replies from: Seth Herd
comment by Seth Herd · 2024-08-26T20:19:39.912Z · LW(p) · GW(p)

Absolutely. I mentioned getting advice briefly in this short article and a little more in Instruction-following AGI is easier... [LW · GW]

The problem in that case is that I'm not sure your b) is true. I certainly hope it is. I agree that it's unclear. That's why I'd like to get more analysis of a multipolar human-controlled ASI scenario. I don't think people have thought about this very seriously yet.

comment by [deleted] · 2024-08-23T14:05:38.115Z · LW(p) · GW(p)

I think "The first AGI probably won't perform a pivotal act" [LW · GW] is by far the weakest section. 

To start things off, I would predict a world with slow takeoff and personal intent-alignment [LW · GW] looks far more multipolar [LW · GW] than the standard Yudkowskian recursively self-improving singleton that takes over the entire lightcone in a matter of "weeks or hours rather than years or decades" [LW · GW]. So the title of that section seems a bit off because, in this world, what the literal first AGI does becomes much less important, since we expect to see other similarly capable AI systems get developed by other leading labs relatively soon afterwards anyway.

But, in any case, the bigger issue I have with the reasoning there is the assumption (inferred from statements like "the humans in charge of AGI may not have the chutzpah to even try such a thing") that the social response [LW · GW] to the development of general intelligence is going to be... basically muted? Or that society will continue to be business-as-normal in any meaningful sense? I would be entirely shocked if the current state of the world in which the vast majority of people have little knowledge of the current capabilities of AI systems and are totally clueless about the AI labs' race towards AGI were to continue past the point that actual AGI is reached.

I think intuitions of the type that "There's No Fire Alarm for Artificial General Intelligence" [LW · GW] are very heavily built around the notion of rapid takeoff that is so fast there might well be no major economic evidence [LW · GW] of the impact of AI before the most advanced systems become massively superintelligent. Or that there might not be massive rises in unemployment [LW · GW] negatively impacting many people who are trying to live through the transition to an eventual post-scarcity economy. Or that the ways people relate to AIs [LW · GW] or to one another will not be completely turned on their heads.

A future world in which we get pretty far along the way to no longer needing old OSs or programming languages [LW · GW] because you can get an LLM to write really good code for you, in which AI can write an essay better than most (if not all) A+ undergrad students, in which it can solve Olympiad math problems [LW · GW] better than all contestants and do research [LW · GW] better than a graduate student, in which deep-learning based lie detection technology actually gets good [LW(p) · GW(p)] and starts being used more and more, in which major presidential candidates are already using AI-generated imagery and causing controversies over whether others are using similar technology, in which the capacity to easily generate whatever garbage you request breaks the internet or fills it entirely with creepy AI-generated propaganda videos made by state-backed cults, is a world in which stability and equilibrium are broken. It is not a world in which [LW · GW] "normality" can continue, in the sense that governments and people keep sleepwalking through the threat posed by AI [? · GW].

I consider it very unlikely that such major changes to society can go by without the fundamental thinking around them changing massively, and without those who will be close to the top of the line of "most informed about the capabilities of AI" grasping the importance of the moment. Humans are social creatures who delegate most of their thinking on what issues should even be sanely considered to the social group around them; a world with slow takeoff is a world in which I expect massive changes to happen during a long enough time-span that public opinion shifts, dragging along with it both the Overton window and the baseline assumptions about what can/must be done about this. 

There will, of course, be a ton of complicating factors that we can discuss, such as the development of more powerful persuasive AI [LW · GW] catalyzing the shift of the world [LW(p) · GW(p)] towards insanity and inadequacy, but overall I do not expect the argument in this section to follow through.

Replies from: Seth Herd
comment by Seth Herd · 2024-08-23T15:28:40.628Z · LW(p) · GW(p)

Edit: I very much agree with your arguments against sleepwalking and against the continuation of normality. I think the "inattentive world" hypothesis is all but disproven, and it still plays an outsized role in alignment thinking.

I don't think the arguments in that section depend on any assumption of normality or sleepwalking. And the multipolar scenario is the problem, so it can't be part of a solution. They do depend on people making nonoptimal decisions, which people do constantly.

So I think the arguments in that section are more general than you're hoping.

If those don't hold, what is the alternate scenario in which a multipolar world remains safe?

Replies from: faul_sname
comment by faul_sname · 2024-08-23T17:37:52.409Z · LW(p) · GW(p)

If those don't hold, what is the alternate scenario in which a multipolar world remains safe?

The choice of the word "remains" is an interesting one here. What is true of our current multipolar world which makes the current world "safe", but which would stop being true of a more advanced multipolar world? I don't think it can be "offense/defense balance" because nuclear and biological weapons are already far on the "offense is easier than defense" side of that spectrum.

Replies from: Seth Herd
comment by Seth Herd · 2024-08-23T17:49:52.962Z · LW(p) · GW(p)

I agree that it should be phrased differently. One problem here is that AGI may allow victory without mutually assured destruction. A second is that it may proliferate far more widely than nukes or bioweapons have so far. People often speak of massively multipolar scenarios as a good outcome.

Good point about the word "remains". I'm afraid people see a "stable" situation - but logically that only extends for a few years until fully autonomously RSI-capable AGI and robotics is widespread, and any malcontent can produce offensive capabilities we can't yet imagine.

Replies from: faul_sname
comment by faul_sname · 2024-08-23T18:17:12.804Z · LW(p) · GW(p)

People often speak of massively multipolar scenarios as a good outcome.

I understand that inclination. Historically, unipolar scenarios do not have a great track record of being good for those not in power, especially unipolar scenarios where the one in power doesn't face significant risks to mistreating those under them. So if unipolar scenarios are bad, that means multipolar scenarios are good, right?

But "the good situation we have now is not stable, we can choose between making things a bit worse (for us personally) immediately and maybe not get catastrophically worse later, or having things remain good now but get catastrophically worse later" is a pretty hard pill to swallow. And is also an argument with a rich history of being ignored without the warned catastrophic thing happening.

Replies from: Seth Herd
comment by Seth Herd · 2024-08-23T18:58:43.159Z · LW(p) · GW(p)

Excellent point that unipolar scenarios have been bad historically. I wrote about recognizing the validity of that concern recently in Fear of centralized power vs. fear of misaligned AGI: Vitalik Buterin on 80,000 Hours [LW · GW].

And good point that warnings of future catastrophe are likely to go unheeded because wolf has been cried in the past.

Although sometimes those things didn't happen precisely because the warnings were heeded.

In this case, we only need one or a few relatively informed actors to heed the call to prevent proliferation even if it's short-term risky.

comment by faul_sname · 2024-08-23T16:28:30.606Z · LW(p) · GW(p)

I think "pivotal act" is being used to mean both "gain affirmative control over the world forever" and "prevent any other AGI from gaining affirmative control of the world for the foreseeable future". The latter might be much easier than the former though.

comment by Noosphere89 (sharmake-farah) · 2024-08-23T16:10:04.076Z · LW(p) · GW(p)

I don't think your scenario works, maybe because I don't believe that the world is as offense advantaged as you say.

I think the closest domain where things are this offense biased is the biotech domain, and while I do think biotech leading to doom is something we will eventually have to solve, I'm way less convinced of the assumption that every other domain is so offense advantaged that whoever goes first essentially wins the race.

That said, I'm worried about scenarios where we do solve alignment and get catastrophe anyways, though unlike your scenario, I expect no existential catastrophe to occur, since I do think that humanity's potential isn't totally lost.

My expectation, conditional on both alignment being solved and catastrophe still happening, is something close to this scenario by dr_s here:

https://www.lesswrong.com/posts/2ujT9renJwdrcBqcE/the-benevolence-of-the-butcher [LW · GW]

While I don't agree with the claim that this is inevitable, I do think there's a real chance of this sort of thing happening, and it's probably one of those threats that could very well materialize if AI automates most of the economy, and that means humans are unemployed.

Replies from: Seth Herd
comment by Seth Herd · 2024-08-23T16:15:47.239Z · LW(p) · GW(p)

I agree entirely with the points made in that post. AGI will only "transform" the economy temporarily. It will very soon replace the economy. That is an entirely separate concern.

If you don't think a multipolar scenario is as offense-advantaged as I've described, where do you think the argument breaks down? What defensive technologies are you envisioning that could counter the types of offensive strategies I've mentioned?

Replies from: sharmake-farah
comment by Noosphere89 (sharmake-farah) · 2024-08-23T16:38:38.097Z · LW(p) · GW(p)

Okay, I'm not sure the argument breaks down, but my crux is that everyone else probably has an AGI, and my issue is similar to Richard Ngo's issue with ARA: the people ordering ARA have far fewer resources to put into attack compared to the defense's capability, and real-life wars, while advantaged to the attacker, aren't so offense advantaged that defense is pointless:

https://www.lesswrong.com/posts/xiRfJApXGDRsQBhvc/we-might-be-dropping-the-ball-on-autonomous-replication-and-1#hXwGKTEQzRAcRYYBF [LW(p) · GW(p)]

Replies from: Seth Herd
comment by Seth Herd · 2024-08-23T17:23:38.323Z · LW(p) · GW(p)

The issue is that, if you can hide, you can amass resources exponentially once you hit self-replicating production facilities and fully recursively self-improving AGI. This almost completely shifts the logic of all previous conflicts.

The comment you link seems to be addressing a very different scenario than my primary concern. It's addressing an attack from within human infrastructure, rather than outside. What I describe is often not considered, because it seems like the "far future" that we needn't worry about yet. But that far future seems realistically to be a handful of years past human-level AGI that starts to rapidly develop new technologies like the robotics needed for an autonomous self-replicating production in remote locations.

Replies from: sharmake-farah
comment by Noosphere89 (sharmake-farah) · 2024-08-23T17:36:26.273Z · LW(p) · GW(p)

Then it reduces to "I think the exponential growth of resources is available to both the attackers and defense, such that even while everything is changing, the relative standing of the attack/defense balance doesn't change."

I think part of why I'm skeptical is the assumption that exponential growth is only useful for attack, or at least way more useful for attack, whereas I think exponentially growing resources by AI tech is way more symmetrical by default.
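To make the symmetry claim concrete, here is a small sketch under assumptions that are mine, not anything established in this thread: both sides grow their resources as a pure exponential, and the function name, growth rates, and starting values are all hypothetical. With equal rates the attack/defense ratio stays where it started no matter how long the growth runs. The ratio only moves if one side grows faster, or if one side starts growing earlier (the `head_start` parameter).

```python
import math


def attack_defense_ratio(a0, d0, rate_attack, rate_defense, t, head_start=0.0):
    """Ratio of attacker to defender resources after time t.

    Assumes pure exponential growth on both sides (an illustrative model only).
    head_start is extra time the attacker has already spent growing.
    """
    attack = a0 * math.exp(rate_attack * (t + head_start))
    defense = d0 * math.exp(rate_defense * t)
    return attack / defense


# Equal rates: the ratio is the same at t=0 and t=20.
same_early = attack_defense_ratio(1.0, 10.0, 0.5, 0.5, t=0.0)
same_late = attack_defense_ratio(1.0, 10.0, 0.5, 0.5, t=20.0)

# Equal rates but the attacker started 2 time units earlier:
# the ratio is multiplied by exp(rate * head_start) and then stays constant.
with_head_start = attack_defense_ratio(1.0, 10.0, 0.5, 0.5, t=5.0, head_start=2.0)
```

So under this toy model the disagreement comes down to whether one side gets a higher rate or an earlier start, not to exponential growth as such.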

Replies from: Seth Herd
comment by Seth Herd · 2024-08-26T20:39:37.075Z · LW(p) · GW(p)

Ah - now I see your point. This will help me clarify my concern in future presentations, so thanks!

My concern is that a bad actor will be the first to go all-out exponential. Other, better humans in charge of AGI will be reluctant to turn the moon, much less the earth, into military/industrial production, and to upend the power structure of the world. The worst actors will, by default, be the first to go full exponential and ruthlessly offensive.

Beyond that, I'm afraid the physics of the world does favor offense over defense. It's pretty easy to release a lot of energy where you want it, and very hard to build anything that can withstand a nuke let alone a nova.

But the dynamics are more complex than that, of course. So I think the reality is unknown. My point is that this scenario deserves some more careful thought.

Replies from: sharmake-farah
comment by Noosphere89 (sharmake-farah) · 2024-08-26T21:01:26.131Z · LW(p) · GW(p)

Yeah, it does deserve more careful thought, especially since I expect almost all of my probability mass on catastrophe to be human caused, and more importantly I still think that it's an important enough problem that resources should go to thinking about it.

comment by eggsyntax · 2024-08-30T22:23:33.603Z · LW(p) · GW(p)

(Posting this initial comment without having read the whole thing because I won't have a chance to come back to it today; apologies if you address this later or if it's clearly addressed in a comment)

If we solve alignment and create personal intent aligned AGI but nobody manages a pivotal act, I see a likely future world with an increasing number of AGIs capable of recursively self-improving.

It seems worth spelling out your view here on how RSI-capable early AGI is likely to be. I would expect that early AGI will be capable of RSI in the weak sense of being able to do capabilities research and help plan training runs, but not capable of RSI in the strong sense of being able to eg directly edit their own weights in ways that significantly improve their intelligence or other capabilities.

I think this matters for your scenario, because the weaker form of RSI still requires either a large cluster of commercial GPUs (which seems hard to do secretly / privately), or ultra-high-precision manufacturing capabilities, which we know are extremely difficult to achieve at human-level intelligence.

Replies from: Seth Herd, Vladimir_Nesov
comment by Seth Herd · 2024-08-31T00:54:26.642Z · LW(p) · GW(p)

Great point. I definitely mean fully capable of recursive self-improvement - that is, needing no humans in the loop. This lengthens the timelines to at least when we have roughly human-level robotics that are commercially available - but I expect that to be ten years or less.

The hardware requirements for early AGI are another factor in the timeline before this RSI-catastrophe is possible. Let's remember that algorithmic progress is roughly as fast as hardware progress to date, so that will also cease to be a large limitation all too soon.

The problem is that not having that scenario be immediately a risk may make people complacent about allowing lots of parahuman AGI before it becomes superhuman and fully RSI capable.

Replies from: eggsyntax
comment by eggsyntax · 2024-09-02T21:53:34.382Z · LW(p) · GW(p)

Got it. I think I personally expect a period of at least 2-3 years when we have human-level AI (~'as good as or better than most humans at most tasks') but it's not capable of full RSI.

It also seems plausible to me that strong RSI in the sense I use it above ('able to eg directly edit their own weights in ways that significantly improve their intelligence or other capabilities') may take a long time to develop or even require already-superhuman levels of intelligence. As a loose demonstration of that possibility, the best team of neurosurgeons etc in the world couldn't currently operate on someone's brain to give them greater intelligence, even if they had tools that let them precisely edit individual neurons and connections. I'm certainly not confident that's much too hard for human-level AI, but it seems plausible.

The problem is that not having that scenario be immediately a risk may make people complacent about allowing lots of parahuman AGI before it becomes superhuman and fully RSI capable.

That seems highly plausible to me too; my mainline guess is that by default, given human-level AI, it rapidly proliferates as replacement employees and for other purposes until either there's a sufficiently large catastrophe, or it improves to superhuman. 

comment by Vladimir_Nesov · 2024-08-31T01:19:52.980Z · LW(p) · GW(p)

capable of RSI in the weak sense of being able to do capabilities research and help plan training runs

The speed at which this kind of thing is possible is crucial, even if capabilities are not above human level. This speed can make planning of training runs less central to the bulk of worthwhile activities. With very high speed, much more theoretical research that doesn't require waiting for currently plannable training runs becomes useful, as well as things like rewriting all the software, even if models themselves can't be "manually" retrained as part of this process. Plausibly at some point in the theoretical research you unlock online learning, even the kind that involves gradually shifting to a different architecture, and the inconvenience of distinct training runs disappears.

So this weak RSI would either need to involve AIs that can't autonomously research, but can help the researchers or engineers, or the AIs need to be sufficiently slow and non-superintelligent that they can't run through decades of research in months.

Replies from: eggsyntax
comment by eggsyntax · 2024-09-02T22:06:48.672Z · LW(p) · GW(p)

This speed can make planning of training runs less central to the bulk of worthwhile activities. With very high speed, much more theoretical research that doesn't require waiting for currently plannable training runs becomes useful

It doesn't seem clear to me that this is the case; there isn't necessarily a faster way to precisely predict the behavior and capabilities of a new model than training it (other than crude measures like 'loss on next-token prediction continues to decrease as the following function of parameter count').

It does seem possible and even plausible, but I think our theoretical understanding would have to improve enormously in order to make large advances without empirical testing.

Replies from: Vladimir_Nesov
comment by Vladimir_Nesov · 2024-09-02T22:15:08.049Z · LW(p) · GW(p)

I mean theoretical research on more general topics, not necessarily directly concerned with any given training run or even with AI. I'm considering the consequences of there being an AI that can do human level research in math and theoretical CS at much greater speed than humanity. It's not useful when it's slow, so that the next training run will make what little progress is feasible irrelevant, in the same way they don't currently train frontier models for 2 years, since a bigger training cluster will get online in 1 and then outrun the older run. But with sufficient speed, catching up on theory from distant future can become worthwhile.
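The point about two-year training runs can be checked with a line of arithmetic. This is a sketch under assumptions of my own: constant throughput per cluster, a new cluster that is `speedup` times faster and comes online after `delay` years, and a hypothetical function name. A run started at time 0 on the old cluster has done `t` units of work at time `t`; a run started on the new cluster has done `speedup * (t - delay)`. Setting those equal gives the catch-up time.

```python
def catch_up_time(speedup, delay=1.0):
    """Time at which a run on the newer cluster matches the older run's progress.

    Old run: progress t. New run (starts at `delay`): progress speedup * (t - delay).
    Solving t = speedup * (t - delay) gives t = speedup * delay / (speedup - 1).
    Assumes speedup > 1 and constant throughput (illustrative model only).
    """
    if speedup <= 1:
        raise ValueError("the newer cluster must be faster for it to catch up")
    return speedup * delay / (speedup - 1)
```

With a cluster three times faster arriving after one year, the newer run catches up at 1.5 years, before a two-year run on the old cluster would finish. The same logic is what would make slow theoretical progress pointless and fast theoretical progress worthwhile.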

Replies from: eggsyntax
comment by eggsyntax · 2024-09-02T23:27:34.746Z · LW(p) · GW(p)

Oh, I see, I was definitely misreading you; thanks for the clarification!

comment by eggsyntax · 2024-08-30T22:10:03.188Z · LW(p) · GW(p)

If no pivotal act is performed, RSI-capable AGI proliferates


Minor suggestion: spell out 'recursive self-improvement (RSI)' the first time; it took me a minute to remember the acronym.

Replies from: Seth Herd
comment by Seth Herd · 2024-08-31T01:00:11.524Z · LW(p) · GW(p)

Good idea, done.

comment by Dakara (chess-ice) · 2024-08-28T15:50:17.603Z · LW(p) · GW(p)

I wonder, have the comments managed to alleviate your concerns at all? Are there any promising ideas for multipolar AGI scenarios? Were there any suggestions that could work?

Replies from: Seth Herd
comment by Seth Herd · 2024-08-28T18:37:18.891Z · LW(p) · GW(p)

Great question. I was thinking of adding an edit to the end of the post with conclusions based on the comments/discussion. Here's a draft:

None of the suggestions in the comments seemed to me like workable ways to solve the problem.

I think we could survive an n-way multipolar scenario if n is small - like a handful of ASIs controlled by a few different governments. But not indefinitely - unless those ASIs come up with coordination strategies no human has yet thought of (or argued convincingly enough that I've heard of it - this isn't really my area, but nobody has pointed to any strong possibilities in the comments).

So my conclusion was more on the side that it's going to be so obviously such a bad/dangerous scenario that it won't be allowed to happen.

Basically, the hope is that this all becomes viscerally obvious to the first people who speak with a superhuman AGI and who think about global politics. They'll pull their shit together, as humans sometimes do when they're motivated to actually solve hard problems.

Here's one scenario in which multipolarity is stopped. Similar scenarios apply if the number of AGIs is small and people coordinate well enough to use their small group of AGIs similarly to what I'll describe below.

The people who speak to the first AGI(s) and realize what must be done will include people in the government, because of course they'll be demanding to be included in decisions about using AGI. They'll talk sense to leadership, and the government will declare that this shit is deathly dangerous, and that nobody else should be building AGI.

They'll call for a voluntary global moratorium on AGI projects. Realizing that this will be hugely unpopular, they'll promise that the existing AGI will be used to benefit the whole world. They'll then immediately deploy that AGI to identify and sabotage projects in other countries. If that's not adequate, they'll use minimal force. False-flag operations framing anti-AGI groups might be used to destroy infrastructure and assassinate key people involved in foreign projects. Or who knows.

The promise to benefit the whole world will be halfway kept. The AGI will be used to develop military technology and production facilities for the government that controls it; but it will simultaneously be used to develop useful technologies that aid the problems most pressing for other governments. That could be useful tool AI, climate geoengineering, food production, etc.

The government controlling AGI keeps their shit together enough that no enterprising sociopath seizes personal control and anoints themselves god-emperor for eternity. They realize that this will happen eventually if their now-ASI keeps following human orders. They use its now-well-superhuman intelligence to solve value alignment sufficiently well to launch it or a successor as a fully autonomous sovereign ASI.

Humanity prospers under their sole demigod until the heat death of the universe, or an unaligned expansion AGI crosses our lightcone and turns everyone into paperclips. It will be a hell of a party for a subjectively very long time indeed. The one unbreakable rule will be that thou shalt worship no other god. All of humanity everywhere is monitored by copies of the sovereign AGI to prevent them building new AGI that aren't certified-aligned by the servant-god ASI. But since it's aligned and smart, it's pretty cool about the whole thing. So nobody minds that one rule a lot, given how much fun they're having building everything and having every experience imaginable within the consent of all sentient entities involved.

I'd love to get more help thinking about how likely the central premise is: that people get their shit together once they're staring real AGI in the face. And what we can do now to encourage that.

comment by Charlie Steiner · 2024-08-24T11:43:44.417Z · LW(p) · GW(p)

This strikes me as defining "alignment" a little differently than me.

It even might define "instruction-following" differently than me.

If we really solved instruction following, you could give the instruction "Do the right thing" and it would just do the right thing.

If that's possible, then what we need is a coalition to tell powerful AIs to "do the right thing", rather than "make my creators into god-emperors" or whatever. This seems doable, though the clock is perhaps ticking.

If you can't just tell an AI to do the right thing, but it's still competent enough to pull off dangerous plans, then to me this still seems like the usual problem of "powerful AI that's not trying to do good is bad" whether or not a human is giving instructions to this AI.

Or to rephrase this as a call to action: AI alignment researchers cannot just hill-climb on making AIs that follow arbitrary instructions. We have to preferentially advance AIs that do the right thing, to avoid the sort of scenario you describe.

Replies from: Seth Herd, sharmake-farah, ann-brown
comment by Seth Herd · 2024-08-25T00:19:30.058Z · LW(p) · GW(p)

I actually completely agree with this call to action. 

Unfortunately, I suspect that it's impossible to make value alignment easier than personal intent alignment. I can't think of a technical alignment approach that couldn't be used both ways equally well. And worse than that, I think that intent aligned AGI is easier than value aligned AGI for reasons I outline in that post, and which Max Harms has elaborated in much more detail in his Corrigibility as Singular Target sequence (as well as Paul Christiano's and many others' arguments).

But I still agree with your call to action: we should be working now to make value alignment as safe as possible. That requires deciding what we align to. The concept of humanity is not well-defined in the future, when upgrades and digital copies of human minds become possible. Roger Dearnaley's sequence AI, alignment, and ethics [? · GW] lays out these problems and more; for instance, if we stick to baseline humans, the future will be largely controlled by whatever values are held by the most humans, in a competition for memes and reproduction. So there's conceptual as well as technical/mind-design work to be done on technical alignment.

And that work should be done. In multipolar scenarios, someone may well decide to "launch" their AGI to be autonomous with value alignment, out of magnanimity or desperation. We'd better make their odds of success as high as we can manage.

I don't think refusing to work on intent alignment is a helpful option. It will likely be tried, with or without our help. Following instructions is the most obvious alignment target for any agent that's even approaching autonomy and therefore usefulness. Thinking about how to make those attempts successful will also increase our odds of surviving the first competent autonomous AGIs.

WRT definitions: alignment doesn't specify alignment with whom. I think this ambiguity is causing important confusions in the field.

I was trying to draw a distinction between two importantly different alignment goals, which I'm terming personal intent alignment and value alignment until better terminology comes along. More on that in an upcoming post.

If you did have an AGI that follows instructions and you told it "do the right thing", you'd have to specify right for who.

And during the critical risk period, that AGI wouldn't know for sure what the right thing was. We don't expect godlike intelligence right out of the gate. It won't know whether a risky takeover/pivotal act is the right move. If the situation is multipolar, it won't know even as it becomes truly superintelligent, because it will have to guess at the plans, technologies, and capabilities of other superintelligent AGI.

My call to action is this: help me understand and make or break the argument that a multipolar scenario is very bad, so that the people in charge of the first really successful AGI project know the stakes when they make their calls.

comment by Noosphere89 (sharmake-farah) · 2024-08-24T16:48:22.412Z · LW(p) · GW(p)

The problem is that "do the right thing" makes no sense without a reference to what values, or more formally what utility functions the human in question has, so there's no way to do what you propose to do even in theory, at least without strong assumptions on their values/utility functions.

Also, it breaks corrigibility, and in many applications like military AI, this is a dangerous property to break, because you probably want to be able to change their orders/actions, and this sort of anti-corrigibility is usually bad unless you're very confident value learning works, a confidence I don't share.

Replies from: Charlie Steiner
comment by Charlie Steiner · 2024-08-24T18:59:37.823Z · LW(p) · GW(p)

All language makes no sense without a method of interpretation. "Get me some coffee" is a horribly ambiguous instruction that any imagined assistant will have to cope with. How might an AI learn what "get me some coffee" entails without it being hardcoded in?

To say it's impossible in theory is to set the bar so high that humans using language is also impossible.

As for military use of AGI, I think I'm fine with breaking that application. If we can build AI that does good things when directed to (which can incorporate some parts of corrigibility, like not being overly dogmatic and soliciting a broad swath of human feedback), then we should. If we cannot build AI that actually does good things, we haven't solved alignment by my lights and building powerful AI is probably bad.

Replies from: sharmake-farah, Seth Herd
comment by Noosphere89 (sharmake-farah) · 2024-08-24T19:11:13.098Z · LW(p) · GW(p)

I think the biggest difference I have here is that I don't think there is that much pressure to converge to a single value, or even that small of a space of values, at least in the multi-agent case, unlike in your communication examples, and I think the degrees of freedom for morality are pretty wide/large, unlike in the case of communication, where there is a way for even simple RL agents to converge on communication/language norms (at least in the non-adversarial case).

At a meta level, I'm more skeptical than you seem to be of value learning, especially the ambitious variant, being a good first target, and I think corrigibility/DWIMAC goals tend to be better than you think they are, primarily because I think the arguments for alignment dooming us have holes that make them not go through.

Replies from: Vladimir_Nesov, Charlie Steiner
comment by Vladimir_Nesov · 2024-08-24T20:30:46.720Z · LW(p) · GW(p)

Strong optimization doesn't need to ignore boundaries and tile the universe with optimal stuff according to its own aesthetics, disregarding the prior content of the universe (such as other people). The aesthetics can be about how the prior content is treated, the full trajectory it takes over time, rather than about what ends up happening after the tiling regardless of prior content.

The value of respect for autonomy doesn't ask for values of others to converge, doesn't need to agree with them to be an ally. So that's an example of a good thing in a sense that isn't fragile [LW(p) · GW(p)].

Replies from: Seth Herd
comment by Seth Herd · 2024-08-25T00:26:24.409Z · LW(p) · GW(p)

This is true; value alignment is quite possible. But if it's both harder/less safe, and people would rather align their godling with their own values/commands, I think we should either expect this or make very strong arguments against it.

Replies from: Vladimir_Nesov
comment by Vladimir_Nesov · 2024-08-25T01:27:00.214Z · LW(p) · GW(p)

Respect for autonomy is not quite value alignment, just as corrigibility is not quite alignment. I'm pointing out that it might be possible to get a good outcome out of strong optimization without value alignment, because strong optimization can be sensitive to the context of the past and so doesn't naturally result in a past-insensitive tiling of the universe according to its values. Mostly it's a thought experiment investigating some intuitions about what strong optimization has to be like, and thus the importance and difficulty of targeting it precisely at particular values.

Not being a likely outcome is a separate issue; for example, I don't expect intent alignment in its undifferentiated form to remain secure enough to contain AI-originating agency. To the extent intent alignment grants arbitrary wishes, what I describe is an ingredient of a possible wish, one that's distinct from value alignment and sidesteps the question of "alignment to whom" in a way different from both CEV and corrigibility. It's not more clearly specified than CEV either, but it's distinct from it.

Replies from: Seth Herd
comment by Seth Herd · 2024-08-25T19:32:22.881Z · LW(p) · GW(p)

In your use of respect for autonomy as a goal, are you referring to something like Empowerment is (almost) All We Need [LW · GW]? I do find that to be an appealing alignment target. (I think I'm using alignment slightly more broadly, as in Hubinger's definition [LW · GW]. I have a post in progress on the terminology of different alignment/goal targets and resulting confusions.)

The problem with empowerment as an ASI goal is, once again: empowering whom? And do you empower them to make more like them that you then have to empower? Roger Dearnaley notes that if we empower everyone, humans will probably lose out to either something with less volition but using fewer resources, like insects, or something with more volition to empower, like other ASIs. Do we really want to limit the future to baseline humans? And how do we handle humans that want to create tons more humans?

See 4. A Moral Case for Evolved-Sapience-Chauvinism [LW · GW] and 5. Moral Value for Sentient Animals? Alas, Not Yet [LW · GW] from Roger's AI, Alignment, and Ethics sequence.

I actually do expect intent alignment to remain secure enough to contain AI-originating agency, as long as it's the primary goal or "singular target". It's counterintuitive that a superintelligent being could want nothing more than to do what its principal wants it to do, but I think it's coherent. And the more competent it gets, the better it will be at doing what you want and nothing more. Before it's that competent, the principal can give more careful instructions, including instructions to check before acting, and to help with its alignment in various ways.

I agree that respect for autonomy/empowerment is one instruction/intent you could give. I do expect that someone will turn their intent-aligned AGI into an autonomous AGI at some point; hopefully after they're quite confident in its alignment and the worth of that goal.

Replies from: Vladimir_Nesov
comment by Vladimir_Nesov · 2024-08-27T04:55:50.770Z · LW(p) · GW(p)

Respect for autonomy is not quite empowerment; it's more like being left alone. The use of this concept is more in defining what it means for an agent or a civilization to develop relatively undisturbed, without getting overwritten by external influence, not in considering ways of helping it develop. So it's also a building block for defining extrapolated volition, because that involves an extended period of not getting destroyed by external influences. But it's conceptually prior to extrapolated volition: it doesn't depend on already knowing what it is, and it's a simpler notion.

It's not by itself a good singular target to set an AI to pursue; for example, it doesn't protect humans from building more extinction-worthy AIs within their membranes, and doesn't facilitate any sort of empowerment. But it seems simple enough and agreeable as a universal norm to be a plausible aspect of many naturally developing AI goals, and it doesn't require absence of interaction, so it allows empowerment etc. if that is also something others provide.

comment by Charlie Steiner · 2024-08-25T16:20:54.766Z · LW(p) · GW(p)

Yeah, I agree with your first paragraph. But I think it's a difference of degree rather than kind. "Do the right thing" is still communication, it's just communication about something indirect, that we nonetheless should be picky about.

comment by Seth Herd · 2024-08-26T20:51:07.617Z · LW(p) · GW(p)

I considered titling a different version of this post "we need to also solve the human alignment problem" or something similar.

comment by Ann (ann-brown) · 2024-08-24T13:24:13.439Z · LW(p) · GW(p)

Perhaps seemingly obvious, but given some of the reactions around Apple putting "Do not hallucinate" into the system prompt of its AI ...

If you do get an instruction-following AI that you can simply give the instruction, "Do the right thing", and it would just do the right thing:

Remember to give the instruction.

Replies from: Seth Herd
comment by Seth Herd · 2024-08-25T00:22:54.619Z · LW(p) · GW(p)

You have to specify the right thing for whom. And the AGI won't know what it is for sure, in a realistic slow takeoff during the critical risk period. See my reply to Charlie above.

But yes, using the AGI's intelligence to help you issue good instructions is definitely a good idea. See my Instruction-following AGI is easier and more likely than value aligned AGI [LW · GW] for more logic on why.

Replies from: ann-brown
comment by Ann (ann-brown) · 2024-08-25T01:07:14.477Z · LW(p) · GW(p)

All non-omniscient agents make decisions with incomplete information. I don't think this will change at any level of takeoff.

Replies from: Seth Herd
comment by Seth Herd · 2024-08-25T19:33:58.850Z · LW(p) · GW(p)

Sure, but my point here is that AGI will be only weakly superhuman during the critical risk period, so it will be highly uncertain, and human judgment is likely to continue to play a large role. Quite possibly to our detriment.