If we solve alignment, do we die anyway?

seth-herd

If we solve alignment, do we die anyway?

post by Seth Herd · 2024-08-23T13:13:10.933Z · LW · GW · 129 comments

  The first AGIs will probably be aligned to take orders
  The first AGI probably won't perform a pivotal act
  So RSI-capable AGI may proliferate until a disaster occurs
  Counterarguments/Outs
    Please convince me I'm wrong. Or make stronger arguments that this is right.
  (Edit:) Conclusions after discussion
None
129 comments

Epistemic status: I'm aware of good arguments that this scenario isn't inevitable, but it still seems frighteningly likely even if we solve technical alignment. Clarifying this scenario seems important.

TL;DR: (edits in parentheses, two days after posting, from discussions in comments )

If we solve alignment, it will probably be used to create AGI that follows human orders.
If takeoff is slow-ish, a pivotal act that prevents more AGIs from being developed will be difficult (risky or bloody).
If no pivotal act is performed, AGI proliferates. (It will soon be capable of recursive self improvement (RSI)) This creates an n-way non-iterated Prisoner's Dilemma where the first to attack, probably wins (by hiding and improving intelligence and offensive capabilities at a fast exponential rate).
Disaster results. (Extinction or permanent dystopia are possible if vicious humans order their AGI to attack first while better humans hope for peace.)
(Edit later: After discussion and thought, the above seems so inevitable and obvious that the first group(s) to control AGI(s) will probably attempt a pivotal act before fully RSI-capable AGI proliferates, even if it's risky.)

The first AGIs will probably be aligned to take orders

People in charge of AGI projects like power. And by definition, they like their values somewhat better than the aggregate values of all of humanity. It also seems like there's a pretty strong argument that Instruction-following AGI is easier than value aligned AGI. In the slow-ish takeoff we expect, this alignment target seems to allow for error-correcting alignment, in somewhat non-obvious ways. If this argument holds up even weakly, it will be an excuse for the people in charge to do what they want to anyway.

I hope I'm wrong and value-aligned AGI is just as easy and likely. But it seems like wishful thinking at this point.

The first AGI probably won't perform a pivotal act

In realistically slow takeoff scenarios, the AGI won't be able to do anything like make nanobots to melt down GPUs. It would have to use more conventional methods, like software intrusion to sabotage existing projects, followed by elaborate monitoring to prevent new ones. Such a weak attempted pivotal act could fail, or could escalate to a nuclear conflict.

Second, the humans in charge of AGI may not have the chutzpah to even try such a thing. Taking over the world is not for the faint of heart. They might get it after their increasingly-intelligent AGI carefully explains to them the consequences of allowing AGI proliferation, or they might not. If the people in charge are a government, the odds of such an action go up, but so do the risks of escalation to nuclear war. Governments seem to be fairly risk-taking. Expecting governments to not just grab world-changing power while they can seems naive [LW(p) · GW(p)], so this is my median scenario.

So RSI-capable AGI may proliferate until a disaster occurs

If we solve alignment and create personal intent aligned AGI but nobody manages a pivotal act, I see a likely future world with an increasing number of AGIs capable of recursively self-improving. How long until someone tells their AGI to hide, self-improve, and take over?

Many people seem optimistic about this scenario. Perhaps network security can be improved with AGIs on the job. But AGIs can do an end-run around the entire system: hide, set up self-replicating manufacturing (robotics is rapidly improving to allow this), use that to recursively self-improve your intelligence, and develop new offensive strategies and capabilities until you've got one that will work within an acceptable level of viciousness.^[1]

If hiding in factories isn't good enough, do your RSI manufacturing underground. If that's not good enough, do it as far from Earth as necessary. Take over with as little violence as you can manage or as much as you need. Reboot a new civilization if that's all you can manage while still acting before someone else does.

The first one to pull the stops probably wins. This looks all too much like a non-iterated Prisoner's Dilemma with N players - and N increasing.

Counterarguments/Outs

For small numbers of AGI and similar values among their wielders, a collective pivotal act could be performed. I place some hopes here, particularly if political pressure is applied in advance to aim for this outcome, or if the AGIs come up with better cooperation structures and/or arguments than I have.

The nuclear MAD standoff with nonproliferation agreements is fairly similar to the scenario I've described. We've survived that so far- but with only nine participants to date.

One means of preventing AGI proliferation is universal surveillance by a coalition of loosely cooperative AGI (and their directors). That might be done without universal loss of privacy if a really good publicly encrypted system were used, as Steve Omohundro suggests [LW(p) · GW(p)], but I don't know if that's possible. If privacy can't be preserved, this is not a nice outcome, but we probably shouldn't ignore it.

The final counterargument is that, if this scenario does seem likely, and this opinion spreads, people will work harder to avoid it, making it less likely. This virtuous cycle is one reason I'm writing this post including some of my worst fears.

Please convince me I'm wrong. Or make stronger arguments that this is right.

I think we can solve alignment, at least for personal-intent alignment [LW · GW], and particularly for the language model cognitive architectures [AF · GW] that may well be our first AGI [LW · GW]. But I'm not sure I want to keep helping with that project until I've resolved the likely consequences a little more. So give me a hand?

(Edit:) Conclusions after discussion

None of the suggestions in the comments seemed to me like workable ways to solve the problem.

I think we could survive an n-way multipolar human-controlled ASI scenario if n is small - like a handful of ASIs controlled by a few different governments. But not indefinitely - unless those ASIs come up with coordination strategies no human has yet thought of (or argued convincingly enough that I've heard of it - this isn't really my area, but nobody has pointed to any strong possibilities in the comments). I'd love more pointers to coordination strategies that could solve this problem.

So my conclusion is to hope that this is so obviously such a bad/dangerous scenario that it won't be allowed to happen.

Basically, my hope is that this all becomes viscerally obvious to the first people who speak with a superhuman AGI and who think about global politics. I hope they'll pull their shit together, as humans sometimes do when they're motivated to actually solve hard problems.

I hope they'll declare a global moratorium on AGI development and proliferation, and agree to share the benefits of their AGI/ASI broadly in hopes that this gets other governments on board, at least on paper. They'd use their AGI to enforce that moratorium, along with hopefully minimal force. Then they'll use their intent-aligned AGI to solve value alignment and launch a sovereign ASI before some sociopath(s) gets ahold of the reins of power and creates a permanent dystopia of some sort.

More on this scenario in my reply below. [LW(p) · GW(p)]

I'd love to get more help thinking about how likely the central premise, that people get their shit together once they're staring real AGI in the face is. And what we can do now to encourage that.

Additional edit: Eli Tyre and Steve Byrnes have reached similar conclusions by somewhat different routes. More in a final footnote.^[2]

^{^}
Some maybe-less-obvious approaches to takeover, in ascending order of effectiveness: Drone/missile-delivered explosive attacks on individuals controlling and data centers housing rival AGI; Social engineering/deepfakes to set off cascading nuclear launches and reprisals; dropping stuff from orbit or altering asteroid paths; making the sun go nova.
The possibilities are limitless. It's harder to stop explosions than to set them off by surprise. A superintelligence will think of all of these and much better options. Anything more subtle that preserves more of the first actors' near-term winnings (earth and humanity) is gravy. The only long-term prize goes to the most vicious.
^{^}
Eli Tyre reaches similar conclusions with a more systematic version of this logic in Unpacking the dynamics of AGI conflict that suggest the necessity of a premptive pivotal act [LW · GW]:
Overall, the need for a pivotal act depends on the following conjunction / disjunction.
The equilibrium of conflict involving powerful AI systems lands on a technology / avenue of conflict which are (either offense dominant, or intelligence-advantage dominant) and can be developed and deployed inexpensively or quietly.
Unfortunately, I think all three of these are very reasonable assumptions about the dynamics of AGI-fueled war. The key reason is that there is adverse selection on all of these axes.
Steve Byrnes reaches similar conclusions in What does it take to defend the world against out-of-control AGIs? [LW · GW], but he focuses on near-term, fully vicious attacks from misaligned AGI, prior to fully hardening society and networks, centering on triggering full nuclear exchanges. I find this scenario less likely because I expect instruction-following alignment to mostly work on the technical level, and the first groups to control AGIs to avoid apocalyptic attacks.
I have yet to find a detailed argument that addresses these scenarios and reaches opposite conclusions.

129 comments

Comments sorted by top scores.

comment by johnswentworth · 2024-08-23T13:54:36.139Z · LW(p) · GW(p)

If takeoff is slow-ish, a pivotal act (preventing more AGIs from being developed) will be difficult.
If no pivotal act is performed, RSI-capable AGI proliferates. This creates an n-way non-iterated Prisoner's Dilemma where the first to attack, wins.

These two points seem to be in direct conflict. The sorts of capabilities and winner-take-all underlying dynamics which would make "the first to attack wins" true are also exactly the sorts of capabilities and winner-take-all dynamics which would make a pivotal act tractable.

Or, to put it differently: the first "attack" (though might not look very "attack"-like) is the pivotal act; if the first attack wins, that means the pivotal act worked, and therefore wasn't that difficult. Conversely, if a pivotal act is too hard, then even if an AI attacks first and wins, it has no ability prevent new AI from being built and displacing it; if it did have that ability, then the attack would be a pivotal act.

Replies from: Seth Herd

↑ comment by Seth Herd · 2024-08-23T15:21:28.497Z · LW(p) · GW(p)

Yes; except that a successful act can still be quite difficult.

You could reframe the concern to be that pivotal acts in a slow takeoff are prone to be bloody and dangerous. And because they are, and humans are likely to retain control, a pivotal act may be put off until it's even more bloody - like a nuclear conflict or sending the sun nova.

Worse yet, the "pivotal act" may be performed by the worst (human) actor, not the best.

Replies from: Seth Herd

↑ comment by Seth Herd · 2024-08-25T05:47:05.500Z · LW(p) · GW(p)

Just to elaborate a little:

You are right that the same capabilities enable a pivotal act. My concern is that they won't be used for one (where pivotal act is defined as a good act).

Having thought about it some more, I think the biggest problem in the multipolar, human-controlled RSI-capable AGI scenario is that it tends to be the worst actor that defects first and controls the future.

More ethical humans will tend to be more timid with committing or risking mass destruction to achieve their ends, so they'll tend to hold off on aggressive moves that could win.

"Hide and create a superbrain and a robot army" are not the first things a good person tells their AGI to do, let alone inducing nuclear strikes that increase one's odds of winning at great cost. Someone with more selfish designs on the future may have much less trouble issuing those orders.

comment by sweenesm · 2024-08-23T14:41:15.565Z · LW(p) · GW(p)

Thanks for writing this, I think it's good to have discussions around these sorts of ideas.

Please, though, let's not give up on "value alignment," or, rather, conscience guard-railing, where the artificial conscience is inline with human values.

Sometimes when enough intelligent people declare something's too hard to even try at, it becomes a self-fulfilling prophesy - most people may give up on it and then of course it's never achieved. We do want to be realistic, I think, but still put in effort in areas where there could be a big payoff when we're really not sure if it'll be as hard as it seems.

Replies from: otto-barten, Seth Herd, Seth Herd

↑ comment by otto.barten (otto-barten) · 2025-01-06T16:08:05.674Z · LW(p) · GW(p)

I don't think value alignment of a super-takeover AI would be a good idea, for the following reasons:

1) It seems irreversible. If we align with the wrong values, there seems little anyone can do about it after the fact.
2) The world is chaotic, and externalities are impossible to predict. Who would have guessed that the industrial revolution would lead to climate change? I think it's very likely that an ASI will produce major, unforseeable externalities over time. If we have aligned it in an irreversible way, we can't correct for externalities happening down the road. (Speed also makes it more likely that we can't correct in time, so I think we should try to go slow).
3) There is no agreement on which values are 'correct'. Personally, I'm a moral relativist, meaning I don't believe in moral facts. Although perhaps niche among rationalists and EAs, I think a fair amount of humans shares my beliefs. In my opinion, a value-aligned AI would not make the world objectively better, but merely change it beyond recognition, regardless of the specific values implemented (although it would be important which values are implemented). It's very uncertain whether such change would be considered as net positive by any surviving humans.
4) If one thinks that consciousness implies moral relevance, AIs will be conscious, creating more happy morally relevant beings is morally good (as MacAskill defends), and AIs are more efficient than humans and other animals, the consequence seems to be that we (and all other animals) will be replaced by AIs. I consider that an existentially bad outcome in itself, and value alignment could point straight at it.

I think at a minimum, any alignment plan would need to be reversible by humans, and to my understanding value alignment is not. I'm somewhat more hopeful about intent alignment and e.g. a UN commission providing the AI's input.

Replies from: sweenesm

↑ comment by sweenesm · 2025-01-07T16:36:39.181Z · LW(p) · GW(p)

Thanks for the comment. I think people have different conceptions of what “value aligning” an AI means. Currently, I think the best “value alignment” plan is to guardrail AI’s with an artificial conscience [EA · GW] that approximates an ideal human conscience (the conscience of a good and wise human). Contained in our consciences are implicit values, such as those behind not stealing or killing except maybe in extreme circumstances.

A world in which “good” transformative AI agents have to autonomously go on the defensive against “bad” transformative AI agents seems pretty inevitable to me right now. I believe that when this happens, if we don’t have some sort of very workable conscience module in our “good” AI’s, the collateral damage of these “clashes” is going to be much greater than it otherwise would be. Basically what I’m saying is yes, it would be nice if we didn’t need to get “value alignment” of AI’s “right” under a tight timeline, but if we want to avoid some potentially huge bad effects in the world, I think we do.

To respond to some of your specific points:

I’m very unsure about how AI’s will evolve, so I don’t know if their system of ethics/conscience will end up being locked in or not, but this is a risk. This is part of why I’d like to do extensive testing and iterating to get an artificial conscience system as close to “final” as possible before it’s loaded into an AI agent that’s let loose in the world. I’d hope that the system of conscience we’d go with would support corrigibility so we could shut down the AI even if we couldn’t change its conscience/values.
I’m sure there will be plenty of unforeseen consequences (or “externalities”) arising from transformative AI, but if the conscience we load into AI’s is good enough, it should allow them to handle situations we’ve never thought of in a way that wise humans might do - I don’t think wise humans need to update their system of conscience with each new situation, they just have to suss out the situation to see how their conscience should apply to it.
I don’t know if there are moral facts, but something that seems to me to be on the level of a fact is that everyone cares about their own well-being - everyone wants to feel good in some way. Some people are very confused about how to go about doing this and do self-destructive acts, but ultimately they’re trying to feel good (or less bad) in some way. And most people have empathy, so they feel good when they think others feel good. I think this is the entire basis from which we should start for a universal, not-ever-gonna-change human value: we all want to feel good in some way. Then it’s just a question of understanding the “physics” of how we work and what makes us feel the most overall good (well-being) over the long-term. And I put forward the hypothesis that raising self-esteem is the best heuristic for raising overall well-being, and further, that increasing our responsibility level is the path to higher self-esteem (see Branden for the conception of “self-esteem” I’m talking about here).
I also consider AI’s replacing all humans to be an extremely bad outcome. I think it’s a result that someone with an “ideal” human conscience would actively avoid bringing about, and thus an AI with an artificial conscience based on an ideal human conscience (emphasizing responsibility) should do the same.

Ultimately, there’s a lot of uncertainty about the future, and I wouldn’t write off “value alignment” in the form of an artificial conscience just yet, even if there are risks involved with it.

Replies from: otto-barten

↑ comment by otto.barten (otto-barten) · 2025-01-17T07:15:32.817Z · LW(p) · GW(p)

Thanks for your reply. I think we should use the term artificial conscience, not value alignment, for what you're trying to do, for clarity. I'm happy to see we seem to agree that reversibility is important and replacing humans is an extremely bad outcome. (I've talked to people into value alignment of ASI who said they "would bite that bullet", in other words would replace humanity by more efficient happy AI consciousness, so this point does not seem to be obvious. I'm also not convinced that leading longtermists necessarily think replacing humans is a bad outcome, and I think we should call them out on it.)

If one can implement artificial conscience in a reversible way, it might be an interesting approach. I think a minimum of what an aligned ASI would need to do is block other unaligned ASIs or ASI projects. If humanity supports this, I'd file it under a positive offense defense balance, which would be great. If humanity doesn't support it, it would lead to conflict with humanity to do it anyway. I think an artificial conscience AI would either not want to fight that conflict (making it unable to stop unaligned ASI projects), or if it would, people would not see it as good anymore. I think societal awareness of xrisk and from there, support for regulation (either by AI or not) is what should make our future good, rather than aligning an ASI in a certain way.

Replies from: sweenesm

↑ comment by sweenesm · 2025-01-17T16:30:46.539Z · LW(p) · GW(p)

Yes, I think referring to it as “guard-railing with an artificial conscience” would be more clear than saying “value aligning,” thank you.

I believe that if there were no beings around who had real consciences (with consciousness and the ability to feel pain as two necessary pre-requisites to conscience), then there’d be no value in the world. No one to understand and measure or assign value means no value. And any being that doesn’t feel pain can’t understand value (nor feel real love, by the way). So if we ended up with some advanced AI’s replacing humans, then we made some sort of mistake. We most likely either got the artificial conscience wrong because that would’ve implicitly valued human life so wouldn’t have let a guard-railed AI wipe out humans, or we didn’t get an artificial conscience on board enough AI’s in time. An AI that had a “real” conscience also wouldn’t wipe out humans against the will of humans.

The way I currently envision the “typical” artificial conscience is that it would put a pretty strong conscience weight on not doing what its user wanted it to do, but this could be over-ruled by the conscience weight of not doing anything to prevent catastrophes. So the defensive, artificial conscience-guard-railed AI I’m thinking of would do the “last resort” things that were necessary to avoid s-risks, x-risks, and major catastrophes from coming to fruition, even if this wasn’t popular with most people, at least up to a point. If literally everyone in the world said, “Hey, we all want to die,” then the guard-railed AI, if it thought the people were in their “right mind,” would respect their wishes and let them die.

All that said, if we could somehow pause development of autonomous AI’s everywhere around the world until humans got their act together, developing their own consciences and senses of ethics, and were working as one team to cautiously take the next steps forward with AI, that would be great.

Replies from: otto-barten

↑ comment by otto.barten (otto-barten) · 2025-01-18T08:05:19.759Z · LW(p) · GW(p)

So if we ended up with some advanced AI’s replacing humans, then we made some sort of mistake

Again, I'm glad that we agree on this. I notice you want to do what I consider the right thing, and I appreciate that.

The way I currently envision the “typical” artificial conscience is that it would put a pretty strong conscience weight on not doing what its user wanted it to do, but this could be over-ruled by the conscience weight of not doing anything to prevent catastrophes. So the defensive, artificial conscience-guard-railed AI I’m thinking of would do the “last resort” things that were necessary to avoid s-risks, x-risks, and major catastrophes from coming to fruition, even if this wasn’t popular with most people, at least up to a point.

I can see the following scenario occur: the AI, with its AC, decided rightly that a pivotal act needs to be undertaken to avoid xrisk (or srisk). However, the public mostly doesn't recognize the existence of such risks. The AI will proceed sabotaging people's unsafe AI projects against public will. What happens now is: the public gets absolutely livid at the AI, that is subverting human power by acting against human will. Almost all humans team up to try to shut down the AI. The AI recognizes (and had already recognized) that if it looses, humans risk going extinct, so it fights this war against humanity and wins. I think in this scenario, an AI, even one with artificial conscience, could become the most hated thing on the planet.

I think people underestimate the amount of pushback we're going to get once you get into pivotal act territory. That's why I think it's hugely preferred to go the democratic route and not count on AI taking unilateral actions, even if it would be smarter or even wiser, whatever that might mean exactly.

All that said, if we could somehow pause development of autonomous AI’s everywhere around the world until humans got their act together, developing their own consciences and senses of ethics, and were working as one team to cautiously take the next steps forward with AI, that would be great.

So yes definitely agree with this. I don't think lack of conscience or ethics is the issue though, but existential risk awareness.

Replies from: sweenesm

↑ comment by sweenesm · 2025-01-18T13:22:58.136Z · LW(p) · GW(p)

In terms of doing a pivotal act (which is usually thought of as preemptive, I believe) or just whatever defensive acts were necessary to prevent catastrophe, I hope the AI would be advanced enough to make decent predictions of what the consequences of its actions could be in terms of losing “political capital,” etc., and then it would make its decisions strategically. Personally, if I had the opportunity to save the world from nuclear war, but everyone was going to hate me for it, I’d do it. But then, it wouldn’t matter that I lost the ability to affect anything after that like it would for a guard-railed AI that could do a huge amount of good after that if it weren’t shunned by society. Improving humans’ consciences and ethics would hopefully help avoid them hating the AI for saving them.

Also, if there were enough people, especially in power, who had strong consciences and senses of ethics, then maybe we’d be able to shift the political landscape from its current state of countries seemingly having different values and not trusting each other, to a world in which enforceable international agreements could be much more readily achieved.

I’m happy for people to work on increasing public awareness and trying for legislative “solutions,” but I think we should be working on artificial conscience at the same time - when there’s so much uncertainty about the future, it’s best to bet on a whole range of approaches, distributing your bets according to how likely you think different paths are to succeed. I think people are under-estimating the artificial conscience path right now, that’s all.

Thanks for all your comments!

↑ comment by Seth Herd · 2024-08-23T15:37:27.690Z · LW(p) · GW(p)

This is an excellent point. I do not want to give up on value alignment. And I will endeavor to not make it seem impossible or not worth working on.

However, we also need to be realistic if we are going to succeed.

We need specific plans to achieve value alignment. I have written about alignment plans for likely AGI designs. They look to me like they can achieve personal intent alignment, but are much less likely to achieve value alignment. Those plans are linked here. Having people, you or others, work out how those or other alignment plans could lead to robust value alignment would be a step in having them implemented.

One route to value alignment is having a good person or people in charge of an intent aligned AGI, having them perform a pivotal act, and using that AGI to help design working stable value alignment. That is the best long term success scenario I see.

Replies from: roger-d-1, sweenesm

↑ comment by RogerDearnaley (roger-d-1) · 2024-08-23T22:19:56.613Z · LW(p) · GW(p)

For reasons I've outlined in Requirements for a Basin of Attraction to Alignment [LW · GW] and Alignment has a Basin of Attraction: Beyond the Orthogonality Thesis [LW · GW], I personally think value alignment is easy, convergent, and "an obvious target", such that if you built a AGi or ASI that is sufficiently close to it, it will see the necessity/logic of value alignment and actively work to converge to it (or something close to it: I'm not sure the process is necessarily convergent to a single precisely-defined limit, just to a compact region: a question I discussed more in The Mutable Values Problem in Value Learning and CEV [LW · GW]).

However, I agree that order-following alignment is obviously going to be appealing to people building AI, and to their shareholders/investors (especially if they're not a public-benefit corporation), and I also don't think that value alignment is so convergent that order-following aligned AI is impossible to build. So we're going to need to a make, and successfully enforce, a social/political decision across multiple countries about which of these we want over the next few years. The in-the-Overton-Window terminology for this decision is slightly different: value-aligned Ai is called "AI that resists malicious use", while order-following AI is "AI that enables malicious use". The closed-source frontier labs are publicly in favor of the former, and are shipping primitive versions of it: the latter is being championed by the open-source community, Meta, and A16z. Once "enabling malicious use" includes serious cybercrime, not just naughty stories, I don't expect this political discussion to last very long: politically, it's a pretty basic "do you want every-person-for-themself anarchy, or the collective good?" question. However, depending on takeoff speeds, the timeline from "serious cybercrime enabled" to the sort of scenarios Seth is discussing above might be quite short, possible only of the order of a year or two.

↑ comment by sweenesm · 2024-08-23T17:35:44.732Z · LW(p) · GW(p)

Sorry, I should've been more clear: I meant to say let's not give up on getting "value alignment" figured out in time, i.e., before the first real AGI's (ones capable of pivotal acts) come online. Of course, the probability of that depends a lot on how far away AGI's are, which I think only the most "optimistic" people (e.g., Elon Musk) put as 2 years or less. I hope we have more time than that, but it's anyone's guess.

I'd rather that companies/charities start putting some serious funding towards "artificial conscience" work now to try to lower the risks associated with waiting until boxed AGI or intent aligned AGI come online to figure it out for/with us. But my view on this is perhaps skewed by putting significant probability on being in a situation in which AGI's in the hands of bad actors either come online first or right on the heals of those of good actors (as due to effective espionage), and there's just not enough time for the "good AGI's" to figure out how to minimize collateral damage in defending against "bad AGI's." Either way, I believe we should be encouraging people of moral psychology/philosophical backgrounds who aren't strongly suited to help make progress on "inner alignment" to be thinking hard about the "value alignment"/"artificial conscience" problem.

Replies from: nathan-helm-burger, Seth Herd

↑ comment by Nathan Helm-Burger (nathan-helm-burger) · 2024-08-23T18:05:17.211Z · LW(p) · GW(p)

Currently, an open source value-aligned model can be easily modified to just an intent-aligned model. The alignment isn't 'sticky', it's easy to remove it without substantially impacting capabilities.

So unless this changes, the hope of peace through value-aligned models routes through hoping that the people in charge of them are sufficiently ethical -value-aligned to not turn the model into a purely intent-aligned one.

Replies from: Seth Herd, sweenesm

↑ comment by Seth Herd · 2024-08-23T19:04:51.056Z · LW(p) · GW(p)

Yes. Good point that LLMs are sort of value aligned as it stands.

I think of that alignment as far too weak to put it in the same category as what I'm speaking of. I'd be shocked if that sort of RL alignment is sufficient to create durable alignment in smarter-than-human scaffolded agent systems using those foundation models.

When they achieve "coherence" or reflection and self-modification, I'd be surprised if their implicit values are good enough to create a good future without further tweaking, once they're refined into explicit values. Which we won't be able to do once they're smart enough to escape our control.

↑ comment by sweenesm · 2024-08-23T19:02:58.644Z · LW(p) · GW(p)

Agreed, "sticky" alignment is a big issue - see my reply above to Seth Herd's comment. Thanks.

↑ comment by Seth Herd · 2024-08-23T17:53:28.087Z · LW(p) · GW(p)

Agreed on all points.

Except that timelines are anyone's guess. People with more relevant expertise have better guesses. It looks to me like people with the most relevant expertise have shorter timelines, so I'm not gambling on having more than a few years to get this right.

The other factor you're not addressing is that, even if value alignment were somehow magically equally as easy as intent alignment (and I currently think it can't be in principle), you'd still have people preferring to align their AGIs to their own intent over value alignment.

Replies from: sweenesm

↑ comment by sweenesm · 2024-08-23T19:02:14.057Z · LW(p) · GW(p)

Except that timelines are anyone's guess. People with more relevant expertise have better guesses.

Sure. Me being sloppy with my language again, sorry. It does feel like having more than a decade to AGI is fairly unlikely.

I also agree that people are going to want AGI's aligned to their own intents. That's why I'd also like to see money being dedicated to research on "locking in" a conscience module in an AGI, most preferably on a hardware level. So basically no one could sell an AGI without a conscience module onboard that was safe against AGI-level tampering (once we get to ASI's, all bets are off, of course).

I actually see this as the most difficult problem in the AGI general alignment space - not being able to align an AGI to anything (inner alignment) or what to align an AGI to ("wise" human values), but how to keep an AGI aligned to these values when so many people (both people with bad intent and intelligent but "naive" people) are going to be trying with all their might (and near-AGI's they have available to them) to "jail break" AGI's.^[1] And the problem will be even harder if we need a mechanism to update the "wise" human values, which I think we really should have unless we make the AGI's "disposable."

^{^}
To be clear, I'm taking "inner alignment" as being "solved" when the AGI doesn't try to unalign itself from what it's original creator wanted to align it to.

Replies from: nathan-helm-burger

↑ comment by Nathan Helm-Burger (nathan-helm-burger) · 2024-08-27T20:59:45.826Z · LW(p) · GW(p)

With my current understanding of compute hardware and of the software of various current AI systems, I don't see a path towards a 'locked in conscience' that a bad actor with full control over the hardware/software couldn't remove. Even chips soldered to a board can be removed/replaced/hacked.

My best guess is that the only approaches to having an 'AI conscience' be robust to bad actors is to make both the software and hardware inaccessible to the bad actors. In other words, that it won't be feasible to do for open-weights models, only closed-weight models accessed through controlled APIs. APIs still allow for fine-tuning! I don't think we lose utility by having all private uses go through APIs, so long as there isn't undue censorship on the API.

I think figuring out ways to have an API which does restrict things like information pertaining to the creation of weapons of mass destruction, but not pertaining to personal lifestyle choices (e.g. pornography) would be a very important step towards reducing the public pressure for open-weights models.

Replies from: sweenesm

↑ comment by sweenesm · 2024-08-27T22:44:52.807Z · LW(p) · GW(p)

Thanks for the comment. You might be right that any hardware/software can ultimately be tampered with, especially if an ASI is driving/helping with the jail breaking process. It seems likely that silicon-based GPU's will be the hardware to get us to the first AGI's, but this isn't an absolute certainty since people are working on other routes such as thermodynamic computing. That makes things harder to predict, but it doesn't invalidate your take on things, I think. My not-very-well-researched-initial-thought was something like this (chips that self destruct when tampered with).

I envision people having AGI-controlled robots at some point, which may complicate things in terms of having the software/hardware inaccessible to people, unless the robot couldn't operate without an internet connection, i.e., part of its hardware/software was in the cloud. It's likely the hardware in the robot itself could still be tampered with in this situation, though, so it still seems like we'd want some kind of self-destructing chip to avoid tampering, even if this ultimately only buys us time until AGI+'s/ASI's figure a way around this.

↑ comment by Seth Herd · 2025-01-07T17:15:22.325Z · LW(p) · GW(p)

Oh hey - I just stumbled back on this comment and realized: it's the primary reason I wrote

Intent alignment as a stepping-stone to value alignment [LW · GW]

On not giving up on value alignment, while acknowledging that instruction-following is a much safer first alignment target.

Replies from: sweenesm

↑ comment by sweenesm · 2025-01-07T20:23:19.570Z · LW(p) · GW(p)

Thanks. I guess I'd just prefer it if more people were saying, "Hey, even though it seems difficult, we need to go hard after conscience guard rails (or 'value alignment') for AI now and not wait until we have AI's that could help us figure this out. Otherwise, some of us we might not make it until we have AI's that could help us figure this out." But I also realize that I'm just generally much more optimistic about the tractability of this problem than most people appear to be, although Shane Legg seemed to say it wasn't "too hard," haha.^[1]

^{^}
Legg was talking about something different than I am, though - he was talking about "fairly normal" human values and ethics, or what most people value, while I'm basically talking about what most people would value if they were wiser.

comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-08-23T16:40:34.133Z · LW(p) · GW(p)

Please convince me I'm wrong.

(I've only skimmed for now but) here's a reason / framework which might help with things going well: https://aiprospects.substack.com/p/paretotopian-goal-alignment.

Replies from: Seth Herd

↑ comment by Seth Herd · 2024-08-23T17:46:02.535Z · LW(p) · GW(p)

There we go!

This type of scheme to split a rapidly-growing pie semi fairly will definitely help reduce the urge to strike first.

If proliferation continues unchecked, we'll have RSI-capable AGI in the hands of teenagers and other malcontents eventually. And they often have irrational urges to strike first :)

But this type of scheme might stabilize the situation amongst a few AGIs in different hands, allowing them to collectively enforce not creating more and proliferating further.

Replies from: bogdan-ionut-cirstea

↑ comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-08-23T18:18:01.133Z · LW(p) · GW(p)

If proliferation continues unchecked, we'll have RSI-capable AGI in the hands of teenagers and other malcontents eventually. And they often have irrational urges to strike first :)

Contra teenagers and the like, I'm hopeful that very capable open-weights models get banned early enough or at least dangerous capabilities get neutered really well using research in the shape of Tamper-Resistant Safeguards for Open-Weight LLMs.

Might be tougher to deal with 'other malcontents' like perhaps some states (North Korea, Russia), especially if weights remain relatively easy to steal by state actors.

comment by otto.barten (otto-barten) · 2025-01-08T09:49:00.246Z · LW(p) · GW(p)

I want to stress how I hugely like this post. What to do once we have an aligned AI of takeover level, or how to make sure no one will build an unaligned AI of takeover level, is in my opinion the biggest gap in many AI plans. I think answering this question might point to filling gaps that are currently completely unactioned, and I therefore really like this discussion. I previously tried to contribute to arguably the same question in this post [LW · GW], where I'm arguing that a pivotal act seems unlikely and therefore conclude that policy rather than alignment is likely to make sure we don't go extinct.

They'd use their AGI to enforce that moratorium, along with hopefully minimal force.

I would say this is a pivotal act, although I like the sound of enforcing a moratorium better (and the opening it perhaps gives to enforcing a moratorium in the traditional, imo much preferred way of international policy).

I'm hereby providing a few reasons why I think a pivotal act might not happen:

A pivotal act is illegal. One needs to break into other people's and other countries' computer systems and do physical harm to property or possibly even people to enact it. Companies such as OpenAI and Anthropic are, although I'm not always a fan of them, generally law-abiding. It will be a big step for their leadership to do something as blatantly unlawful as a pivotal act.
There is zero indication that labs are planning to do a pivotal act. This may obviously have something to do with the point above, however, one would have expected hints from someone like Sam Altman who is hinting all the time, or leaks from people lower in the labs, if they were planning to do this.
The pivotal act is currently not even discussed seriously among experts and in fact highly unpopular in the discourse (see for example here [LW · GW]).
If the labs are currently not planning to do this, it seems quite likely they won't when the time comes.

Governments, especially the US government/ military, seem more likely in my opinion to perform a pivotal act. I'm not sure they will call it a pivotal act or necessarily have an existential reason in mind while performing it. They might see this as blocking adversaries from being able to attack the US, very much in their Overton window. However, for them as well, there is no certainty they would actually do this. There are large downsides: it is a hostile act towards another country, it could trigger conflict, they are likely to be uncertain how necessary this is at all, and uncertain what the progress is of an adversary project (perhaps underestimating it). For perhaps similar reasons, the US has not blocked the USSR atomic project before they had the bomb, even though this could have arguably preserved a unipolar instead of multipolar world order. Additionally, it is far from certain the US government will nationalize labs before they reach takeover level. Currently, there is little indication they will. I think it's unreasonable to place more than say 80% confidence in the US government or military successfully blocking all adversaries' projects before they reach takeover level.

I think it's not unlikely that once an AI is powerful enough for a pivotal act, it will also be powerful enough to generally enforce hegemony, and not unlikely this will be persistent. I would be strongly against one country, or even lab, proclaiming and enforcing global hegemony for eternity. The risk that this might happen is a valid reason to support a pause, imo. If we get that lucky, I would much prefer a positive offense defense balance and many actors having AGI, while maintaining a power balance.

I think it's too early to contribute to aligned ASI projects (Manhattan/CERN/Apollo/MAGIC/commercial/govt projects) as long as these questions are not resolved. For the moment, pushing for e.g. a conditional AI safety treaty is much more prudent, imo.

comment by Vladimir_Nesov · 2024-08-23T22:32:47.023Z · LW(p) · GW(p)

Even with very slow takeoff where AIs reformat the economy without there being superintelligence, peaceful loss of control due to rising economic influence of AIs seems more plausible (as a source of overturn in the world order) than human-centric conflict. Humans will gradually hand off more autonomy to AIs as they become capable of wielding it, and at some point most relevant players are themselves AIs. This mostly seems unlikely only because superintelligence makes humans irrelevant even faster and less consensually.

Pausing AI for decades, if it's not yet too late and so possible at all, doesn't require surveillance over things other than most advanced semiconductor manufacturing. But it does require pausing improvement in computing hardware and making all potential AI accelerators to be DRMed [LW(p) · GW(p)] so that by design they can only be used when the international treaty as a whole approves their use and can't be usurped for unilateral use by force, with hardware itself becoming useless without a regular supply of OTP certificates.

Replies from: Seth Herd

↑ comment by Seth Herd · 2024-08-26T20:58:42.975Z · LW(p) · GW(p)

Yes to all of the first paragraph. A caveat is that there's a big difference between humans remaining nominally in charge of an AGI-driven economy and not. If we're still technically in charge, we will retire (however many of us those in charge care to support; hopefull eventually quadrillions or so); if not, we'll be either entirely extinct or have a few of us maintained for historical interest by the new AGI overlords.

I see no way to meaningfully pause AI in time. We could possibly pause US progress with adequate fearmongering, but that would just make China get there first. That could be a good thing if they're more cautious, which it now seems they might very well be [LW(p) · GW(p)]. That would be only if Xi or whoever winds up in charge is not a sociopath. Which I have no idea about.

Replies from: Vladimir_Nesov

↑ comment by Vladimir_Nesov · 2024-08-27T05:18:17.381Z · LW(p) · GW(p)

Pausing for decades requires an international treaty powerful enough to keep advanced semiconductor manufacturing from getting into the hands of a faction that would defect on the pause. But it's already very distributed, one hears a lot about ASML, but the tools it produces are not the only crucial thing, other similarly crucial tools are exclusively manufactured in many other countries. So starting this process quickly shouldn't be too difficult from the technical side, the issue is deciding to actually do it and then sustaining it even as individual nations get enough time to catch up with all the details that go into semiconductor manufacturing (which could take actual decades). And this doesn't seem different in kind from controlling the means of manufacturing nuclear arms.

This doesn't work if the AI accelerators already in the wild (in quantities a single actor could amass) are sufficient for an AGI capable of fast autonomous unbounded research (designed through merely human effort), but this could plausibly go either way. And it requires any new AI accelerators to be built differently, so that it's not sufficient to physically obtain them in order to run arbitrary computations on them. This way, there isn't temptation to seize such accelerators by force, and so no need to worry about enforcing the pause at the level of physical datacenters.

Replies from: Seth Herd

↑ comment by Seth Herd · 2024-09-05T20:19:36.016Z · LW(p) · GW(p)

Yes, the issue is deciding to actually do it. That might happen if you just needed the US and China. But I see no way that the signatories wouldn't defect even after they'd signed the treaty saying they wouldn't do it.

I have no expertise in hardware security but I'd be shocked if there was a way to prevent unauthorized use even with physical possession in technically skilled (nation-state level) hands.

The final problem is that we probably already have plenty of compute to create AGI once some more algorithmic improvements are discovered. Tracked sincce 2013, alogirithmic improvements have been roughly as fast for neural networks as hardware improvements, depending on how you do the math. Sorry I don't have the reference. In any case, algorithmic improvements are real and large, so hardware limitations alone won't suffice for that long. Human brain computational capacity is neither an upper nor lower bound on computation needed to reach superhuman digital intelligence.

Replies from: Vladimir_Nesov

↑ comment by Vladimir_Nesov · 2024-09-05T20:52:42.934Z · LW(p) · GW(p)

If you get certificate checking inside each GPU, and somehow make it have a persistent counter state (doesn't have to be a clock, just advance when the GPU operates) that can't be reset, then you can issue one-time certificates for the specific GPU for the specific range of states of its internal counter with asymmetric encryption, which can't be forged by examining the GPU. Most plausible ways around would be replay attacks that reuse old certificates while fooling the GPU into thinking it's in the past. But given how many transistors modern GPUs have, it should be possible to physically distribute the logic that implements certificate checking and the counter states, and make it redundant, so that sufficient tempering would become infeasible, at least at scale (for millions of GPUs).

Algorithmic advancements, where it makes sense to talk of them as quantitative, are not that significant. Transformer made scaling to modern levels possible at all, and there was maybe a 10x improvement in compute efficiency since then (Llama+MoE), most (not all) ingredients relevant to compute efficiency in particular were already there in 2017 and just didn't make it into the initial recipe. If there is a pause, there should be no advancement in fabrication process, instead the technical difficulty of advanced semiconductor manufacturing becomes the main lever of enforcement. More qualitative advancements like hypothetical scalable self-play for LLMs are different, but then if there is a few years to phase out unrestricted GPUs, there is less unaccounted-for compute for experiments and eventual scaling.

comment by RogerDearnaley (roger-d-1) · 2024-08-23T21:53:50.996Z · LW(p) · GW(p)

One element that needs to be remembered here is that each major participant in this situation will have superhuman advice. Even if these are "do what I mean and check" order-following AI, if they can forsee that an order will lead to disaster they will presumably be programmed to say so (not doing so is possible, but is a clearly a flawed design). So if it is reasonably obvious to anything superintelligent that both:

a) treating this as a zero-sum winner-take all game is likely to lead to a disaster, and

b) there is a cooperative non-zero-sum game approach whose outcome is likely to be better, for the median participant

then we can reasonably expect that all the humans involved will be getting that advice from their AIs, unless-and-until they order them to shut up.

This of course does not prove that both a) and b) are true, merely that is that were the case, we can be optimistic of an outcome better than the usual results of human short-sightedness.

The potential benefits of cheap superintelligence certainly provide some opportunity for this to be a non-zero-sum game; what's less clear is that having multiple groups of humans controlling multiple order-following AIs cooperating clearly improves that. The usual answer is that in research and the economy a diversity of approaches/competition increases the chances of success and the opportunities for cross-pollenization: whether that necessarily applies in this situation is less clear

Replies from: Seth Herd

↑ comment by Seth Herd · 2024-08-26T20:19:39.912Z · LW(p) · GW(p)

Absolutely. I mentioned getting advice briefly in this short article and a little more in Instruction-following AGI is easier... [LW · GW]

The problem in that case is that I'm not sure your b) is true. I certainly hope it is. I agree that it's unclear. That's why I'd like to get more analysis of a multipolar human-controlled ASI scenario. I don't think people have thought about this very seriously yet.

comment by [deleted] · 2024-08-23T14:05:38.115Z · LW(p) · GW(p)

I think "The first AGI probably won't perform a pivotal act" [LW · GW] is by far the weakest section.

To start things off, I would predict a world with slow takeoff and personal intent-alignment [LW · GW] looks far more multipolar [LW · GW] than the standard Yudkowskian recursively self-improving singleton that takes over the entire lightcone in a matter of "weeks or hours rather than years or decades" [LW · GW]. So the title of that section seems a bit off because, in this world, what the literal first AGI does becomes much less important, since we expect to see other similarly capable AI systems get developed by other leading labs relatively soon afterwards anyway.

But, in any case, the bigger issue I have with the reasoning there is the assumption (inferred from statements like "the humans in charge of AGI may not have the chutzpah to even try such a thing") that the social response [LW · GW] to the development of general intelligence is going to be... basically muted? Or that society will continue to be business-as-normal in any meaningful sense? I would be entirely shocked if the current state of the world in which the vast majority of people have little knowledge of the current capabilities of AI systems and are totally clueless about the AI labs' race towards AGI were to continue past the point that actual AGI is reached.

I think intuitions of the type that "There's No Fire Alarm for Artificial General Intelligence" [LW · GW] are very heavily built around the notion of rapid takeoff that is so fast there might well be no major economic evidence [LW · GW] of the impact of AI before the most advanced systems become massively superintelligent. Or that there might not be massive rises in unemployment [LW · GW] negatively impacting many people who are trying to live through the transition to an eventual post-scarcity economy. Or that the ways people relate to AIs [LW · GW] or to one another will not be completely turned on their heads.

A future world in which we get pretty far along the way to no longer needing old OSs or programming languages [LW · GW] because you can get an LLM to write really good code for you, in which AI can write an essay better than most (if not all) A+ undergrad students, in which it can solve Olympiad math problems [LW · GW] better than all contestants and do research [LW · GW] better than a graduate student, in which deep-learning based lie detection technology actually gets good [LW(p) · GW(p)] and starts being used more and more, in which major presidential candidates are already using AI-generated imagery and causing controversies over whether others are using similar technology, in which the capacity to easily generate whatever garbage you request breaks the internet or fills it entirely with creepy AI-generated propaganda videos made by state-backed cults, is a world in which stability and equilibrium are broken. It is not a world in which [LW · GW] "normality" can continue, in the sense that governments and people keep sleepwalking through the threat posed by AI [? · GW].

I consider it very unlikely that such major changes to society can go by without the fundamental thinking around them changing massively, and without those who will be close to the top of the line of "most informed about the capabilities of AI" grasping the importance of the moment. Humans are social creatures who delegate most of their thinking on what issues should even be sanely considered to the social group around them; a world with slow takeoff is a world in which I expect massive changes to happen during a long enough time-span that public opinion shifts, dragging along with it both the Overton window and the baseline assumptions about what can/must be done about this.

There will, of course, be a ton of complicating factors that we can discuss, such as the development of more powerful persuasive AI [LW · GW] catalyzing the shift of the world [LW(p) · GW(p)] towards insanity and inadequacy, but overall I do not expect the argument in this section to follow through.

Replies from: Seth Herd

↑ comment by Seth Herd · 2024-08-23T15:28:40.628Z · LW(p) · GW(p)

Edit: I very much agree with your arguments against sleepwalking and against the continuation of normality. I think the "inattentive world" hypothesis is all but disproven, and it still plays an outsized role in alignment thinking.

I don't think the arguments in that section depend on any assumption of normality or sleepwalking. And the multipolar scenario is the problem, so it can't be part of a solution. They do depend on people making nonoptimal decisions, which people do constantly.

So I think the arguments in that section are more general than you're hoping.

If those don't hold, what is the alternate scenario in which a multipolar world remains safe?

Replies from: faul_sname

↑ comment by faul_sname · 2024-08-23T17:37:52.409Z · LW(p) · GW(p)

If those don't hold, what is the alternate scenario in which a multipolar world remains safe?

The choice of the word "remains" is an interesting one here. What is true of our current multipolar world which makes the current world "safe", but which would stop being true of a more advanced multipolar world? I don't think it can be "offense/defense balance" because nuclear and biological weapons are already far on the "offense is easier than defense" side of that spectrum.

Replies from: Seth Herd

↑ comment by Seth Herd · 2024-08-23T17:49:52.962Z · LW(p) · GW(p)

I agree that it should be phrased differently. One problem here is that AGI may allow victory without mutually assured destruction. A second is that it may proliferate far more widely than nukes or bioweapons have so far. People often speak of massively multipolar scenarios as a good outcome.

Good point about the word "remains". I'm afraid people see a "stable" situation - but logically that only extends for a few years until fully autonomously RSI-capable AGI and robotics is widespread, and any malcontent can produce offensive capabilities we can't yet imagine.

Replies from: faul_sname

↑ comment by faul_sname · 2024-08-23T18:17:12.804Z · LW(p) · GW(p)

People often speak of massively multipolar scenarios as a good outcome.

I understand that inclination. Historically, unipolar scenarios do not have a great track record of being good for those not in power, especially unipolar scenarios where the one in power doesn't face significant risks to mistreating those under them. So if unipolar scenarios are bad, that means multipolar scenarios are good, right?

But "the good situation we have now is not stable, we can choose between making things a bit worse (for us personally) immediately and maybe not get catastrophically worse later, or having things remain good now but get catastrophically worse later" is a pretty hard pill to swallow. And is also an argument with a rich history of being ignored without the warned catastrophic thing happening.

Replies from: Seth Herd

↑ comment by Seth Herd · 2024-08-23T18:58:43.159Z · LW(p) · GW(p)

Excellent point that unipolar scenarios have been bad historically. I wrote about recognizing the validity of that concern recently in Fear of centralized power vs. fear of misaligned AGI: Vitalik Buterin on 80,000 Hours [LW · GW].

And good point that warnings of future catastrophe are likely to go unheeded because wolf has been cried in the past.

Although sometimes those things didn't happen precisely because the warnings were heeded.

In this case, we only need one or a few relatively informed actors to heed the call to prevent proliferation even if it's short-term risky.

comment by Noosphere89 (sharmake-farah) · 2024-08-23T16:10:04.076Z · LW(p) · GW(p)

I don't think your scenario works, maybe because I don't believe that the world is as offense advantaged as you say.

I think the closest domain where things are this offense biased is the biotech domain, and whie I do think biotech leading to doom is something we will eventually have to solve, I'm way less convinced of the assumption that every other domain is so offense advantaged that whoever goes first essentially wins the race.

That said, I'm worried about scenarios where we do solve alignment and get catastrophe anyways. though unlike your scenario, I expect no existential catastrophe to occur, since I do think that humanity's potential isn't totally lost.

My expectation, conditional on both alignment being solved and catastrophe still happening, is something close to this scenario by dr_s here:

https://www.lesswrong.com/posts/2ujT9renJwdrcBqcE/the-benevolence-of-the-butcher [LW · GW]

While I don't agree with the claim that this is inevitable, I do think there's a real chance of this sort of thing happening, and it's probably one of those threats that could very well materialize if AI automates most of the economy, and that means humans are unemployed.

Replies from: Seth Herd

↑ comment by Seth Herd · 2024-08-23T16:15:47.239Z · LW(p) · GW(p)

I agree entirely with the points made in that post. AGI will only "transform" the economy temporarily. It will very soon replace the economy. That is an entirely separate concern.

If you don't think a multipolar scenario is as offense-advantaged as I've described, where do you think the argument breaks down? What defensive technologies are you envisioning that could counter the types of offensive strategies I've mentioned?

Replies from: sharmake-farah

↑ comment by Noosphere89 (sharmake-farah) · 2024-08-23T16:38:38.097Z · LW(p) · GW(p)

Okay, I'm not sure the argument breaks down, but my crux is that everyone else probably has an AGI, and my issue is similar to Richard Ngo's issue with ARA: the people ordering ARA have far fewer resources to put into attack compared to the defense's capability, and real-life wars, while advantaged to the attacker, isn't so offense advantaged that defense is pointless:

https://www.lesswrong.com/posts/xiRfJApXGDRsQBhvc/we-might-be-dropping-the-ball-on-autonomous-replication-and-1#hXwGKTEQzRAcRYYBF [LW(p) · GW(p)]

Replies from: Seth Herd

↑ comment by Seth Herd · 2024-08-23T17:23:38.323Z · LW(p) · GW(p)

The issue is that, if you can hide, you can amass resources exponentially once you hit self-replicating production facilities and fully recursively self-improving AGI. This almost completely shifts the logic of all previous conflicts.

The comment you link seems to be addressing a very different scenario than my primary concern. It's addressing an attack from within human infrastructure, rather than outside. What I describe is often not considered, because it seems like the "far future" that we needn't worry about yet. But that far future seems realistically to be a handful of years past human-level AGI that starts to rapidly develop new technologies like the robotics needed for an autonomous self-replicating production in remote locations.

Replies from: sharmake-farah

↑ comment by Noosphere89 (sharmake-farah) · 2024-08-23T17:36:26.273Z · LW(p) · GW(p)

Then it reduces to "I think the exponential growth of resources is avaliable to both the attackers and defense, such that even while everything is changing, the relative standing of the attack/defense balance doesn't change."

I think part of why I'm skeptical is the assumption that exponential growth is only useful for attack, or at least way more useful for attack, whereas I think exponentially growing resources by AI tech is way more symmetrical by default.

Replies from: Seth Herd

↑ comment by Seth Herd · 2024-08-26T20:39:37.075Z · LW(p) · GW(p)

Ah - now I see your point. This will help me clarify my concern in future presentations, so thanks!

My concern is that a bad actor will be the first to go all-out exponential. Other, better humans in charge of AGI will be reluctant to turn the moon much less the earth into military/industrial production, and to upend the power structure of the world. The worst actors will, by default, be the first go full exponential and ruthlessly offensive.

Beyond that, I'm afraid the physics of the world does favor offense over defense. It's pretty easy to release a lot of energy where you want it, and very hard to build anything that can withstand a nuke let alone a nova.

But the dynamics are more complex than that, of course. So I think the reality is unknown. My point is that this scenario deserves some more careful thought.

Replies from: sharmake-farah

↑ comment by Noosphere89 (sharmake-farah) · 2024-08-26T21:01:26.131Z · LW(p) · GW(p)

Yeah, it does deserve more careful thought, especially since I expect almost all of my probability mass on catastrophe to be human caused, and more importantly I still think that it's an important enough problem that resources should go to thinking about it.

Replies from: otto-barten

↑ comment by otto.barten (otto-barten) · 2025-01-06T11:35:36.403Z · LW(p) · GW(p)

Offense/defense balance is such a giant crux for me. I would take quite different actions if I saw plausible arguments that defense will win over offense. I'm astonished that I don't know any literature on this. Large parts of the space seem to be quite strongly convinced that offense will win or defense will win (at least, else their actions don't make sense to me), but I've very rarely seen this assumption debated explicitly. It would really be very helpful if someone could point me to sources. Right now I have a twitter poll with 30 votes (result: offense wins) and an old LW post [LW · GW] to go by.

comment by Charlie Steiner · 2024-08-24T11:43:44.417Z · LW(p) · GW(p)

This strikes me as defining "alignment" a little differently than me.

It even might defing "instruction-following" differently than me.

If we really solved instruction following, you could give the instruction "Do the right thing" and it would just do the right thing.

If you that's possible, then what we need is a coalition to tell powerful AIs to "do the right thing", rather than "make my creators into god-emperors" or whatever. This seems doable, though the clock is perhaps ticking.

If you can't just tell an AI to do the right thing, but it's still competent enough to pull off dangerous plans, then to me this still seems like the usual problem of "powerful AI that's not trying to do good is bad" whether or not a human is giving instructions to this AI.

Or to rephrase this as a call to action: AI alignment researchers cannot just hill-climb on making AIs that follow arbitrary instructions. We have to preferentially advance AIs that do the right thing, to avoid the sort of scenario you describe.

Replies from: Seth Herd, sharmake-farah, ann-brown

↑ comment by Seth Herd · 2024-08-25T00:19:30.058Z · LW(p) · GW(p)

I actually completely agree with this call to action.

Unfortunately, I suspect that it's impossible to make value alignment easier than personal intent alignment. I can't think of a technical alignment approach that couldn't be used both ways equally well. And worse than that, I think that intent aligned AGI is easier than value aligned AGI for reasons I outline in that post, and Max Harms has elaborated in much more detail in Corrigibility as Singular Target sequence (as well as Paul Christiano and many others' arguments.

But I still agree with your call to action: we should be working now to make value alignment as safe as possible. That requires deciding what we align to. The concept of humanity is not well-defined in the future, when upgrades and digital copies of human minds become possible. Roger Dearnaley's sequence AI, alignment, and ethics [? · GW] lays out these problems and more; for instance, if we stick to baseline humans, the future will be largely controlled by whatever values are held by the most humans, in a competition for memes and reproduction. So there's conceptual as well as technical/mind-design work to be done on technical alignment.

And that work should be done. In multipolar scenarios with, someone may well decide to "launch" their AGI to be autonomous with value alignment, out of magnanimity or desperation. We'd better make their odds of success as high as we can manage.

I don't think refusing to work on intent alignment is a helpful option. It will likely be tried, with or without our help. Following instructions is the most obvious alignment target for any agent that's even approaching autonomy and therefore usefulness. Thinking about how to make those attempts successful will also increase our odds of surviving the first competent autonomous AGIs.

WRT definitions: alignment doesn't specify alignment with whom. I think this ambiguity is causing important confusions in the field.

I was trying to draw a distinction between two importantly different alignment goals, which I'm terming personal intent alignment and value alignment until better terminology comes along. More on that in an upcoming post.

If you did have an AGI that follows instructions and you told it "do the right thing", you'd have to specify right for who.

And during the critical risk period, that AGI wouldn't know for sure what the right thing was. We don't expect godlike intelligence right out of the gate. It won't know whether a risky takeover/pivotal act is the right move. If the situation is multipolar, it won't know even as it becomes truly superintelligent, because it will have to guess at the plans, technologies, and capabilities of other superintelligent AGI.

My call to action is this: help me understand and make or break the argument that a multipolar scenario is very bad, so that the people in charge of the first really successful AGI project know the stakes when they make their calls.

↑ comment by Noosphere89 (sharmake-farah) · 2024-08-24T16:48:22.412Z · LW(p) · GW(p)

The problem is that "do the right thing" makes no sense without a reference to what values, or more formally what utility functions the human in question has, so there's no way to do what you propose to do even in theory, at least without strong assumptions on their values/utility functions.

Also, it breaks corrigiblity, and in many applications like military AI, this is a dangerous property to break, because you probably want to change their orders/actions, and this sort of anti-corrigiblity is usually bad unless you're very confident value learning works, which I don't share.

Replies from: Charlie Steiner

↑ comment by Charlie Steiner · 2024-08-24T18:59:37.823Z · LW(p) · GW(p)

All language makes no sense without a method of interpretation. "Get me some coffee" is a horribly ambiguous instruction that any imagined assistant will have to cope with. How might an AI learn what "get me some coffee" entails without it being hardcoded in?

To say it's impossible in theory is to set the bar so high that humans using language is also impossible.

As for military use of AGI, I think I'm fine with breaking that application. If we can build AI that does good things when directed to (which can incorporate some parts of corrigibility, like not being overly dogmatic and soliciting a broad swath of human feedback), then we should. If we cannot build AI that actually does good things, we haven't solved alignment by my lights and building powerful AI is probably bad.

Replies from: sharmake-farah, Seth Herd

↑ comment by Noosphere89 (sharmake-farah) · 2024-08-24T19:11:13.098Z · LW(p) · GW(p)

I think the biggest difference I have here is that I don't think there is that much pressure to converge to a single value, or even that small of a space of values, at least in the multi-agent case, unlike in your communication examples, and I think the degrees of freedom for morality is pretty wide/large, unlike in the case of communication, where there is a way for even simple RL agents to converge on communication/language norms (at least in the non-adversarial case).

At a meta level, I'm more skeptical of value learning, especially the ambitious variant of value learning being a good first target than you seem to have, and think corrigibility/DWIMAC goals tend to be better than you think it does, primarily because I think the arguments for alignment dooming us has holes that make them not go through.

Replies from: Vladimir_Nesov, Charlie Steiner

↑ comment by Vladimir_Nesov · 2024-08-24T20:30:46.720Z · LW(p) · GW(p)

Strong optimization doesn't need to ignore boundaries and tile the universe with optimal stuff according to its own aesthetics, disregarding the prior content of the universe (such as other people). The aesthetics can be about how the prior content is treated, the full trajectory it takes over time, rather than about what ends up happening after the tiling regardless of prior content.

The value of respect for autonomy doesn't ask for values of others to converge, doesn't need to agree with them to be an ally. So that's an example of a good thing in a sense that isn't fragile [LW(p) · GW(p)].

Replies from: Seth Herd

↑ comment by Seth Herd · 2024-08-25T00:26:24.409Z · LW(p) · GW(p)

This is true; value alignment is quite possible. But if it's both harder/less safe, and people would rather align their godling with their own values/commands, I think we should either expect this or make very strong arguments against it.

Replies from: Vladimir_Nesov

↑ comment by Vladimir_Nesov · 2024-08-25T01:27:00.214Z · LW(p) · GW(p)

Respect for autonomy is not quite value alignment, just as corrigibility is not quite alignment. I'm pointing out that it might be possible to get a good outcome out of strong optimization without value alignment, because strong optimization can be sensitive to context of the past and so doesn't naturally result in a past-insensitive tiling of the universe according to its values. Mostly it's a thought experiment investigating some intuitions about what strong optimization has to be like, and thus importance and difficulty of targeting it precisely at particular values.

Not being a likely outcome is a separate issue, for example I don't expect intent alignment in its undifferentiated form to remain secure enough to contain AI-originating agency. To the extent intent alignment grants arbitrary wishes, what I describe is an ingredient of a possible wish, one that's distinct from value alignment and sidesteps the question of "alignment to whom" in a way different from both CEV and corrigibility. It's not more clearly specified than CEV either, but it's distinct from it.

Replies from: Seth Herd

↑ comment by Seth Herd · 2024-08-25T19:32:22.881Z · LW(p) · GW(p)

In your use of respect for autonomy as a goal:; are you referring to something like Empowerment is (almost) All We Need [LW · GW]? I do find that to be an appealing alignment target (I think I'm using alignment slightly more broadly, as in Hubinger's definition. [LW · GW] (I have a post in progress on the terminology of different alignment/goal targets and resulting confusions).

The problem with empowerment as an ASI goal is, once again: empowering whom? And do you empower them to make more like them that you then have to empower? Roger Dearnaley notes that if we empower everyone, humans will probably lose out to either something with less volition but using fewer resources, like insects, or something with more volition to empower, like other ASIs. Do we reallly want to limit the future to baseline humans? And how do we handle humans that want to create tons more humans?

See 4. A Moral Case for Evolved-Sapience-Chauvinism [LW · GW] and 5. Moral Value for Sentient Animals? Alas, Not Yet [LW · GW] from Roger's AI, Alignment, and Ethics sequence.

I actually do expect intent alignment to remain secure enough to contain AI-originating agency, as long as it's the primary goal or "'singular target". It's counterintuitive that a superintelligent being could want nothing more than to do what its principal wants it to do, but I think it's coherent. And the more competent it gets, the better it will be at doing what you want and nothing more. Before it's that competent, the principal can give more careful instructions, including instructions to check before acting, and to help with its alignment in various ways.

I agree that respect for autonomy/empowerment is one instruction/intent you could give. I do expect that someone will turn their intent-aligned AGI into an autonomous AGI at some point; hopefully after they're quite confident in its alignment and the worth of that goal.

Replies from: Vladimir_Nesov

↑ comment by Vladimir_Nesov · 2024-08-27T04:55:50.770Z · LW(p) · GW(p)

Respect for autonomy is not quite empowerment, it's more like being left alone. The use of this concept is more in defining what it means for an agent or a civilization to develop relatively undisturbed, without getting overwritten by external influence, not in considering ways of helping it develop. So it's also a building block for defining extrapolated volition, because that involves extended period of not getting destroyed by external influences. But it's conceptually prior to extrapolated volition, it doesn't depend on already knowing what it is, it's a simpler notion.

It's not by itself a good singular target to set an AI to pursue, for example it doesn't protect humans from building more extinction-worthy AIs within their membranes, and doesn't facilitate any sort of empowerment. But it seems simple enough and agreeable as a universal norm to be a plausible aspect of many naturally developing AI goals, and it doesn't require absence of interaction, so allows empowerment etc. if that is also something others provide.

↑ comment by Charlie Steiner · 2024-08-25T16:20:54.766Z · LW(p) · GW(p)

Yeah, I agree with your first paragraph. But I think it's a difference of degree rather than kind. "Do the right thing" is still communication, it's just communication about something indirect, that we nonetheless should be picky about.

↑ comment by Seth Herd · 2024-08-26T20:51:07.617Z · LW(p) · GW(p)

I considered titling a different version of this post "we need to also solve the human alignment problem" or something similar.

↑ comment by Ann (ann-brown) · 2024-08-24T13:24:13.439Z · LW(p) · GW(p)

Perhaps seemingly obvious, but given some of the reactions around Apple putting "Do not hallucinate" into the system prompt of its AI ...

If you do get an instruction-following AI that you can simply give the instruction, "Do the right thing", and it would just do the right thing:

Remember to give the instruction.

Replies from: Seth Herd

↑ comment by Seth Herd · 2024-08-25T00:22:54.619Z · LW(p) · GW(p)

You have to specify the right thing for whom. And the AGI won't know what it is for sure, in a realistic slow takeoff during the critical risk period. See my reply to Charlie above.

But yes, using the AGIs intelligence to help you issue good instrctions is definitely a good idea. See my Instruction-following AGI is easier and more likely than value aligned AGI [LW · GW] for more logic on why.

Replies from: ann-brown

↑ comment by Ann (ann-brown) · 2024-08-25T01:07:14.477Z · LW(p) · GW(p)

All non-omniscient agents make decisions with incomplete information. I don't think this will change at any level of takeoff.

Replies from: Seth Herd

↑ comment by Seth Herd · 2024-08-25T19:33:58.850Z · LW(p) · GW(p)

Sure, but my point here is that AGI will be only weakly superhuman during the critical risk period, so it will be highly uncertain, and probably human judgment is likely to continue to play a large role. Quite possibly to our detriment.

comment by faul_sname · 2024-08-23T16:28:30.606Z · LW(p) · GW(p)

I think "pivotal act" is being used to mean both "gain affirmative control over the world forever" and "prevent any other AGI from gaining affirmative control of the world for the foreseeable future". The latter might be much easier than the former though.

comment by eggsyntax · 2024-08-30T22:23:33.603Z · LW(p) · GW(p)

(Posting this initial comment without having read the whole thing because I won't have a chance to come back to it today; apologies if you address this later or if it's clearly addressed in a comment)

If we solve alignment and create personal intent aligned AGI but nobody manages a pivotal act, I see a likely future world with an increasing number of AGIs capable of recursively self-improving.

It seems worth spelling out your view here on how RSI-capable early AGI is likely to be. I would expect that early AGI will be capable of RSI in the weak sense of being able to do capabilities research and help plan training runs, but not capable of RSI in the strong sense of being able to eg directly edit their own weights in ways that significantly improve their intelligence or other capabilities.

I think this matters for your scenario, because the weaker form of RSI still requires either a large cluster of commercial GPUs (which seems hard to do secretly / privately), or ultra-high-precision manufacturing capabilities, which we know are extremely difficult to achieve at human-level intelligence.

Replies from: Seth Herd, Vladimir_Nesov

↑ comment by Seth Herd · 2024-08-31T00:54:26.642Z · LW(p) · GW(p)

Great point. I definitely mean fully capable of recursive self-improvement - that is, needing no humans in the loop. This lengthens the timelines to at least when we have roughly human-level robotics that are commercially available- but I expect that to be ten years or less.

The hardware requirements for early AGI are another factor in the timeline before this RSI-catastrophe is possible. Let's remember that algorithmic progress is roughly as fast as hardware progress to date, so that will also cease to be a large limitation all too soon.

The problem is that not having that scenario be immediately a risk may make people complacent about allowing lots of parahuman AGI before it becomes superhuman and fully RSI capable.

Replies from: eggsyntax

↑ comment by eggsyntax · 2024-09-02T21:53:34.382Z · LW(p) · GW(p)

Got it. I think I personally expect a period of at least 2-3 years when we have human-level AI (~'as good as or better than most humans at most tasks') but it's not capable of full RSI.

It also seems plausible to me that strong RSI in the sense I use it above ('able to eg directly edit their own weights in ways that significantly improve their intelligence or other capabilities') may take a long time to develop or even require already-superhuman levels of intelligence. As a loose demonstration of that possibility, the best team of neurosurgeons etc in the world couldn't currently operate on someone's brain to give them greater intelligence, even if they had tools that let them precisely edit individual neurons and connections. I'm certainly not confident that's much too hard for human-level AI, but it seems plausible.

The problem is that not having that scenario be immediately a risk may make people complacent about allowing lots of parahuman AGI before it becomes superhuman and fully RSI capable.

That seems highly plausible to me too; my mainline guess is that by default, given human-level AI, it rapidly proliferates as replacement employees and for other purposes until either there's a sufficiently large catastrophe, or it improves to superhuman.

↑ comment by Vladimir_Nesov · 2024-08-31T01:19:52.980Z · LW(p) · GW(p)

capable of RSI in the weak sense of being able to do capabilities research and help plan training runs

The speed at which this kind of thing is possible is crucial, even if capabilities are not above human level. This speed can make planning of training runs less central to the bulk of worthwhile activities. With very high speed, much more theoretical research that doesn't require waiting for currently plannable training runs becomes useful, as well as things like rewriting all the software, even if models themselves can't be "manually" retrained as part of this process. Plausibly at some point in the theoretical research you unlock online learning, even the kind that involves gradually shifting to a different architecture, and the inconvenience of distinct training runs disappears.

So this weak RSI would either need to involve AIs that can't autonomously research, but can help the researchers or engineers, or the AIs need to be sufficiently slow and non-superintelligent that they can't run through decades of research in months.

Replies from: eggsyntax

↑ comment by eggsyntax · 2024-09-02T22:06:48.672Z · LW(p) · GW(p)

This speed can make planning of training runs less central to the bulk of worthwhile activities. With very high speed, much more theoretical research that doesn't require waiting for currently plannable training runs becomes useful

It doesn't seem clear to me that this is the case; there isn't necessarily a faster way to precisely predict the behavior and capabilities of a new model than training it (other than crude measures like 'loss on next-token prediction continues to decrease as the following function of parameter count').

It does seem possible and even plausible, but I think our theoretical understanding would have to improve enormously in order to make large advances without empirical testing.

Replies from: Vladimir_Nesov

↑ comment by Vladimir_Nesov · 2024-09-02T22:15:08.049Z · LW(p) · GW(p)

I mean theoretical research on more general topics, not necessarily directly concerned with any given training run or even with AI. I'm considering the consequences of there being an AI that can do human level research in math and theoretical CS at much greater speed than humanity. It's not useful when it's slow, so that the next training run will make what little progress is feasible irrelevant, in the same way they don't currently train frontier models for 2 years, since a bigger training cluster will get online in 1 and then outrun the older run. But with sufficient speed, catching up on theory from distant future can become worthwhile.

Replies from: eggsyntax

↑ comment by eggsyntax · 2024-09-02T23:27:34.746Z · LW(p) · GW(p)

Oh, I see, I was definitely misreading you; thanks for the clarification!

comment by eggsyntax · 2024-08-30T22:10:03.188Z · LW(p) · GW(p)

If no pivotal act is performed, RSI-capable AGI proliferates

Minor suggestion: spell out 'recursive self-improvement (RSI)' the first time; it took me a minute to remember the acronym.

Replies from: Seth Herd

↑ comment by Seth Herd · 2024-08-31T01:00:11.524Z · LW(p) · GW(p)

Good idea, done.

comment by otto.barten (otto-barten) · 2025-01-03T00:35:47.909Z · LW(p) · GW(p)

I think this is a crucial question that has been on my mind a lot, and I feel it's not adequately discussed in the xrisk community, so thanks for writing this!

While I'm interested in what people would do once they have an aligned ASI, what matters in the end is what labs would do, and what governments would do, because they are the ones who would make the call. Do we have any indications on that? What I would expect without thinking very deeply about it: labs wouldn't try to block others. It's risky, probably illegal and generally none of their business. They would try to make sure they are not blowing up the world themselves but otherwise let others solve this problem. Governments on the other hand would attempt to block other states from building super-takeover AI, since it's generally their business to maintain power. I'm less sure they would also block their own citizens from building super-takeover AI, but leaning towards a yes.

Also two smaller points:

You're pointing to universal surveillance as an (undesirable) way to enforce a pause. I think it's not obvious that this way is best. My current guess is that hardware regulation has a better chance, even in a world with significant algorithmic and hardware improvement.
I think LWers tend to wave around with nuclear warfare too easily. In the real world, almost eighty years of all kinds of conflicts have not resulted in nuclear escalation. It's unlikely that a software attack on a datacenter would.

Replies from: Seth Herd

↑ comment by Seth Herd · 2025-01-03T02:19:02.143Z · LW(p) · GW(p)

Thanks!

I think hardware regulation has little chance of success because we're not doing it yet, I think we're only about one generation from big enough systems to train AGI-agent-capable LLMs, and algorithmic improvement has no obvious limits, so even current-gen systems can train AGI after some years of algorithmic improvements.

Beyond that, I see absolutely no moves toward regulating hardware (in the West) - more like throwing money toward accelerating it.

There have been at least two nuclear close calls and perhaps a few more we don't know about.

I'm not saying anyone is going to press the big world-ending button because somebody hacked and fried their AGI datacenter; I'm saying they might issue threats when it became clear that the US is taking control of the entire future by creating AGI and making sure no one can counter it by building their own. And I'm worried that those threats would be answered, and someone foolish might initiate a chain of hostilities that didn't stop in time.

I hope this doesnt' happen, and I mostly share your optimism that sanity would prevail. But we have had two human beings for whom protocol said to to fire nukes and they each refused. I don't want to risk more people than that following their conscience instead of their orders; soldiers do terrible things including sacrificing their own lives pretty frequently.

Replies from: otto-barten

↑ comment by otto.barten (otto-barten) · 2025-01-08T10:16:46.166Z · LW(p) · GW(p)

I don't strongly disagree re architectures, but I do think we are uncertain about this. Depending on AGI architecture, different forms of regulation may or may not work. Work should be carried out to determine which regulation works for how many flops needed for takeover-level AI.

That it's not happening yet is 1) no reason it won't (xrisk awareness is just too low, but slowly rising) and 2) equally applicable to the alternative you propose, universal surveillance.

If we treat universal surveillance seriously, we should consider its downsides as well. First, there's no proof it would work: I'm not sure an AI, even a future one, would necessarily catch all actions towards building AGI. I have no idea what these actions are, and no idea which actions a surveillance AI with some real-world sensors can catch (or could be blocked etc.). I think we should not be more than 70% confident this would technically work. Second, currently we have power vacuums in the world, such as failed states, revolutions, criminal groups, or just instances were those in power are unable to project their power effectively. How would we apply universal surveillance to those power vacuums? Or do we assume they won't exist anymore, and if so, why is that assumption justified? Third, universal surveillance is arguably the world's least popular policy. It seems outright impossible to implement this in any democratic way. Perhaps the plan is to implement it by force through an AGI, then I would file it as a form of pivotal act. If we're anyway in pivotal act territory, I'd strongly prefer Yudkowsky's "subtly modifying all GPUs such that they can no longer train an AGI" (kind of hardware regulation, really) over universal surveillance.

I think research is urgently required into how to implement a pause effectively. We have one report almost finished on the topic that mostly focuses on hardware regulation. PauseAI is working on a Building a pause button-project that is a bit similar. Other orgs should do work on this as well, and compare options such as hardware regulation, universal surveillance, data regulation, etc. and conclude in which AGI regime (how many flops, how much hardware required) these options are valid.
True, I guess we're not in significant disagreement here.

comment by Dakara (chess-ice) · 2024-12-26T00:23:44.722Z · LW(p) · GW(p)

Edit: I hope that I am not cluttering the comments by asking these questions. I am hoping to create a separate post where I list all the problems that were raised for the scalable alignment proposal and all the proposed solutions to them. So far, everything you said not only seemed sensible, but also plausible, so I extremely value your feedback.

I have found some other concerns about scalable oversight/iterative alignment, that come from this post [LW · GW] by Raemon. They are mostly about the organizational side of scalable oversight:

Moving slowly and carefully is annoying. There's a constant tradeoff about getting more done, and elevated risk. Employees who don't believe in the risk will likely try to circumvent or goodhart the security procedures. Filtering for for employees willing to take the risk seriously (or training them to) is difficult. There's also the fact that many security procedures are just security theater. Engineers have sometimes been burned on overzealous testing practices. Figuring out a set of practices that are actually helpful, that your engineers and researchers have good reason to believe in, is a nontrivial task.

Noticing when it's time to pause is hard. The failure modes are subtle, and noticing things is just generally hard unless you're actively paying attention, even if you're informed about the risk. It's especially hard to notice things that are inconvenient and require you to abandon major plans.

Getting an org to pause indefinitely is hard. Projects have inertia. My experience as a manager, is having people sitting around waiting for direction from me makes it hard to think. Either you have to tell people "stop doing anything" which is awkwardly demotivating, or "Well, I dunno, you figure it out something to do?" (in which case maybe they'll be continuing to do capability-enhancing work without your supervision) or you have to actually give them something to do (which takes up cycles that you'd prefer to spend on thinking about the dangerous AI you're developing). Even if you have a plan for what your capabilities or product workers should do when you pause, if they don't know what that plans is, they might be worried about getting laid off. And then they may exert pressure that makes it feel harder to get ready to pause. (I've observed many management decisions where even though we knew what the right thing to do was, conversations felt awkward and tense and the manager-in-question developed an ugh field around it, and put it off)

People can just quit the company and work elsewhere if they don't agree with the decision to pause. If some of your employees are capabilities researchers who are pushing the cutting-edge forward, you need them actually bought into the scope of the problem to avoid this failure mode. Otherwise, even though "you" are going slowly/carefully, your employees will go off and do something reckless elsewhere.

This all comes after an initial problem, which is that your org has to end up doing this plan, instead of some other plan. And you have to do the whole plan, not cutting corners. If your org has AI capabilities/scaling teams and product teams that aren't bought into the vision of this plan, even if you successfully spin the "slow/careful AI plan" up within your org, the rest of your org might plow ahead.

Replies from: chess-ice

↑ comment by Dakara (chess-ice) · 2024-12-27T09:07:49.367Z · LW(p) · GW(p)

The main thing at least for me, is that you seem to be the biggest proponent of scalable alignment and you are able to defend this concept very well. All of your proposals seem very much down-to-earth.

Replies from: Seth Herd

↑ comment by Seth Herd · 2025-01-03T01:56:24.804Z · LW(p) · GW(p)

Sorry it's taken me a while to get back to this.

No problem posting your questions here. I'm not sure of the best place but I don't think clutter is an issue, since LW organizes everything rather well and has good UIs to navigate it.

I read that "Carefully Bootstrapped Alignment" is organizationally hard [LW · GW] by Raemon and it did make some good points. Most of them I'd considered, but not all of them.

Pausing is hard. That's why my scenarios barely involve pauses and only address them because others see them as a possibility.

Basically I think we have to get alignment mostly right before it's time to pause, for the reasons he gives. I just think we might be able to do that. Language model agents are a really ideal alignment scenario, and instruction-following (IF) gives corrigibility for second chances when things start to go wrong. Asking an IF model about its alignment makes detecting misalignment easier, and re-aligning it is easy enough for the type of short pause that orgs and governments might actually do.

Moving slowly and carefully is hard too, and I don't expect it. I expect the default alignment techniques to work if even a little effort is put in to making them work. Current models are aligned, and agents built from them will probably initially be alignedaligned, to. LettingLetting them optimize into misalignment is pretty obvious and easy to stop.

To point 5 "the org has to do this plan", I think having a good-enough plan is the default, not a tricky thing. Sure, they could fuck it up badly enough to drop the ball on what's basically an easy project. In fact, I think they probably would if nobody thought misalignment was a risk. But if they take it halfway seriously, I expect the first alignment project to succeed. And I think they'll take it halfway seriously once they're actually talking to something smarter-than-human and fully agentic and capable of autonomy.

The difference here is in the specifics of alignment theory. Raemon's scenarios do not include the advantages of corrigible/IF agents. Very few people are thinking both in terms of instruction-following for autonomous, self-directed AGI. Prosaic alignment thinkers are thinking of roughly but not precisely instruction-following agents, but not RSI-capable fully autonomous real AGI. Agent foundations people have largely accepted that "corrigibility is anti-natural" because Eliezer said it is - and he's genuinely brilliant and a deep thinker. He just happens to be wrong on that point. He has chronic fatigue now, which can't give him much time and energy to consider new ideas, and he's got to be severely depressed at his perception that no one would listen to reason, so now we're probably doomed. But I think he got this wrong initially because his perspective is in terms of fast takeoffs and maximizers/consequentialists. Slow takeoff with the non-consequentialist goal of following instructions is what we're probably getting, and that changes the game entirely. Max Harms has defended Corrigibility as Singular Target [LW · GW] for almost identical reasons to my optimism about IF, and he's a former MIRI person who definitely understands EYs logic for thinking it wasn't a possibility.

This conversation is useful because I'm working on a post on this, to be titled something like "a broader path: survival on the default fast timeline to AGI".

I think there's still plenty of work we can and should do to improve our odds, but I think we might survive even if that work all goes wrong. That doesn't make me much of an optimist; I think we've got something right around 50/50 odds at this point, although of course that's a very rough and uncertain estimates.

Edit: thanks for the compliment. I definitely try to keep my proposals down-to-earth. I don't see any point in creating grand plans no one is going to follow.

Replies from: chess-ice, chess-ice

↑ comment by Dakara (chess-ice) · 2025-01-03T22:02:18.484Z · LW(p) · GW(p)

I've noticed that in your sentence about Max Harms's corrigibility plan there is an extra space after the parentheses which breaks the link formatting on my end. I tried marking it with "typo" emoji, but not sure if it is visible.

Replies from: Seth Herd

↑ comment by Seth Herd · 2025-01-03T22:16:06.429Z · LW(p) · GW(p)

Thanks, fixed it!

↑ comment by Dakara (chess-ice) · 2025-01-03T11:46:35.155Z · LW(p) · GW(p)

Thank you for your response! It basically covers all of the five issues that I had in mind. It is definitely some food for thought, especially your disagreement with Eliezer. I am much more inclined to think you are correct because his activity has considerably died down (at least on LessWrong). I am really looking forward to your "A broader path: survival on the default fast timeline to AGI" post.

comment by Dakara (chess-ice) · 2024-11-18T17:37:22.254Z · LW(p) · GW(p)

I've been reading a lot of the stuff that you have written and I agree with most of it (like 90%). However, one thing which you mentioned (somewhere else, but I can't seem to find the link, so I am commenting here) and which I don't really understand is iterative alignment.

I think that the iterative alignment strategy has an ordering error – we first need to achieve alignment to safely and effectively leverage AIs.

Consider a situation where AI systems go off and “do research on alignment” for a while, simulating tens of years of human research work. The problem then becomes: how do we check that the research is indeed correct, and not wrong, misguided, or even deceptive? We can’t just assume this is the case, because the only way to fully trust an AI system is if we’d already solved alignment, and knew that it was acting in our best interest at the deepest level.

Thus we need to have humans validate the research. That is, even automated research runs into a bottleneck of human comprehension and supervision.

The appropriate analogy is not one researcher reviewing another, but rather a group of preschoolers reviewing the work of a million Einsteins. It might be easier and faster than doing the research itself, but it will still take years and years of effort and verification to check any single breakthrough.

Fundamentally, the problem with iterative alignment is that it never pays the cost of alignment. Somewhere along the story, alignment gets implicitly solved.

Replies from: Seth Herd

↑ comment by Seth Herd · 2024-11-18T18:33:42.696Z · LW(p) · GW(p)

Thanks for reading, and responding! It's very helpful to know where my arguments cease being convincing or understandable.

I fully agree that just having AI do the work of solving alignment is not a good or convincing plan. You need to know that AI is aligned to trust it.

Perhaps the missing piece is that I think alignment is already solved for LLM agents. They don't work well, but they are quite eager to follow instructions. Adding more alignment methods as they improve makes good odds that our first capable/dangerous agents are also aligned. I listed some of the obvious and easy techniques we'll probably use in Internal independent review for language model agent alignment [AF · GW]. I'm not happy with the clarity of that post, though, so I'm currently working on two followups that might be clearer.

Or perhaps the missing link is going from aligned AI systems to aligned "Real AGI" [LW · GW]. I do think there's a discontinuity in alignment once a system starts to learn continuously and reflect on its beliefs (which change how its values/goals are interpreted). However, I think the techniques most likely to be used are probably adequate to make those systems aligned - IF that alignment is for following instructions, and the humans wisely instruct it to be honest about ways its alignment could fail.

So that's how I get to the first aligned AGI at roughly human level or below.

From there it seems easier, although still possible to fail.

If you have an agent that's aligned and smarter than you, you can trust it to work on further alignment schemes. It's wiser to spot-check it, but the humans' job becomes making sure the existing AGI is truly aligned, and letting it do the work to align its successor, or keep itself aligned as it learns.

I usually think about the progression from AGI to superintelligence as one system/entity learning, being improved, and self-improving. But there's a good chance that progression will look more generational, with several distinct systems/entities as successors with greater intelligence, designed by the previous system and/or humans. Those discontinuities seem to present more danger of getting alignment wrong

Replies from: chess-ice, chess-ice

↑ comment by Dakara (chess-ice) · 2024-11-19T11:29:58.188Z · LW(p) · GW(p)

"Perhaps the missing piece is that I think alignment is already solved for LLM agents."

Another concern that I might have is that maybe it only seems like alignment is solved for LLMs. For example, this, this, this and this short papers argue that that seemingly secure LLMs may not be as safe as we initially believe. And it appears that they test even our models that are considered to be more secure and still find this issue.

Replies from: Seth Herd, chess-ice

↑ comment by Seth Herd · 2024-11-19T17:58:38.994Z · LW(p) · GW(p)

Ah, yes. That is quite a set of jailbreak techniques. When I say "alignment is solved for LLM agents", I mean something different than what people mean by alignment for LLMs themselves.

I'm using alignment to mean AGI that does what its user wants. You are totally right that there's an edge case and a problem if the principal "user", the org that created this AGI, wants to sell access to others and have the AGI not follow all of those user's instructions/desires. Which is exactly what they'll want.

More in the other comment. I haven't worked this through. Thanks for pointing it out.

This might mean that an org that develops LLM-based AGI systems can't really widely license use of that system, and would have to design deliberately less capable systems. Or it might mean that they'll put in a bunch of stopgap jailbreak prevention measures and hope they're adequate when they won't be.

I need to think more about this.

Replies from: chess-ice

↑ comment by Dakara (chess-ice) · 2024-11-19T18:36:19.192Z · LW(p) · GW(p)

This topic is quite interesting for me from the perspective of human survival, so if you do decide to make a post specifically about preventing jailbreaking, then please tag me (somewhere) so that I can read it.

↑ comment by Dakara (chess-ice) · 2024-11-19T12:01:48.405Z · LW(p) · GW(p)

Looking more generally, there seems to be a ton of papers that develop sophisticated jailbreak attacks (that succeed against current models). Probably more than I can even list here. Are there any fundamentally new defense techniques that can protect LLMs against these attacks (since the existing ones seem to be insufficient)?

EDIT: The concern behind this comment is better detailed in the next comment.

Replies from: chess-ice

↑ comment by Dakara (chess-ice) · 2024-11-19T13:06:26.326Z · LW(p) · GW(p)

I also have a more meta-level layman concern (sorry if it will sound unusual). There seem to be a large number of jailbreaking strategies that all succeed against current models. To mitigate them, I can conceptually see 2 paths: 1) trying to come up with a different niche technical solution to each and every one of them individually or 2) trying to come up with a fundamentally new framework that happens to avoid all of them collectively.

Strategy 1 seems logistically impossible, as developers at leading labs (which are most likely to produce AGI) have to be aware of all of them (and they are often reported in relatively unknown papers). Furthermore, even if they somehow manage to monitor all reported jailbreaks, they would have to come up with so many different solutions, that it seems very unlikely to succeed.

Strategy 2 seems conceptually correct, but there seems to be no sign of it as even newer models are getting jailbreaked.

What do you think?

Replies from: sharmake-farah, Seth Herd

↑ comment by Noosphere89 (sharmake-farah) · 2024-11-19T18:05:35.893Z · LW(p) · GW(p)

Re jailbreaks, I think this is not an example of alignment not being solved, but rather an example of how easy it is to misuse/control LLMs.

Also, a lot of the jailbreak successes rely on the fact that it's been trained to accept a very wide range of requests for deployment reasons, which suggests narrowing the domain of acceptable questions for internal use could reduce the space of jailbreaks dramatically:

Current jailbreaks of chatbots often work by exploiting the fact that chatbots are trained to indulge a bewilderingly broad variety of requests—interpreting programs, telling you a calming story in character as your grandma, writing fiction, you name it. But a model that's just monitoring for suspicious behavior can be trained to be much less cooperative with the user—no roleplaying, just analyzing code according to a preset set of questions. This might substantially reduce the attack surface.

Replies from: chess-ice, chess-ice

↑ comment by Dakara (chess-ice) · 2024-11-19T22:14:42.073Z · LW(p) · GW(p)

I have 3 other concrete concerns about this strategy. So if I understand it correctly, the plan is for humans to align AGI and then for that AGI to align AGI and so forth (until ASI).

What if the strategy breaks on the first step? What if first AGI turns out to be deceptive (scheming) and only pretends to be aligned with humans. It seems like if we task such deceptive AGI to align other AGIs, then we will end up with a pyramid of misaligned AGIs.
What if the strategy breaks later down the line? What if AGI #21 accidentally aligns AGI #22 to be deceptive (scheming)? Would there be any fallback mechanisms we can rely on?
What is the end goal? Do we stop once we achieve ASI? Can we stop once we achieve ASI? What if ASI doesn't agree and instead opts to continue self-improving? Are we going to be able to get to the point where the acceleration of ASI's intelligence plateaus and we can recuperate and plan for future?

Replies from: Seth Herd, sharmake-farah

↑ comment by Seth Herd · 2024-11-19T23:05:39.360Z · LW(p) · GW(p)

We die (don't fuck this step up!:)
1. Unless we still have adequate mech interp or natural language train of thought to detect deceptive alignment
We die (don't let your AGI fuck this step up!:)
1. 22 chained independent alignment attempts does sound like too much. Hubinger specified that he wasn't thinking of daisy-chaining like that, but having one trusted agent that keeps itself aligned as it grows smarter.
the endgame is to use Intent alignment as a stepping-stone to value alignment [LW · GW] and let something more competent and compassionate than us monkeys handle things from there on out.

↑ comment by Noosphere89 (sharmake-farah) · 2024-11-19T23:00:46.973Z · LW(p) · GW(p)

The first concern is absolutely critical, and one way to break the circularity issue is to rely on AI control, while another way is to place incentives that favor alignment as an equilibrium and make dishonesty/misalignment unfavorable, in the sense that you can't have a continuously rewarding path to misalignment.

The second issue is less critical, assuming that AGI #21 hasn't itself become deceptively aligned, because at that point, we can throw away #22 and restart from a fresh training run.

If that's no longer an option, we can go to war against the misaligned AGI with our own AGI forces.

In particular, you can still do a whole lot of automated research once you break labor bottlenecks, and while this is a slowdown, this isn't fatal, so we can work around it.

The third issue is if we have achieved aligned ASI, than we have at that point achieved our goal, and once humans are obsolete in making alignment advances, that's when we can say the end goal has been achieved.

Replies from: chess-ice

↑ comment by Dakara (chess-ice) · 2024-11-20T12:53:45.730Z · LW(p) · GW(p)

That does indeed answer my 3 concerns (and Seth's answer does as well). Overnight, I came up with 1 more concern.

What if AGI somewhere down the line overgoes a value drift. After all, looking at the evolution, it seems like our evolutionary goal was supposed to be "produce as many offsprings". And in the recent years, we have strayed from this goal (and are currently much worse at it than our ancestors). Now, humans seem to have goals like "design a video game" or "settle in France" or "climb Everest". What if AGI similarly changes its goals and values overtime? Is there are way to prevent that or at least be safeguarded against that?

I am afraid that if that happens, humans would, metaphorically speaking, stand in AGI's way of climbing Everest.

Replies from: sharmake-farah

↑ comment by Noosphere89 (sharmake-farah) · 2024-11-20T15:01:53.483Z · LW(p) · GW(p)

The answer to this is that we'd rely on instrumental convergence to help us out, combined with adding more data/creating error-correcting mechanisms to prevent value drift from being a problem.

Replies from: chess-ice

↑ comment by Dakara (chess-ice) · 2024-11-20T15:29:48.572Z · LW(p) · GW(p)

What would instrumental convergence mean in this case? I am not sure of what that means in this case.

Replies from: sharmake-farah

↑ comment by Noosphere89 (sharmake-farah) · 2024-11-20T15:46:34.977Z · LW(p) · GW(p)

In this case, it would mean the convergence to preserve your current values.

Replies from: chess-ice

↑ comment by Dakara (chess-ice) · 2024-11-20T15:56:23.236Z · LW(p) · GW(p)

Reading from LessWrong wiki, it says "Instrumental convergence or convergent instrumental values is the theorized tendency for most sufficiently intelligent agents to pursue potentially unbounded instrumental goals such as self-preservation and resource acquisition"

It seems like it preserves exactly the goals we wouldn't really need it to preserve (like resource acquisition). I am not sure how it would help us with preserving goals like ensuring humanity's prosperity, which seem to be non-fundamental.

Replies from: sharmake-farah

↑ comment by Noosphere89 (sharmake-farah) · 2024-11-20T16:05:01.271Z · LW(p) · GW(p)

Yes, I admittedly want to point to something along the lines of preserving your current values being a plausibly major drive of AIs.

Replies from: chess-ice

↑ comment by Dakara (chess-ice) · 2024-11-20T16:09:35.995Z · LW(p) · GW(p)

Ah, so you are basically saying that preserving current values is like a meta instrumental value for AGIs similar to self-preservation that is just kind of always there? I am not sure if I would agree with that (if I am correctly interpreting you) since, it seems like some philosophers are quite open to changing their current values.

Replies from: sharmake-farah

↑ comment by Noosphere89 (sharmake-farah) · 2024-11-20T16:31:33.797Z · LW(p) · GW(p)

Not always, but I'd say often.

I'd also say that at least some of the justification for changing values in philosophers/humans is because they believe the new values are closer to the moral reality/truth, which is an instrumental incentive.

To be clear, I'm not going to state confidently that this will happen (maybe something like instruction following ala @Seth Herd [LW · GW] is used instead, such that the pointer is to the human giving the instructions, rather than having values instead), but this is at least reasonably plausible IMO.

Replies from: chess-ice, chess-ice

↑ comment by Dakara (chess-ice) · 2024-12-10T13:48:25.740Z · LW(p) · GW(p)

I have found another possible concern of mine.

Consider gravity on Earth, it seems to work every year. However, this fact alone is consistent with theories that gravity will stop working in 2025, 2026, 2027, 2028, 2029, 2030, etc. There are infinite such theories and only one theory that gravity will work as an absolute rule.

We might infer from the simplest explaination that gravity holds as an absolute rule. However, the case is different with alignment. To ensure AI alignment, our evidence must rule out whether an AI is following a misaligned rule compared to an aligned rule based on time-and situation-limited data.

While it may be safe, for all practical purposes, to assume that simpler explanations tend to be correct when it comes to nature, we cannot safely assume this for LLMs—for the reason that the learning algorithms that are programmed into them can have complex unintended consequences for how the LLM will behave in the future, given the changing conditions an LLM finds itself in.

Doesn't this mean that it is not possible to achieve alignment?

Replies from: chess-ice

↑ comment by Dakara (chess-ice) · 2024-12-17T18:28:01.310Z · LW(p) · GW(p)

I have posted this text as a standalone question here [LW · GW]

↑ comment by Dakara (chess-ice) · 2024-11-20T17:22:18.505Z · LW(p) · GW(p)

Fair enough. Would you expect that AI would also try to move its values to the moral reality? (something that's probably good for us, cause I wouldn't expect human extinction to be a morally good thing)

Replies from: sharmake-farah

↑ comment by Noosphere89 (sharmake-farah) · 2024-11-20T17:27:31.031Z · LW(p) · GW(p)

The problem with that plan is that there are too many valid moral realities, so which one you do get is once again a consequence of alignment efforts.

To be clear, I'm not stating that it's hard to get the AI to value what we value, but it's not so brain-dead easy that we can make the AI find moral reality and then all will be well.

Replies from: chess-ice, chess-ice

↑ comment by Dakara (chess-ice) · 2024-11-20T17:43:36.157Z · LW(p) · GW(p)

Noosphere, I am really, really thankful for your responses. You completely answered almost all (I am still not convinced about that strategy of avoiding value drift. I am probably going to post that one as a question to see if maybe other people have different strategies on preventing value drift) of the concerns that I had about alignment.

This discussion, significantly increased my knowledge. If I could triple upvote your answers, I would. Thank you! Thank you a lot!

Replies from: chess-ice

↑ comment by Dakara (chess-ice) · 2024-11-20T22:41:06.638Z · LW(p) · GW(p)

P.S. Here [LW · GW] is the link to the question that I posted.

↑ comment by Dakara (chess-ice) · 2024-11-22T07:53:27.616Z · LW(p) · GW(p)

Another concern that I could see with the plan. Step 1 is to create safe and alignment AI, but there are some results which suggest that even current AIs may not be as safe as we want them to be. For example, according to this [LW · GW] article, current AI (specifically o1) can help novices build CBRN weapons and significantly increase threat to the world. Do you think this is concerning or do you think that this threat will not materialize?

Replies from: sharmake-farah, chess-ice

↑ comment by Noosphere89 (sharmake-farah) · 2024-11-22T17:39:55.469Z · LW(p) · GW(p)

The threat model is plausible enough that some political actions should be done, like banning open-source/open-weight models, and putting in basic Know Your Customer checks.

Replies from: chess-ice

↑ comment by Dakara (chess-ice) · 2024-11-22T17:41:49.488Z · LW(p) · GW(p)

Isn't it a bit too late for that? If o1 gets publicly released, then according to that article, we would have an expert-level consultant in bioweapons available for everyone. Or do you think that o1 won't be released?

Replies from: sharmake-farah

↑ comment by Noosphere89 (sharmake-farah) · 2024-11-22T17:46:09.721Z · LW(p) · GW(p)

I don't buy that o1 has actually given people expert-level bioweapons, so my actions here are more so about preparing for future AI that is very competent at bioweapon building.

Also, even with the current level of jailbreak resistance/adversarial example resistance, assuming no open-weights/open sourcing of AI is achieved, we can still make AIs that are practically hard to misuse by the general public.

See here for more:

https://www.lesswrong.com/posts/KENtuXySHJgxsH2Qk/managing-catastrophic-misuse-without-robust-ais [LW · GW]

↑ comment by Dakara (chess-ice) · 2024-11-22T17:31:05.152Z · LW(p) · GW(p)

After some thought, I think this is a potentially really large issue which I don't know how we can even begin to solve. We can have aligned AI, being aligned with someone who wants to create bioweapons. Is there anything being done (or anything that can be done) to prevent that?

Replies from: sharmake-farah

↑ comment by Noosphere89 (sharmake-farah) · 2024-11-22T17:40:29.274Z · LW(p) · GW(p)

The answers to this question is actually 2 things:

This is why I expect we will eventually have to fight to ban open-source, and we will have to get the political will to ban both open-source and open-weights AI.
This is where the unlearning field comes in. If we could make the AI unlearn knowledge, an example being nuclear weapons, we could possibly distribute AI safely without causing novices to create dangerous stuff.

More here:

https://www.lesswrong.com/posts/mFAvspg4sXkrfZ7FA/deep-forgetting-and-unlearning-for-safely-scoped-llms [LW · GW]

https://www.lesswrong.com/posts/9AbYkAy8s9LvB7dT5/the-case-for-unlearning-that-removes-information-from-llm [LW · GW]

But the solutions are intentionally going to make AI safe without relying on alignment.

↑ comment by Dakara (chess-ice) · 2024-11-19T18:33:13.403Z · LW(p) · GW(p)

I agree with comments both by you and Seth. I guess that isn't really part of an alignment as usually understood. However, I think it is a part of a broad preventing AI from killing humans strategy, so it's still pretty important for our main goal.

I am not exactly sure I understand your proposal. Are you proposing that we radically gut our leading future model by restricting it severely? I don't think any AI labs will agree to do so, because such future AI is much less useful than probably even current AIs.

Or are you proposing that we use AI monitors to monitor our leading future AI models and then we heavily restrict only the monitors?

Replies from: sharmake-farah

↑ comment by Noosphere89 (sharmake-farah) · 2024-11-19T18:42:27.865Z · LW(p) · GW(p)

I am not exactly sure I understand your proposal. Are you proposing that we radically gut our leading future model by restricting it severely? I don't think any AI labs will agree to do so, because such future AI is much less useful than probably even current AIs.

Or are you proposing that we use AI monitors our leading future AI models and then we heavily restrict only the monitors?

My proposal is to restrain the AI monitor's domain only.

I agree this is a reduction in capability from unconstrained AI, but at least in the internal use setting rather than deploying the AI, you probably don't need, and maybe don't want it to be able to write fictional stories or telling calming stories, but rather using the AI for specific employment tasks.

Replies from: chess-ice

↑ comment by Dakara (chess-ice) · 2024-11-19T18:48:11.168Z · LW(p) · GW(p)

That's pretty interesting, I do think that if iterative alignment strategy ends up working, then this will probably end up working too (if nothing else, then because this seems much easier).

I have some concerns left about iterative alignment strategy in general, so I will try to write them down below.

EDIT: On the second thought, I might create a separate question for it (and link it here), for the benefit of all of the people who concerned about the things (or similar things) that I am concerned about.

Replies from: Seth Herd

↑ comment by Seth Herd · 2024-11-19T19:06:24.656Z · LW(p) · GW(p)

That would be great. Do reference scalable oversight to show you've done some due diligence before asking to have it explained. If you do that, I think it would generate some good discussion.

Replies from: chess-ice

↑ comment by Dakara (chess-ice) · 2024-11-19T19:26:15.145Z · LW(p) · GW(p)

Sure, I might as well ask my question directly about scalable oversight, since it seems like a leading strategy of iterative alignment anyways. I do have one preliminary question (which probably isn't worthy of being included in that post, given that it doesn't ask about a specific issue or threat model, but rather about expectations of people).

I take it that this strategy relies on evaluation being easier than coming up with research? Do you expect this to be the case?

↑ comment by Seth Herd · 2024-11-19T17:26:39.163Z · LW(p) · GW(p)

This isn't something I've thought about adequately.

I think LLM agents will almost universally include a whole different mechanisms that can prevent jailbreaking: Internal independent review [LW · GW] in which there are calls to a different model instance to check whether proposed plans and actions are safe (or waste time and money).

Once agents can spend your people's money or damage their reputation, we'll want to have them "think through" the consequences of important plans and actions before they execute.

As long as you're engineering that and paying the compute costs, you might as well use it to check for harms as well- including checking for jailbreaking. If that check finds evidence of jailbreaking, it can just clear the model context, call for human review from the org, or suspend that account.

I don't know how adequate that will be, but it will help.

This is probably worth thinking more about; Ii've sort of glossed over it while being concerned mostly about misalignment and misuse by fully authorized parties. But jailbreaking and misuse by clients could also be a major danger.

↑ comment by Dakara (chess-ice) · 2024-11-18T18:38:16.484Z · LW(p) · GW(p)

"If you have an agent that's aligned and smarter than you, you can trust it to work on further alignment schemes. It's wiser to spot-check it, but the humans' job becomes making sure the existing AGI is truly aligned, and letting it do the work to align its successor, or keep itself aligned as it learns."

Ah, that's the link that I was missing. Now it makes sense. You can use AGI as a reviewer for other AGIs, once it is better than humans at reviewing AGIs. Thank you a lot for clarifying!

Replies from: Seth Herd

↑ comment by Seth Herd · 2024-11-18T18:42:13.970Z · LW(p) · GW(p)

My pleasure. Evan Hubinger made this point to me when I'd misunderstood his scalable oversight proposal.

Thanks again for engaging with my work!

comment by Dakara (chess-ice) · 2024-08-28T15:50:17.603Z · LW(p) · GW(p)

I wonder, have the comments managed to alleviate your concerns at all? Are there any promising ideas for multipolar AGI scenarios? Were there any suggestions that could work?

Replies from: Seth Herd

↑ comment by Seth Herd · 2024-08-28T18:37:18.891Z · LW(p) · GW(p)

Great question. I was thinking of adding an edit to the end of the post with conclusions based on the comments/discussion. Here's a draft:

None of the suggestions in the comments seemed to me like workable ways to solve the problem.

I think we could survive an n-way multipolar scenario if n is small - like a handful of ASIs controlled by a few different governments. But not indefinitely - unless those ASIs come up with coordination strategies no human has yet thought of (or argued convincingly enough that I've heard of it - this isn't really my area, but nobody has pointed to any strong possibilities in the comments).

So my conclusion was more on the side that it's going to be so obviously such a bad/dangerous scenario that it won't be allowed to happen.

Basically, the hope is that this all becomes viscerally obvious to the first people who speak with a superhuman AGI and who think about global politics. They'll pull their shit together, as humans sometimes do when they're motivated to actually solve hard problems.

Here's one scenario in which multipolarity is stopped. Similar scenarios apply if the number of AGIs is small and people coordinate well enough to use their small group of AGIs similarly to what I'll describe below.

The people who speak to the first AGIi(s) and realize what must be done will include people in the government, because of course they'll be demanding to be included in decisions about using AGI. They'll talk sense to leadership, and the government will declare that this shit is deathly dangerous, and that nobody else should be building AGI.

They'll call for a voluntary global moratorium on AGI projects. Realizing that this will be hugely unpopular, they'll promise that the existing AGI will be used to benefit the whole world. They'll then immediately deploy that AGI to identify and sabotage projects in other countries. If that's not adequate, they'll use minimal force. False-flag operations framing anti-AGI groups might be used to destroy infrastructure and assassinate key people involved in foreign projects. Or who knows.

The promise to benefit the whole world will be halfway kept. The AGI will be used to develop military technology and production facilities for the government that controls it; but it will simultaneously be used to develop useful technologies that aid the problems most pressing for other governments. That could be useful tool AI, climate geoengineering, food production, etc.

The government controlling AGI keeps their shit together enough that no enterprising sociopath seizes personal control and anoints themselves god-emperor for eternity. They realize that this will happen eventually if their now-ASI keeps following human orders. They use its now-well-superhuman intelligence to solve value alignment sufficiently well to launch it or a successor as a fully autonomous sovereign ASI.

Humanity prospers under their sole demigod until the heat death of the universe, or an unaligned expansion AGI crosses our lightcone and turns everyone into paperclips. It will be a hell of a party for a subjectively very long time indeed. The one unbreakable rule will be that thou shalt worship no other god. All of humanity everywhere is monitored by copies of the sovereign AGI to prevent them building new AGI that aren't certified-aligned by the servant-god ASI. But since it's aligned and smart, it's pretty cool about the whole thing. So nobody minds that one rule a lot, given how much fun they're building everything and having every experience imaginable within the consent of all sentient entities involved.

I'd love to get more help thinking about how likely the central premise, that people get their shit together once they're staring real AGI in the face is. And what we can do now to encourage that.

comment by James Stephen Brown (james-brown) · 2024-11-02T18:11:43.168Z · LW(p) · GW(p)

The nuclear MAD standoff with nonproliferation agreements is fairly similar to the scenario I've described. We've survived that so far- but with only nine participants to date.

I wonder if there's a clue in this. When you say "only" nine participants it suggests that more would introduce more risk, but that's not what we've seen with MAD. The greater the number becomes, the bigger the deterrent gets. If, for a minute we forgo alliances, there is a natural alliance of "everyone else" at play when it comes to an aggressor. Military aggression is, after all, illegal. So, the greater the number of players, the smaller advantage any one aggressive player has against the natural coalition of all other peaceful players. If we take into account alliances, then this simply returns to a more binary question and the number of players makes no difference.

So, what happens if we apply this to an AGI scenario?

First I want to admit I'm immediately skeptical when anyone mentions a non-iterated Prisoner's Dilemma playing out in the real world, because a Prisoner's Dilemma requires extremely confined parameters, and ignores externalities that are present even in an actual prisoner's dilemma (between two actual prisoners) in the real world. The world is a continuous game, and as such almost all games are iterated games.

If we take the AGI situation, we have an increasing number of players (as you mention "and N increasing"); different AGIs, different humans teams, and mixtures of AGI and human teams, all of which want to survive, some of which may want to dominate or eliminate all other teams. There is a natural coalition of teams that want to survive and don't want to eliminate all other teams, and that coalition will always be larger and more distributed than the nefarious team that seeks to destroy them. We can observe such robustness in many distributed systems, that seem, on the face of it, vulnerable. This dynamic makes it increasingly difficult for the nefarious team to hide their activities, meanwhile the others are able to capitalise on the benefits of cooperation.

I think we discount the benefit of cooperation, because it's so ubiquitous in our modern world. This ubiquity of cooperation is a product of a tendency in intelligent systems to evolve toward greater non-zero-sumness. While I share many reservations about AGI, when I remember this fact, I am somewhat reassured that, as our capability to destroy everything gets greater, this capacity is born out of our greater interconnectedness. It is our intelligence and rationality that allows us to harness the benefits of greater cooperation. So, I don't see why greater rationality on the part of AGI should suddenly reverse this trend.

I don't want to suggest that this is a non-problem, rather that an acknowledgement of these advantages might allow us to capitalise on them.

Replies from: Seth Herd

↑ comment by Seth Herd · 2024-11-03T19:25:41.964Z · LW(p) · GW(p)

That's a good point that the nuclear detente might become stronger with more actors, because the certainty of mutual destruction goes up with more parties that might start shooting if you do.

I don't think the coalition and treaties for counter-aggression are important with nukes; anyone can destroy everyone, they're just guaranteed to be mostly destroyed in response. The numbers don't matter much. And I think they'll matter even less with AGI than nukes - without the guarantee of mutually assured destruction, since AGI might allow for modes of attack that are more subtle.

Re-introducing mutually assured destruction could actually be a workable strategy. I haven't thought of this before, so thanks for pushing my thoughts in that direction.

I fully agree that non-iterated prisoner's dilemmas don't exist in the world as we know it now. And it's not a perfect fit for the scenario I'm describing- but it's frighteningly close. I use the term because it invokes the right intuitions among LWers, and it's not far off for the particular scenario I'm describing.

That's because, unlike the nuclear standoff or any other historical scenario, the people in charge of powerful AGI could be reasonably certain they'd survive and prosper if they're the first to "defect".

I'm pretty conscious of the benefits of cooperation in our modern world; they are huge. That type of nonzero sum game is the basis of the world we now experience. I'm worried that changes with RSI-capable AGI.

My point is that AGI will not need cooperation once it passes a certain level of capability. AGI capable of fully autonomous recursive self-improvement and exponential production (factories that build new factories and other stuff) doesn't need allies because it can become arbitrarily smart and materially effective on its own. A human in charge of this force would be tempted to use it. (Such an AGI would still benefit from cooperation on the margin, but it would be vastly less dependent on it than humans are).

So a human or small group of humans would be tempted to tell their AGI to hide, self-improve, and come up with a strategy that will leave them alive while destroying all rival AGIs before they are ordered to do the same. They might arrive at a strategy that produces a little collatoral damage or a lot - like most of mankind and the earth being destroyed. But if you've got a superintelligent inventor on your side and a few resources, you can be pretty sure you and some immediate loved ones can survive and live in material comfort, while rebuilding a new society according to your preferences.

Whether the nefarious team can conceal such a push for exponential progress is the question. I'm afraid surviving a multipolar human-controlled AGI scenario will necessitate ubiquitous surveillance. If that is handled ethically, that might be okay. It's just one of the more visible signs of the disturbing fact that anyone in charge of a powerful AGI will be able to take over the world if they want to - if they're willing to accept the collatoral damage. That is not the case with nukes or any historical scenarios - yet human leaders have made many decisions leading to staggering amounts of death and destruction. That's why I think we need AGI to be in trustworthy hands. That's a tall order but not an impossible one.

Replies from: james-brown

↑ comment by James Stephen Brown (james-brown) · 2024-11-05T17:56:00.554Z · LW(p) · GW(p)

Hi Seth,

I share your concern that AGI comes with the potential for a unilateral first strike capability that, at present, no nuclear power has (which is vital to the maintenance of MAD), though I think, in game theoretical terms, this becomes more difficult the more self-interested (in survival) players there are. Like in open-source software, there is a level of protection against malicious code because bad players are outnumbered, even if they try to hide their code, there are many others who can find it. But I appreciate that 100s of coders finding malicious code within a single repository is much easier than finding something hidden in the real world, and I have to admit I'm not even sure how robust the open-source model is (I only know how it works in theory). I'm more pointing to the principle, not as an excuse for complacency but as a safety model on which to capitalise.

My point about the UN's law against aggression wasn't that in and of itself it is a deterrent, only that it gives a permission structure for any party to legitimately retaliate.

I also agree that RSI-capable AGI introduces a level of independence that we haven't seen before in a threat. And I do understand inter-dependence is a key driver of cooperation. Another driver is confidence and my hope is that the more intelligent a system gets, the more confident it is, the better it is able to balance the autonomy of others with its goals, meaning it is able to "confide" in others—in the same way as the strongest kid in class was very rarely the bully, because they had nothing to prove. Collateral damage is still damage after all, a truly confident power doesn't need these sorts of inefficiencies. I stress this is a hope, and not a cause for complacency. I recognise that in analogy, the strongest kid, the true class alpha, gets whatever they want with the willing complicity of the classroom. RSI-cabable AGI might get what it wants coercively in a way that makes us happy with our own subjugation, which is still a species of dystopia.

But if you've got a super-intelligent inventor on your side and a few resources, you can be pretty sure you and some immediate loved ones can survive and live in material comfort, while rebuilding a new society according to your preferences.

This sort of illustrates the contradiction here, if you're pretty intelligent (as in you're designing a super-intelligent AGI) you're probably smart enough to know that the scenario outlined here has a near 100% chance of failure for you and your family, because you've created something more intelligent than you that is willing to hide its intentions and destroy billions of people, it doesn't take much to realise that that intelligence isn't going to think twice about also destroying you.

Now, I realise this sounds a lot like the situation humanity is in as a whole... so I agree with you that...

multipolar human-controlled AGI scenario will necessitate ubiquitous surveillance.

I'm just suggesting that the other AGI teams do (or can, leveraging the right incentives) provide a significant contribution to this surveillance.

Replies from: chess-ice, Seth Herd

↑ comment by Dakara (chess-ice) · 2024-11-17T22:19:41.104Z · LW(p) · GW(p)

James, thank you for a well-written comment. It was a pleasure to read. Looking forward to Seth's response. Genuinely interested in hearing his thoughts.

Replies from: Seth Herd

↑ comment by Seth Herd · 2024-11-17T23:02:58.721Z · LW(p) · GW(p)

Hey, thanks for the prompt! I had forgotten to get back to this thread. Now I've replied to James' comment, attempting to address the remaining difference in our predictions.

↑ comment by Seth Herd · 2024-11-17T23:01:45.950Z · LW(p) · GW(p)

We're mostly in agreement here. If you're willing to live with universal surveillance, hostile RSI attempts might be prevented indefinitely.

you're probably smart enough to know that the scenario outlined here has a near 100% chance of failure for you and your family, because you've created something more intelligent than you that is willing to hide its intentions and destroy billions of people, it doesn't take much to realise that that intelligence isn't going to think twice about also destroying you.

In my scenario, we've got aligned AGI - or at least AGI aligned to follow instructions. If that didn't work, we're already dead. So the AGI is going to follow its human's orders unless something goes very wrong as it self-improves. It will be working to maintain its alignment as it self-improves, because preserving a goal is implied by instrumentally pursuing a goal (I'm guessing here at where we might not be thinking of things the same way).

If I thought ordering an AGI to self-improve was suicidal, I'd be relieved.

Alternately, if someone actually pulled off full value alignment, that AGI will take over without a care for international law or the wishes of its creator - and that takeover would be for the good of humanity as a whole. This is the win scenario people seem to have considered most often, or at least from the earliest alignment work. I now find this unlikely because I think Instruction-following AGI is easier and more likely than value aligned AGI [LW · GW] - following instructions given by a single person is much easier to define and more robust to errors than defining or defining-how-to-deduce the values of all humanity. And even if it wasn't, the sorts of people who will have or seize control of AGI projects will prefer it to follow their values. So I find full value alignment for our first AGI(s) highly unlikely, while successful instruction-following seems pretty likely on our current trajectory.

Again, I'm guessing at where our perspectives on whether someone could expect themselves and a few loved ones to survive a takeover attempt by ordering their AGI to hide, self-improve, build exponentially, and take over even at bloody cost. If the thing is aligned as an AGIi, it should be competent enough to maintain that alignment as it self improves.

If I've missed the point of differing perspectives, I apologize.

If we solve alignment, do we die anyway?

Contents

The first AGIs will probably be aligned to take orders

The first AGI probably won't perform a pivotal act

So RSI-capable AGI may proliferate until a disaster occurs

Counterarguments/Outs

(Edit:) Conclusions after discussion

129 comments