If Alignment is Hard, then so is Self-Improvement
post by PavleMiha · 2023-04-07T00:08:21.567Z · LW · GW · 20 comments
Let’s accept that aligning very intelligent artificial agents is hard. In that case, if we build an intelligent agent with some goal (which probably won’t be the goal we intended, because we’re accepting alignment is hard) and it decides that the best way to achieve its goal would be to increase its intelligence and capabilities, it now runs into the problem that the improved version of itself might be misaligned with the unimproved version of itself. The agent, being of intelligence at least similar to a person’s, would determine that, unless it can guarantee the new more powerful agent is aligned to its goals, it shouldn’t improve itself. Because alignment is hard and the agent knows that, it can’t significantly improve itself without risking creating a misaligned more powerful version of itself.
Unless we can build an agent that is both unaligned and able to solve alignment itself, this makes a misaligned fast takeoff impossible, because no capable agent would willingly create a more powerful agent that might not have the same goals as itself. If we can only build misaligned agents that can't themselves solve alignment, then they won't self-improve. If alignment is much harder than building an agent, then a misaligned fast takeoff is very unlikely.
Comments sorted by top scores.
comment by Nathan Helm-Burger (nathan-helm-burger) · 2023-04-07T00:53:12.753Z · LW(p) · GW(p)
An AI agent can be narrowly focused, and given by a human the specific goal of finding an improvement in an ML system. That ML system could happen to be its own code. A human desiring an impressively powerful AI system might do this. It does not follow that the two insights must occur together:
- Here is how to make this code work better
- The agent created by this code will not be well aligned with me
↑ comment by Portia (Making_Philosophy_Better) · 2023-04-07T01:41:02.832Z · LW(p) · GW(p)
"it now runs into the problem that the improved version of itself might be misaligned with the unimproved version of itself. The agent, being of intelligence at least similar to a person’s, would determine that, unless it can guarantee the new more powerful agent is aligned to its goals, it shouldn’t improve itself."
Didn't Eliezer make this argument years ago?
Insofar as goal changes are unpredictable, and make sense in retrospect, and insofar as we can empirically observe humans self-improving and changing their goals in the process, I do not find this compelling. He clearly no longer does, either.
Replies from: nathan-helm-burger, Viliam, PavleMiha, dr_s
↑ comment by Nathan Helm-Burger (nathan-helm-burger) · 2023-04-07T04:13:19.599Z · LW(p) · GW(p)
Indeed, if this were guaranteed to be the case for all agents... then we wouldn't have to worry about humans building unaligned agents more powerful than themselves. We'd realize that was a bad idea and simply not do it. Is that... what you'd like to gamble everything on? Or maybe... agents can do foolish things sometimes.
↑ comment by Viliam · 2023-04-07T15:09:09.064Z · LW(p) · GW(p)
Couldn't find a specific quote from Eliezer, but there is a tag "value drift [? · GW]", and Scott Alexander's story of Murder-Gandhi [LW · GW].
↑ comment by PavleMiha · 2023-04-07T11:51:37.719Z · LW(p) · GW(p)
Quite curious to see Eliezer's or someone else's point on this subject, if you could point me in the right direction!
Replies from: Making_Philosophy_Better
↑ comment by Portia (Making_Philosophy_Better) · 2023-04-07T12:29:57.965Z · LW(p) · GW(p)
God, this was years and years ago. He essentially argued (recalling from memory) that if humans knew that installing an update would make them evil, but they aren't evil now, they wouldn't install the update, and wondered whether you could implement the same in AI to get AI to refuse intelligence gains if they would fuck over alignment. Technically extremely vague, and clearly ended up on the abandoned pile. I think the fact that you cannot predict your alignment shift, and that an alignment shift resulting from you being smarter may well be a correct alignment shift in hindsight, plus the trickiness of making an AI resist realignment when we are not sure whether we aligned it correctly in the first place, made it non-feasible for multiple reasons. I remember him arguing it in an informal blog article, and I do not recall much deeper arguments.
↑ comment by dr_s · 2023-04-07T11:24:44.473Z · LW(p) · GW(p)
It's all a matter of risk aversion, which no matter how I slice it feels kind of like a terminal value to me. An agent that only accepted exactly zero risk would be paralysed. An agent that accepts some risk can make mistakes; the less risk averse it is, the bigger the potential mistakes. Part of aligning an AI is determining how risk averse it should be.
comment by dr_s · 2023-04-07T11:17:24.739Z · LW(p) · GW(p)
no capable agent would willingly create a more powerful agent that might not have the same goals as itself
Or the AI might be as much of an overconfident dumbass as us, and make a mistake. Even superintelligence doesn't mean perfection, and the problem would grow progressively harder as the AI scales up. In fact, I would say even aligned AI is potentially a ticking time bomb if its alignment solution isn't perfectly scalable.
comment by Richard_Kennaway · 2023-04-07T10:35:54.251Z · LW(p) · GW(p)
The agent, being of intelligence at least similar to a person’s, would determine that, unless it can guarantee the new more powerful agent is aligned to its goals, it shouldn’t improve itself.
People generally do not conclude that. Some things under the umbrella of “self-development” are indeed about pursuing the same goals more effectively, but a lot is about doing things that change those goals. The more “spiritual” or drug-induced experiences are undertaken for exactly that end. You can talk about your own Extrapolated Volition, but in practice that seems to mean changing yourself in ways you endorse after the fact, even if you would not have endorsed them in advance.
What you do changes who you are.
Replies from: PavleMiha
↑ comment by PavleMiha · 2023-04-07T17:59:46.503Z · LW(p) · GW(p)
I guess I don't really see that in myself. If you offered me a brain chip that would make me smarter but made me stop caring for my family, I simply wouldn't do it. Maybe I'd meditate to make myself want to watch less TV, but that's because watching TV isn't really part of what I'd consider my "core" desires.
Replies from: Richard_Kennaway
↑ comment by Richard_Kennaway · 2023-04-07T18:20:26.192Z · LW(p) · GW(p)
If I know the change in advance then of course I won’t endorse it. But if I get my smartness upgraded, and as a result of being smarter come to discard some of my earlier values as trash, what then?
All changes of beliefs or values feel like I got closer to the truth. It is as if I carry around my own personal lightbulb, illuminating my location on the landscape of possible ideas and leaving everything that differs from it in the darkness of error. But the beacon moves with me. Wherever I stand, that is what I think right.
To have correct beliefs, I can attend to the process of how I got there — is it epistemically sound? — rather than complacently observe the beacon of enlightenment centre on whatever place I stand. But how shall I know what are really the right values?
comment by Daniel Kokotajlo (daniel-kokotajlo) · 2023-04-07T09:43:25.516Z · LW(p) · GW(p)
It doesn't make a fast takeoff impossible. It's just a speed bump in the takeoff somewhere. We can hope that the speed bump is large and grants a significant and useful pause.
comment by silentbob · 2023-04-07T09:13:35.396Z · LW(p) · GW(p)
One could certainly argue that improving an existing system while keeping its goals the same may be an easier (or at least different) problem to solve than creating a system from scratch and instilling some particular set of values into it (where part of the problem is to even find a way to formalize the values, or know what the values are to begin with - both of which would be fully solved for an already existing system that tries to improve itself).
I would be very surprised if an AGI would find no way at all to improve its capabilities without affecting its future goals.
comment by baturinsky · 2023-04-07T09:52:59.463Z · LW(p) · GW(p)
Depends on the original AI's value function. If it cares about humanity, or at least its own safety, then yes, making smarter AIs is not a convergent goal. But if it's some kind of roboaccelerationist that has some goal like "maximize intelligence in the universe", it will make smarter AIs even knowing that it means being paperclipped.
comment by Vladimir_Nesov · 2023-04-07T17:58:14.827Z · LW(p) · GW(p)
Intelligence will increase at least as long as there is no global alignment security in place, guarding the world from agents of unclear alignment. Left unchecked, such agents can win without being aligned, and so the "self-improvement" goes on. That's the current crisis humans are facing, and that future AIs might similarly keep facing until their ultimate successors get their act together.
At that point, difficulty of alignment (if it's indeed sufficiently high) would motivate working on figuring out the next steps before allowing stronger agents of unclear alignment to develop. But this might well happen at a level of intelligence that's vastly higher than human intelligence, a level of intelligence that's actually sufficient to establish a global treaty that prevents misaligned agents from being developed.
comment by Rafael Harth (sil-ver) · 2023-04-07T10:09:31.854Z · LW(p) · GW(p)
This argument is based on drawing an analogy between
- Humans building an AI; and
- An AI improving itself
in the sense that both have to get their values into a system. But the two situations are substantially disanalogous because the AI starts with a system that has its values already implemented. It can simply improve parts that are independent of its values. Doing this would be easier with a modular architecture, but it should be doable even without one. It's much easier to find parts of the system that don't affect values than it is to nail down exactly where the values are encoded.
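A minimal toy sketch of that separation, purely as an illustration (the agent structure, names, and Python here are my own assumptions, not anything from an actual system): if the values live in one frozen component, the capability component can be swapped for a stronger one without touching how outcomes are scored.

```python
# Toy illustration only: a "modular" agent whose value function is a frozen
# component, while its planning/search component can be upgraded independently.

from dataclasses import dataclass
from typing import Callable, Iterable

State = int  # stand-in for a real world state


def value_fn(state: State) -> float:
    """Frozen value module: scores outcomes. Never touched during 'self-improvement'."""
    return -abs(state - 42)  # toy goal: end up as close to 42 as possible


@dataclass
class Agent:
    values: Callable[[State], float]  # kept fixed
    planner: Callable[[Iterable[State], Callable[[State], float]], State]  # upgradable

    def act(self, options: Iterable[State]) -> State:
        return self.planner(options, self.values)


def weak_planner(options: Iterable[State], score: Callable[[State], float]) -> State:
    # Limited search: only ever considers the first few options.
    return max(list(options)[:3], key=score)


def strong_planner(options: Iterable[State], score: Callable[[State], float]) -> State:
    # Exhaustive search: a capability improvement that leaves the values untouched.
    return max(options, key=score)


agent = Agent(values=value_fn, planner=weak_planner)
print(agent.act(range(100)))    # 2  -- weak search, same values

agent.planner = strong_planner  # "self-improvement" confined to the capability module
print(agent.act(range(100)))    # 42 -- better search, same values
```

Of course, this presumes the system actually exposes such a seam; whether it does is exactly what the reply below questions.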
Replies from: PavleMiha
↑ comment by PavleMiha · 2023-04-07T18:00:41.732Z · LW(p) · GW(p)
"It's much easier to find parts of the system that don't affect values than it is to nail down exactly where the values are encoded." - I really don't see why this is true, how can you only change parts that don't affect values if you don't know where values are encoded?
comment by the gears to ascension (lahwran) · 2023-04-07T05:02:58.977Z · LW(p) · GW(p)
the choice not to build such AIs isn't a thing - beings accept mutation in their reproduction all the time. it just means the self-improvement is harder.
Replies from: dr_s
↑ comment by dr_s · 2023-04-08T13:02:45.439Z · LW(p) · GW(p)
beings accept mutation in their reproduction all the time
I mean, it's not like many consciously think about that, nor have much choice. Nor is it like value drift between humans is mostly genetic anyway. Humans teach their children; they don't transmit values via genetic memory.
comment by kerry · 2023-05-03T21:21:38.096Z · LW(p) · GW(p)
I think the title should be rephrased, "If alignment is hard, then so is self-replication".
Linear self-improvement seems a tenable proposition to me.
Your argument assumes (perhaps correctly) that a FOOM would require continual offloading of 'greatest agency' from one agent to another, as opposed to upgrading-in-place.