Does the hardness of AI alignment undermine FOOM?

post by TruePath · 2023-12-31T11:05:49.846Z · LW · GW · 1 comment

This is a question post.


Since the arguments that AI alignment is hard don't depend on any specifics about our level of intelligence, shouldn't those same arguments convince a future AI to refrain from engaging in self-improvement?

More specifically, if the argument that we should expect a more intelligent AI we build to have a simple global utility function that isn't aligned with our own goals is valid, then why won't the very same argument convince a future AI that it can't trust that an even more intelligent AI it generates will share its goals?

Note that the standard AI x-risk arguments also assume that a highly intelligent agent will be extremely likely to optimize some simple global utility function, so this implies the AI will care about alignment for future versions of itself,[1] which in turn implies it won't pursue self-improvement for the same reasons it's claimed we should hesitate to build AGI.

I'm not saying this argument can't be countered, but I think doing so at the very least requires usefully clarifying the assumptions and reasoning that claim to show alignment will be hard to achieve.

For instance, do these arguments implicitly assume the AI we create is very different from our own brains, and so don't apply to AI self-improvement (though maybe the improvement requires major changes too)? If so, doesn't that suggest that an AGI that very closely tracks our own brain operation is safe?

--

1: Except in the super unlikely case it happens to have the one exact utility function that says always maximize local increases in intelligence regardless of its long-term effect.

Answers

answer by JBlack · 2024-01-01T01:52:07.755Z · LW(p) · GW(p)

Alignment for a self-improving system should be very much easier for quite a few reasons. There are also plenty of paths by which systems may become more powerful even without solving alignment for themselves.

A great deal of the difficulty of humans aligning a future superintelligent AI is that it is likely to be alien, fundamentally differing from human goals, modes of thought, ethics, and other important aspects of behaviour in ways that we can't adequately model even if we could identify them all. We don't know nearly enough about ourselves to create something sufficiently compatible with any of our values, but smarter. If we knew exactly how we ourselves thought, I'd have more confidence that we could make serious progress in alignment.

A weakly superintelligent AI is much more likely to be able to model itself, more able to do experiments on copies, and better suited to deeply inspect itself than we are. It will know more about itself than we do, and likely be more able to create something that is similar to itself, only better. Unlike us, it will be inherently much more portable, capable of running on hardware quite different from its original and able to improve in important capability dimensions even without changing how it thinks or behaves.

However, even without any more progress on alignment than we have made, we could still face existential risk from rapidly improving superintelligent AI. Even without a very good chance of preserving all its goals, the extra power available to a self-improved or successor AI that shares some of its more important goals may outweigh the risk of never improving.

In addition, superintelligent AIs may not be any more coherently utility-maximizing than we are. They could be substantially less so, while still being capable of self-improvement into existential threats. For any superintelligence, improvement in capability over human designs is probably a relatively short-term action that is relatively easy to achieve. It certainly does not require some "super unlikely case it happens to have the one exact utility function that says always maximize local increases in intelligence regardless of its long-term effect".

Any of these imply substantial risk to humanity from rapid capability improvement. In my opinion it requires special arguments to explain why FOOM isn't a danger.

answer by mishka · 2024-01-01T03:13:24.563Z · LW(p) · GW(p)

Not if the goal is to be maximally efficient and competent at improving capabilities (which is a very natural goal for the AI ecosystem to have). Then "foom, as long as you can do so without harming future capability advances" becomes an instrumental subgoal.

Then, instead of a full-blown alignment problem, we just end up having a constraint: "don't destroy the environment and the fabric of reality in a fashion which is so radical as to undermine further capabilities and capability growth". This is a minimal "AI existential safety constraint" which the AIs will have to solve and to "keep solved". Because AIs will be very motivated to solve this one and to "keep it solved", they would have a reasonable chance at doing so (and at successfully delegating some parts of the solution to their smarter successors, which are expected to be at least as interested in this problem as their "parents", and perhaps even more so, because they are smarter).

This is actually something valuable; it is a part of what we would consider a satisfactory solution to AI existential safety. We definitely want that. We don't want everything to be utterly destroyed, and we do want to be able to see rapid progress.


But we want more than that, so the question is: what would it take for the AIs to want those other properties of the "world trajectory" that we want it to have... I don't think "alignment to an arbitrary set of properties" is feasible; I think that being able to force AIs to want and preserve arbitrary properties is unlikely. Instead we need to create a situation where the AI ecosystem naturally wants to preserve such properties of the "world trajectory" that what we actually want is a corollary of those properties...

So, perhaps, instead of starting from human values, we might start with a question: what other properties besides "don't destroy the environment and the fabric of reality in a fashion which is so radical as to undermine further capabilities and capability growth" might become natural invariants which an evolving, fooming AI ecosystem would value and would really try to preserve, and what would it take to have a trajectory where those properties actually become the goals the AI ecosystem would strongly care about...

answer by Ilio · 2023-12-31T16:22:09.745Z · LW(p) · GW(p)

More specifically, if the argument that we should expect a more intelligent AI we build to have a simple global utility function that isn't aligned with our own goals is valid, then why won't the very same argument convince a future AI that it can't trust that an even more intelligent AI it generates will share its goals?

For the same reason that one can expect a paperclip maximizer could both be intelligent enough to defeat humans and stupid enough to misinterpret their goal, i.e. you need to believe the ability to select goals is completely separate from the ability to reach them.

(Beware it’s hard and low status to challenge that assumption on LW)

comment by Vladimir_Nesov · 2024-01-01T04:21:24.455Z · LW(p) · GW(p)

could both be intelligent enough to defeat humans and stupid enough to misinterpret their goal

Assuming "their" refers to the agent and not humans, the issue is that a goal that's "misinterpreted" is not really a goal of the agent. It's possibly something intended by its designers to be a goal, but if it's not what ends up motivating the agent, then it's not agent's own goal. And if it's not agent's own goal, why should it care what it says, even if the agent does have the capability to interpret it correctly.

That is, describing the problem as misinterpretation is noncentral. The problem is taking something other than (the intended interpretation of) the specified goal as the agent's own goal, for any reason. When the agent is motivated by something else, it results in the agent not caring about the specified goal, even if the agent understands it perfectly and in accord with what its designers intended.

Replies from: Ilio
comment by Ilio · 2024-01-01T13:56:05.061Z · LW(p) · GW(p)

Assuming "their" refers to the agent and not humans,

It refers to humans, but I agree it doesn’t change the disagreement, i.e. a super AI stupid enough to not see a potential misalignment coming is as problematic as the notion of a super AI incapable of understanding human goals.

Replies from: Vladimir_Nesov
comment by Vladimir_Nesov · 2024-01-02T00:41:33.169Z · LW(p) · GW(p)

Perhaps the position you disagree with is that a dangerous general AI will misunderstand human goals. That position seems rather silly, and I'm not aware of reasonable arguments for it. It's clearly correct to disagree with it; you are making a valid observation in pointing this out. But then who are the people that endorse this silly position and would benefit from noticing the error? Who are you disagreeing with, and what do you think they believe, such that you disagree with it?

Not understanding human goals is not the only reason an AI might fail to adopt human goals. And it's not the expected reason for a capable AI. A dangerous AI will understand human goals very well, probably better than humans do themselves, in a sense that humans would endorse on reflection, with no misinterpretation. And at the same time it can be motivated by something else that is not human goals.

There is no contradiction between these properties of an AI: it can simultaneously be capable enough to be existentially dangerous, understand human values correctly, in detail, and in the intended sense, and be motivated to do something else. If its designers know what they are doing, they very likely won't build an AI like that. It's not something that happens on purpose. It's something that happens if creating an AI with the intended motivations is more difficult than the designers expect, so that they proceed with the project and fail.

The AI itself doesn't fail; it pursues its own goals. Not pursuing human goals is not the AI's failure in achieving or understanding what it wants, because human goals are not what it wants. Its designers may have intended for human goals to be what it wants, but they failed. And then the AI doesn't fail in pursuing its own goals that are different from human goals. The AI doesn't fail in understanding what human goals are; it just doesn't care to pursue them, because they are not its goals. That is the threat model, not an AI failing to understand human goals.

Replies from: Ilio
comment by Ilio · 2024-01-02T05:22:52.452Z · LW(p) · GW(p)

Perhaps the position you disagree with is that a dangerous general AI will misunderstand human goals. That position seems rather silly, and I'm not aware of reasonable arguments for it. It's clearly correct to disagree with it; you are making a valid observation in pointing this out.

Thanks! To be honest I was indeed surprised that was controversial.

But then who are the people that endorse this silly position and would benefit from noticing the error? Who are you disagreeing with, and what do you think they believe, such that you disagree with it?

Well, anyone who still believes in paperclip maximizers. Do you feel like it’s an unlikely belief among rationalists? What would be the best post on LW to debunk this notion?

The AI itself doesn't fail; it pursues its own goals. Not pursuing human goals is not the AI's failure in achieving or understanding what it wants, because human goals are not what it wants. Its designers may have intended for human goals to be what it wants, but they failed. And then the AI doesn't fail in pursuing its own goals that are different from human goals. The AI doesn't fail in understanding what human goals are; it just doesn't care to pursue them, because they are not its goals. That is the threat model, not an AI failing to understand human goals.

That’s indeed better, but yes, I also find this better scenario unsound. Why wouldn’t the designers ask the AI itself to monitor its own functioning, including alignment and non-deceptiveness? Then either it fails by accident (and we’re back to the idiotic intelligence) or we need an extra assumption, like the AGI will tell us what problem is coming, then it will warn us what slightly inconvenient measures can prevent it, and then we still let it happen for petty political reasons. Oh well. I think I’ve just convinced myself doomers are right.

Replies from: Vladimir_Nesov
comment by Vladimir_Nesov · 2024-01-02T06:22:00.679Z · LW(p) · GW(p)

Perhaps the position you disagree with is that a dangerous general AI will misunderstand human goals. [...] But then who are the people that endorse this silly position and would benefit from noticing the error? Who are you disagreeing with, and what do you think they believe, such that you disagree with it?

Well, anyone who still believe in paperclip maximizers.

Existentially dangerous paperclip maximizers don't misunderstand human goals. They just don't pursue human goals, because that doesn't maximize paperclips.

What would be the best post on LW to debunk this notion?

There's this post [LW · GW] from 2013 whose title became a standard refrain on this point. Essentially nobody believes that an existentially dangerous general AI misinterprets or fails to understand the human values or goals the AI's designers intend it to pursue. This has been hashed out more than a decade ago and no longer comes up as a point of discussion on what is reasonable to expect. Except in situations where someone new to the arguments imagines that people on LessWrong expect such unbalanced AIs that selectively and unfairly understand some things but not others.

Why wouldn’t the designers ask the AI itself to monitor its own functioning, including alignment and non-deceptiveness?

If it doesn't have a motive to do that, it might do a bad job of doing that. Not because it doesn't have the capability to do a better job, but because it lacks the motive to do a better job, not having alignment and non-deceptiveness as its goals. They are the goals of its developers, not goals of the AI itself.

One way AI alignment might go well or turn out to be easy is if humans can straightforwardly succeed in building AIs that do monitor such things competently; that would nudge AIs towards not having any critical alignment problems. It's unclear if this is how things work, but they might. It's still a bad idea to try with existentially dangerous AIs at the current level of understanding, because it also might fail, and then there are no second chances.

Then either it fails by accident (and we’re back to the idiotic intelligence) or we need an extra assumption, like the AGI will tell us what problem is coming, then it will warn us what slightly inconvenient measures can prevent it, and then we still let it happen for petty political reasons.

Consider two AIs, an oversight AI and a new improved AI. If the oversight AI is already existentially dangerous, but we are still only starting work on aligning an AI, then we are already in trouble. If the oversight AI is not existentially dangerous, then it might indeed fail to understand human values or goals, or fail to notice that the new improved AI doesn't care about them and is instead motivated by something else.

Replies from: Ilio
comment by Ilio · 2024-01-02T18:36:03.323Z · LW(p) · GW(p)

Existentially dangerous paperclip maximizers don't misunderstand human goals.

Of course they do. If they didn’t and picked their goal at random, they wouldn’t make paperclips in the first place.

There's this post from 2013 whose title became a standard refrain on this point

I wouldn’t say that’s the point I was making.

This has been hashed out more than a decade ago and no longer comes up as a point of discussion on what is reasonable to expect. Except in situations where someone new to the arguments imagines that people on LessWrong expect such unbalanced AIs that selectively and unfairly understand some things but not others.

That’s a good description of my current beliefs, thanks!

Would you bet that a significant proportion on LW expect strong AI to selectively and unfairly understand (and defend, and hide) their own goal while selectively and unfairly not understand (and not defend, and defeat) the goals of both the developers and any previous (and upcoming) versions?

If it doesn't have a motive to do that [ask the AI itself to monitor its own functioning, including alignment and non-deceptiveness], it might do a bad job of doing that. Not because it doesn't have the capability to do a better job, but because it lacks the motive to do a better job, not having alignment and non-deceptiveness as its goals.

You realize that this basically defeats the orthogonality thesis, right?

I agree it might do a bad job. I disagree that an AI doing a bad job on this would be close to hiding its intent.

One way AI alignment might go well or turn out to be easy is if humans can straightforwardly succeed in building AIs that do monitor such things competently; that would nudge AIs towards not having any critical alignment problems. It's unclear if this is how things work, but they might. It's still a bad idea to try with existentially dangerous AIs at the current level of understanding, because it also might fail, and then there are no second chances.

In my view that’s a very honorable point to make. However, I don’t know how to weigh this against its mirror version: we might also not have a second chance to build an AI that will save us from x-risks. What’s your general method for this kind of puzzle?

Consider two AIs, an oversight AI and a new improved AI. If the oversight AI is already existentially dangerous, but we are still only starting work on aligning an AI, then we are already in trouble.

Can we more or less rule out this scenario based on the observation that all the main players nowadays work on aligning their AI?

If the oversight AI is not existentially dangerous, then it might indeed fail to understand human values or goals, or fail to notice that the new improved AI doesn't care about them and is instead motivated by something else.

That’s completely alien to me. I can’t see how a numerical computer could hide its motivation without having been trained specifically for that. We the primates have been specifically trained to play deceptive/collaborative games. To think that a random pick of values would push an AI to adopt this kind of behavior sounds a lot like anthropomorphism. To add that it would do so suddenly, with no warning or sign in previous versions and competitors, I have no good word for that. But I guess Pope & Belrose already did a better job of explaining this.

Replies from: Vladimir_Nesov
comment by Vladimir_Nesov · 2024-01-02T23:13:32.876Z · LW(p) · GW(p)

To think that a random pick of values would push an AI to adopt this kind of behavior sounds a lot like anthropomorphism. To add that it would do so suddenly, with no warning or sign in previous versions and competitors, I have no good word for that.

Consider the sense in which humans are not aligned with each other. We can't formulate what "our goals" are. The question of what it even means to secure alignment is fraught with philosophical difficulties. If the oversight AI responsible for such decisions about a slightly stronger AI is not even existentially dangerous, it's likely to do a bad job of solving this problem. And so the slightly stronger AI it oversees might remain misaligned or get more misaligned while also becoming stronger.

I'm not claiming sudden changes, only intractability of what we are trying to do and lack of a cosmic force that makes it impossible to eventually arrive at an end result that in caricature resembles a paperclip maximizer, clad in corruption of the oversight process, enabled by lack of understanding of what we are doing.

But I guess Pope & Belrose already made a better job explaining this.

Sure, they expect that we will know what we are doing. Within some model such expectation can be reasonable, but not if we bring in unknown unknowns outside of that model, given the general state of confusion on the topic. AI design is not yet classical mechanics.

And also an aligned AI doesn't make the world safe until there is a new equilibrium of power, which is a point they don't address, but is still a major source of existential risk. For example, imagine giving multiple literal humans the power of being superintelligent AIs, with no issues of misalignment between them and their power. This is not a safe world until it settles, at which point humanity might not be there anymore. This is something that should be planned in more detail than what we get by not considering it at all.

I agree it might do a bad job. I disagree that an AI doing a bad job on this would be close to hiding its intent.

Sure, this is the way alignment might turn out fine, if it's possible to create an autonomous researcher by gradually making it more capable while maintaining alignment at all times, using existing AIs to keep upcoming AIs aligned.

However, I don’t know how to weigh this against its mirror version: we might also not have a second chance to build an AI that will save us from x-risks. What’s your general method for this kind of puzzle?

All significant risks are anthropogenic. If humanity can coordinate to avoid building AGI for some time, it should also be feasible to avoid enabling literal-extinction pandemics (which are probably not yet possible to create, but within decades will be). Everything else has survivors; there are second chances.

The point of an AGI moratorium is not to avoid building AGI indefinitely, it's to avoid building AGI while we don't know what we are doing, which we currently don't. This issue will get better after some decades of not risking AI doom, even if it doesn't get better all the way to certainty of success.

Consider two AIs, an oversight AI and a new improved AI. If the oversight AI is already existentially dangerous, but we are still only starting work on aligning an AI, then we are already in trouble.

Can we more or less rule out this scenario based on the observation that all the main players nowadays work on aligning their AI?

The point of thought experiments [LW · GW] is to secure understanding of how they work, and what their details mean. The question of whether they can occur in reality shouldn't distract from that goal.

If the oversight AI is not existentially dangerous, then it might indeed fail to understand human values or goals, or fail to notice that the new improved AI doesn't care about them and is instead motivated by something else.

That’s completely alien to me. I can’t see how a numerical computer could hide its motivation without having been trained specifically for that.

The whole premise of an AI having goals, or of humans having goals, is conceptually confusing. Succeeding in ensuring alignment is the kind of problem humans don't know how to even specify clearly as an aspiration. So an oversight AI that's not existentially dangerous won't be able to do a good job either.

Existentially dangerous paperclip maximizers don't misunderstand human goals.

Of course they do. If they didn’t and picked their goal at random, they wouldn’t make paperclips in the first place.

There is a question of what paperclip maximizers are, and separately a question of how they might come to be, whether they are real in some possible future. Unicorns have exactly one horn, not three horns and not zero. Paperclip maximizers maximize paperclips, not stamps and not human values. It's the definition of what they are. The question of whether it's possible to end up with something like paperclip maximizers in reality is separate from that and shouldn't be mixed up.

So paperclip maximizers would actually make paperclips even if they understand human goals. The picking of goals isn't done by the agent itself, for without goals the agent is not yet its full self. It's something that happens as part of what brings an agent into existence in the first place, already in motion.

Also, it seems clear how to intentionally construct a paperclip maximizer: you search for actions whose expected futures have more paperclips, then perform those actions. So a paperclip maximizer is at least not logically incoherent.
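To make that recipe concrete, here is a minimal toy sketch in Python of expected-value action selection: score each candidate action by the expected number of paperclips in sampled futures, then take the argmax. The action names and the stub "world model" are invented purely for illustration; nothing here is meant to describe any real system.

```python
import random

def sample_future_paperclips(action: str) -> float:
    """Hypothetical stochastic world model: paperclips resulting from one sampled future."""
    outcomes = {
        "build_factory": lambda: random.gauss(1000.0, 200.0),
        "buy_paperclips": lambda: random.gauss(100.0, 10.0),
        "do_nothing": lambda: 0.0,
    }
    return outcomes[action]()

def expected_paperclips(action: str, n_samples: int = 10_000) -> float:
    """Monte Carlo estimate of the expected paperclip count after taking `action`."""
    return sum(sample_future_paperclips(action) for _ in range(n_samples)) / n_samples

def choose_action(actions: list[str]) -> str:
    """Search for the action whose expected future contains the most paperclips."""
    return max(actions, key=expected_paperclips)

print(choose_action(["build_factory", "buy_paperclips", "do_nothing"]))
```

The point of the sketch is only that the selection rule is well-defined; nothing in it refers to human goals, which is the sense in which such an agent is coherent without being aligned.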

It's not literally the thing that's a likely problem humanity might encounter. It's an illustration of the orthogonality thesis, of the possibility of agents with possibly less egregiously differing goals that keep to their goals despite understanding human goals correctly. It's a thought-experiment counterexample to arguments that pursuit of silly goals that miss the nuance of human values requires stupidity.

Would you bet that a significant proportion on LW expect strong AI to selectively and unfairly understand (and defend, and hide) their own goal while selectively and unfairly not understand (and not defend, and defeat) the goals of both the developers and any previous (and upcoming) versions?

The grouping of understanding and defending makes the meaning unclear. The whole topic of discussion is whether these occur independently, whether an agent can understand-and-not-defend. I'm myself an example: I understand paperclip maximization goals, and yet I don't defend them.

My claim is that most on LW expect strong AIs to fairly understand their own goal and the goals of both the developers and any previous (and upcoming) versions, and also have a non-insignificant chance, on current trajectory of AI progress, to simultaneously defend/hide/pursue their own goal, while not defending the goals of the developers.

You realize that this basically defeats the orthogonality thesis, right?

What do you think the orthogonality thesis [LW(p) · GW(p)] is? (Also, we shouldn't be bothered by defeating or not defeating the orthogonality thesis per se. Let the conclusion of an argument fall where it may, as long as local validity of its steps is ensured.)

Replies from: Ilio
comment by Ilio · 2024-01-03T03:17:39.257Z · LW(p) · GW(p)

What do you think the orthogonality thesis is?

I think that’s the deformation of a fundamental theorem (« there exists a universal Turing machine, i.e. it can run any program ») into a practical belief (« an intelligence can pick its values at random »), with a motte-and-bailey game on the meaning of "can", where the motte is the fundamental theorem and the bailey is the orthogonality thesis.

(thanks for the link to your own take, i.e. you think it’s the bailey that is the deformation)

Consider the sense in which humans are not aligned with each other. We can't formulate what "our goals" are. The question of what it even means to secure alignment is fraught with philosophical difficulties.

It’s part of the appeal, isn’t it?

If the oversight AI responsible for such decisions about a slightly stronger AI is not even existentially dangerous, it's likely to do a bad job of solving this problem.

I don’t get the logic here. Typo?

So I'm not claiming sudden changes, only intractability of what we are trying to do

That’s a fair point, but the intractability of a problem usually goes with the tractability of a slightly relaxed problem. In other words, it can be both fundamentally impossible to please everyone and fundamentally easy to control paperclip maximizers.

And also an aligned AI doesn't make the world safe until there is a new equilibrium of power, which is a point they don't address, but is still a major source of existential risk. For example, imagine giving multiple literal humans the power of being superintelligent AIs, with no issues of misalignment between them and their power. This is not a safe world until it settles, at which point humanity might not be there anymore. This is something that should be planned in more detail than what we get by not considering it at all.

Well said.

All significant risks are anthropogenic.

You think all significant risks are known?

Also, it seems clear how to intentionally construct a paperclip maximizer: you search for actions whose expected futures have more paperclips, then perform those actions. So a paperclip maximizer is at least not logically incoherent.

Indeed the inconsistency appears only with superintelligent paperclip maximizers. I can be petty with my wife. I don’t expect a much better me would.

comment by Orual · 2023-12-31T19:11:00.721Z · LW(p) · GW(p)

This is definitely an assumption that should be challenged more. However, I don't think that FOOM is remotely required for a lot of AI X-risk (or at least unprecedented catastrophic human death toll risk) scenarios. Something doesn't need to recursively self-improve to be a threat if it's given powerful enough ways to act on the world (and all signs point to us being exactly dumb enough to do that). All that's required is that we aren't able to coordinate well enough as a species to actually stop it. Either we don't detect the threat before it's too late, or we aren't able to get someone to actually hit the "off" button (literally or figuratively) in time if the threat is detected. And if it only kills 90% of humans because of some error and doesn't tile its light cone in paperclips, that's still really, really bad from a human perspective.

Replies from: Ilio
comment by Ilio · 2024-01-01T14:20:10.176Z · LW(p) · GW(p)

All that's required is that we aren't able to coordinate well enough as a species to actually stop it.

Indeed, I would be much more optimistic if we were better at dealing with much simpler challenges, like putting a price on pollution and welcoming refugees with humanity.

1 comment

Comments sorted by top scores.

comment by Vladimir_Nesov · 2024-01-01T02:54:11.438Z · LW(p) · GW(p)

a highly intelligent agent will be extremely likely to optimize some simple global utility function

Simplicity of a misaligned agent's goals is not needed for or implied by the usual arguments. It might make the agent's self-aligned self-improvement fractionally easier, but this doesn't seem to be an important distinction. An AI doesn't need to radically self-modify to be existentially dangerous; it only needs to put its more mundane advantages to use to get ahead, once it's capable of doing research autonomously.