What if "friendly/unfriendly" GAI isn't a thing?

homunq

What if "friendly/unfriendly" GAI isn't a thing?

post by homunq · 2022-04-07T16:54:05.735Z · LW · GW · 4 comments

4 comments

A quick sketch of thoughts. Epistemic status: pure speculation.

Recent AI developments are impressive
- Both text and images this week
- If I don't hesitate to say my dog is thinking, can't really deny that these things are.
Still, there are two important ways these differ from human intelligence.
- How they learn
  - Automatic-differentiation uses a retrospective god's-eye view that I think is probably importantly different from how humans learn. Certainly seems to require more substantially training data for equivalent performance.
- What they want
  - They optimize for easily-measurable functions. In most cases this is something like "predicting this input" or "fool this GAN into thinking you're indistinguishable from this input"
  - This is subject to Goodhart's law.
Eventually, these differences will probably be overcome.
- Learning and acting will become more integrated.
- Motivation will become more realistic.
  - I suspect this will be through some GAN-like mechanism. That is: train a simpler "feelings" network to predict "good things are coming" based on inputs including some simplified view of the acting-network's internals. Then train the more-complex "actor"-network to optimize for making the "feelings" happy.
Note that the kind of solution I suggest above could lead to a "neurotic" AI
- Very smart "actor" network, but "feeling" network not as smart and not pointing towards consistently optimizing over world-states.
- "Actor" is optimizing for "feelings" now, and "feelings" is predicting world-state later, but even if "actor" is smart enough to understand that disconnect it could still have something like pseudo-akrasia (poor term?) about directly optimizing for world-state (or just feelings) later. A simple example: it could be smart enough to wire-head in a way that would increase its own "happiness", but still not do so because the process of establishing wire-heading would not itself be "fun".
What if that's a general truth? That is, any intelligence is either subject to Goodhart's Law in a way so intrinsic/fundamental that it undermines its learning/effectiveness in the real world, OR it's subject to "neuroses" in the sense that it's not consistently optimizing over world-states?
- Thus, either it's not really GAI, or it's not really friendly/unfriendly.
- GA(super)I would still be scarily powerful. But it wouldn't be an automatic "game over" for any less-intelligent optimizers around it. They might still be able to "negotiate" "win-win" accommodations by nudging ("manipulating"?) the AI to different local optima of its "feelings". And the "actor" part of the AI would be aware that they were doing so but not necessarily directly motivated to stop them.

Obviously, the above is unfinished, and has lots of holes. But I think it's still original and thought-provoking enough to share. Furthermore, I think that making it less sketchy would probably make it worse; that is, many of the ideas y'all (collectively) have to fill in the holes here are going to be better than the ones I'd have.

4 comments

Comments sorted by top scores.

comment by jimmy · 2022-04-07T17:48:17.733Z · LW(p) · GW(p)

I don't think your conclusions follow.

Humans get neurotic and goodhart on feelings, so would you say "either it's not really GAI, or it's not really friendly/unfriendly" about humans? We seem pretty general, and if you give a human a gun either they use it to shoot you and take your stuff or they don't.

Similarly, with respect to "They might still be able to "negotiate" "win-win" accommodations by nudging the AI to different local optima of its "feelings" GAN", that's analogous to smart people going to dumb therapists. In my experience, helping people sort out their feelings pretty much requires having thought through the landscape better than they have, otherwise the person "trying to help" just gets sucked into the same troubled framing or has to disengage. That doesn't mean there isn't some room for lower IQ people to be able to help higher IQ people, but it does mean that this only really exists until the higher IQ person has done some competent self reflection. Not something I'd want to rely on.

If we're using humans as a model, there's two different kinds of "unfriendliness" to worry about. The normal one we worry about is when people do violent things which aren't helpful to the individual, like school shootings or unabombering. The other one is what we do to colonies of ants when they're living where we want to build a house. Humans generally don't get powerful enough for the latter to be much of a concern (except in local ways that are really part of the former failure mode), but anything superintelligent would. That gets us right back to thinking about what the hell a human even wants when unconstrained, and how to reliably ensure that things end up aligned once external constraints are ineffective.

Replies from: homunq

↑ comment by homunq · 2022-04-07T18:21:08.424Z · LW(p) · GW(p)

I guess we're using different definitions of "friendly/unfriendly" here. I mean something like "ruthlessly friendly/unfriendly" in the sense that humans (neurotic as they are) aren't. (Yes, some humans appear ruthless, but that's just because their "ruths" happen not to apply. They're still not effectively optimizing for future world-states, only for present feels.)

I think many of the arguments about friendly/unfriendly AI, at least in the earlier stages of that idea (I'm not up on all the latest) are implicitly relying on that "ruthless" definition of (un)friendliness.

You (if I understand) mean "friendly/unfriendly" in a weaker sense, in which humans can be said to be friendly/unfriendly (or neither? Not sure what you'd say about that, but it probably doesn't matter.)

As for the "smart people going to dumb therapists" argument, I think you're going back to a hidden assumption of ruthlessness: if the person knew how to feel better in the future, they would just do that. But what if, for instance, they know how to feel better in the future, but doing that thing wouldn't make them feel better right now unless they first simplify it enough to explain it to their dumb therapist? The dumb therapist is still playing a role.

My point is NOT to say that non-ruthless GASI isn't dangerous. My point is that it's not an automatic "game over" because if it's not ruthless it doesn't just institute its (un)friendly goals; it is at least possible that it would not use all its potential power.

Replies from: jimmy

↑ comment by jimmy · 2022-04-07T20:10:54.522Z · LW(p) · GW(p)

It's possible that it "wouldn't use all it's potential power" in the same sense that a high IQ neurotic mess of a person wouldn't use all of their potential power either if they're too poorly aligned internally to get out of bed and get things done. And while still not harmless, crazy people aren't as scary as coherently ruthless people optimized for doing harm.

But "People aren't ruthless" isn't true in any meaningful sense. If you're an ant colony, and the humans pave over you to make a house, the fact that they aren't completely coherent in their optimization for future states over feelings doesn't change the fact that their successful optimization for having a house where your colony was destroyed everything you care about.

People generally aren't in a position of that much power over other people such that reality doesn't strongly suggest that being ruthful will help them with their goals. When they do perceive that to be the case, you see an awful lot of ruthless behavior. Whether the guy in power is completely ruthless is much less important than whether you have enough threat of power to keep him feeling ruthful towards your existence and values.

When you start positing superintelligence, and it gets smart enough that it actually can take over the world regardless of what stupid humans want, that becomes a real problem to grapple with. So it makes sense that it gets a lot of attention, and we'd have to figure it out even if it were just a massively IQ and internal-coherence boosted human.

With respect to the "smart troubled person, dumb therapist" thing, I think you have some very fundamental misgivings about human aims and therapy. It's by no means trivial to explain in a tangent of a LW comment, but "if the person knew how to feel better in the future, they would just do that" is simply untrue. We do "optimize for feelings" in a sense, but not that one. People choose their unhappiness and their suffering because the alternative is subjectively worse (as a trivial example, would you take a pill that made you blisfully happy for the rest of your life if it came at the cost of happily watching your loved ones get tortured to death?). In the course of doing "therapy like stuff", sometimes you have to make this explicit so that they can reconsider their choice. I had one client, for example, who I led to the realization that his suffering was a result of his unthinking-refusal to give up hope on a (seemingly) impossible goal. Once he could see that this was his choice, he did in fact choose to suffer less and give up on that goal. However, that was because the goal was actually impossible to achieve, and there's no way in hell he'd have given up and chosen happiness if it were at all possible for him to succeed in his hopes.

It's possible for "dumb therapists" to play a useful role, but either those "dumb" therapists are still wiser than the hyperintelligent fool, or else it's the smart one leading the whole show.

Replies from: homunq

↑ comment by homunq · 2022-04-07T22:54:03.890Z · LW(p) · GW(p)

Sure, humans are effectively ruthless in wiping out individual ant colonies. We've even wiped out more than a few entire species of ant. But our ruthfulness about our ultimate goals — well, I guess it's not exactly ruthfulness that I'm talking about...

...The fact that it's not in our nature to simply define an easy-to-evaluate utility function and then optimize, means that it's not mere coincidence that we don't want anything radical enough to imply the elimination of all ant-kind. In fact, I'm pretty sure that for a large majority of people, there's no utopian ideal you could pitch and they'd buy into, that's so radical enough that getting there would imply or even suggest actions that would kill all ants. Not because humanity wouldn't be capable of doing that, just that we're not capable of wanting that, and that fact may be related to our (residual) ruthfulness and to our intelligence itself. And metaphorically, from a superintelligence's perspective, I think that humanity-as-a-whole is probably closer to being Formicidae than it is to being one species of ant.

...

This post, and its line of argument, is not about saying "AI alignment doesn't matter". Of fucking course it does. What I'm saying is: "it may not be the case that any tiny misalignment of a superintelligence is fatal/permanent". Because yes, a superintelligence can and probably will change the world to suit its goals, but it won't ruthlessly change the whole world to perfectly suit its goals, because those goals will not, themselves, be perfectly coherent. And in that gap, I believe there will probably still be room for some amount of humanity or posthumanity-that's-still-commensurate-with-extrapolated-human-values having some amount of say in their own fates.

The response I'm looking for is not at all "well, that's all OK then, we can stop worrying about alignment". Because there's a huge difference between future (post)humans living meagerly under sufferance in some tiny remnant of the world that a superintelligence doesn't happen to care about coherently enough to change, or them thriving as an integral part of the future that it does care about and is building, or some other possibility better or worse than those. But what I am arguing is that I think the "win big or lose big are the only options" attitude I see as common in alignment circles (I know that Eleizer isn't really cutting edge anymore, but, look at his recent April Fools' "joke" for an example) may be misguided. Not every superintelligence that isn't perfectly friendly is terrifyingly unfriendly, and I think that admitting other possibilities (without being complacent about them) might help useful progress in pursuing alignment.

...

As for your points about therapy: yes, of course, my off-the-cuff one-paragraph just-so-story was oversimplified. And yes, you seem to know a lot more about this than I do. But I'm not sure the metaphor is strong enough to make all that complexity matter here.

What if "friendly/unfriendly" GAI isn't a thing?

Contents

4 comments