Why do you believe AI alignment is possible?

post by acylhalide (samuel-shadrach) · 2021-11-15T10:05:57.255Z · LW · GW · 45 comments

This is a question post.

Genuinely curious: has someone recently (< 5 years old) made a comprehensive post on what key points lead them to believe AI alignment is possible? Or is it just a vague "we don't have any clue whatsoever, but we shouldn't give up"? Ideally the post should demonstrate deep understanding of all the problems and failed attempts that have already been tried.

And if key points don't exist, some vague inklings of your worldview or perspective, anything will do. I'll even accept poetry if you decide that's the only form of communication that will enable you to successfully communicate anything meaningful. (Only half joking)

(I ask because my mind defaults to assuming it's an impossible problem to solve, but I'm keen on reading any perspectives that change that.)

I tried hard to find such a post, couldn't so far. Just some replies here and there, buried in threads.

Answers

answer by TekhneMakre · 2021-11-15T11:21:25.979Z · LW(p) · GW(p)
I'll even accept poetry

I will now drive a small truck through the door you left ajar. (This is indeed bad poetry, so it's not coherent and not an answer and also not true, but it has some chance of being usefully evocative.)

It seems as though when I learn new information, ideas, or thought processes, they become available for my use towards my goals, and don't threaten my goals. To judge between actions, usually most of what I want to attend to figuring out is the likely consequences of the actions, rather than the evaluation of those consequences (excepting evaluations that are only about further consequences), indicating that to the extent my values are able to notice when they are under threat, they are not generally under threat by other thought processes. When I have unsatisfied desires, it seems they're usually mostly unsatisfied because I don't know which actions to take to bring about certain consequences, and I can often more or less see what sort of thing I would do, at some level of abstraction, to figure out which actions to take; suggesting that there is such a thing as "mere problem solving thought", because that's the sort of thought that I think I can see as a meta-level plan that would work, i.e., my experience from being a mind suggests that there is an essentially risk-free process I can undertake to gain fluency in a domain that lays the domain bare to the influence of my values. An FAI isn't an artifact, it's a hand and an eye. The FAI doing recursive self-improvement is the human doing recursive self-improvement. The FAI is densely enmeshed in low-latency high-frequency feedback relationships with the humans that resemble the relationships between different mental elements of my mental model of the room around me, or between those and the micro-tasks I'm performing and the context of those micro-tasks. A sorting algorithm has no malice, a growing crystal has no malice, and likewise a mind crystallizing well-factored ontology, from the primordial medium of low-impact high-context striving, out into new domains, has no malice. The neocortex is sometimes at war with the hardwired reward, but it's not at war with Understanding, unless specifically aimed that way by social forces; there's no such thing as "values" that are incompatible with Understanding, and all that's strictly necessary for AGI is Understanding, though we don't know how to sift baby from bath-lava. The FAI is not an agent! It defers to the human not for "values" or "governance" or "approval" but for context and meaning and continuation; it's the inner loop in an intelligent process, the C code that crunches the numbers. The FAI is a mighty hand growing out of the programmer's forehead. Topologically the FAI is a bubble in space that is connected to another space; metrically the FAI bounds an infinite space (superintelligence), but from our perspective is just a sphere (in particular, it's bounded). The tower of Babylon, but it's an inverted pyramid that the operator balances delicately. Or, a fractal, a tree say, where the human controls the angle of the branches and the relative lengths, propagating infinitely up but with fractally bounded impact. Big brute searches in algorithmically barren domains, small careful searches in algorithmically rich domains. The Understanding doesn't come from consequentialist reasoning; consequentialist reasoning constitutively requires Understanding; so the door is open to just think and not do anything. Algorithms have no malice. Consequentialist reasoning has malice. (Algorithms are shot through with consequentiality, but that's different from being aimed at consequences.) 
I mostly don't gain Understanding and algorithms via consequentialist reasoning, but by search+recognition, or by the play of thoughts against each other. Search is often consequentialist but doesn't have to be. One can attempt to solve a Rubik's cube without inevitably disassembling it and reassembling it in order. The play of thoughts against each other is logical coherence, not consequentialism. The FAI is not a predictive processor with its set-point set by the human, the FAI and the human are a single predictive processor.

comment by acylhalide (samuel-shadrach) · 2021-11-15T11:43:59.437Z · LW(p) · GW(p)

Thanks for this.

Would the physical implementation of this necessarily be a man-machine hybrid? Communication directly via neurochemical signals, at least in early stages. Or could the "single predictive processor" still have two subagents (man and machine) able to talk to each other in English? (If you don't have thoughts on physical implementation that's fine too.)

An FAI isn't an artifact, it's a hand and an eye. 

Would this be referring to superhuman capabilities that are narrow in nature? There's a difference between a computer that computes fluid dynamics or game theory solutions at an absurd pace but using well-known algorithms, versus a general intelligence. The former does feel like a hand and an eye, at least today; the latter I'm far less confident can feel like anything but alien. Also, the former tends to push against the definition of what an AI even is.

Also, even hands and eyes can be existentially dangerous; see the Vulnerable World Hypothesis. Black balls are not even intelligent, they are still tools humans can deploy.

But I agree, humans with sufficiently many metaphorical hands and eyes in the year 2200 could look superintelligent to humans in 2021, same as how our current reasoning capacity in math, cogsci, philosophy etc could look superhuman to cavemen.

Replies from: TekhneMakre
comment by TekhneMakre · 2021-11-15T20:01:38.842Z · LW(p) · GW(p)
even hands and eyes can be existentially dangerous

That seems right, though it's definitely harder to be an x-risk without superintelligence; e.g. even a big nuclear war isn't a guaranteed extinction, nor an extremely infectious and lethal virus (because, like, an island population with a backup of libgen could recapture a significant portion of value).

necessarily be a man-machine hybrid

I hope not, since that seems like an additional requirement that would need independent work. I wouldn't know concretely how to use the hybridizing capability, that seems like a difficult puzzle related to alignment. I think the bad poetry was partly trying to say something like: in alignment theory, you're *trying* to figure out how to safely have the AI be more autonomous---how to design the AI so that when it's making consequential decisions without supervision, it does the right thing or at least not a permanently hidden or catastrophic thing. But this doesn't mean you *have to* "supervise" (or some other high-attention relationship that less connotes separate agents, like "wield" or "harmonize with" or something) the AI less and less; more supervision is good.

subagents (man and machine) able to talk to each other in English?

IDK. Language seems like a very good medium. I wouldn't say subagent though, see below.

Would this be referring to superhuman capabilities that are narrow in nature?

This is a reasonable interpretation. It's not my interpretation; I think the bad poetry is talking about the difference between one organic whole vs two organic wholes. It's trying to say that having the AI be genuinely generally intelligent doesn't analytically imply that the AI is "another agent". Intelligence does seem to analytically imply something like consequentialist reasoning; but the "organization" (whatever that means) of the consequentialist reasoning could take a shape other than "a unified whole that coherently seeks particular ends" (where the alignment problem is to make it seek the right ends). The relationship between the AI's mind and the human's mind could instead look more like the relationship between [the stuff in the human's mind that was there only at or after age 10] and [the stuff in the human's mind that was there only strictly before age 10], or the relationship between [one random subset of the organs and tissues and cells in an animal] and [the rest of the organs and tissues and cells in that animal that aren't in the first set]. (I have very little idea what this would look like, or how to get it, so I have no idea whether it's a useful notion.)

Replies from: samuel-shadrach
comment by acylhalide (samuel-shadrach) · 2021-11-16T07:33:40.809Z · LW(p) · GW(p)

it's definitely harder to be an x-risk without superintelligence

Harder but not impossible. Black balls are hypothetical inventions whose very existence (or existence as public information) makes them very likely to be deployed. With nukes for instance we have only a small set of parties who are capable of building them and choose not to deploy.

an island population with a backup of libgen

As a complete aside, that's a really cool hypothetical; I have no idea if it's true though. Lots of engineering depends on our economic and scientific history, costs of materials, etc. It's possible that they develop manufacturing differently and different things end up cheaper. Or some scientific/engineering departments are understaffed or overstaffed relative to our world because there just happen to be fewer or more people interested in them. They would still likely have scientific progress, assuming they solve some of the social coordination problems that we have.

(or some other high-attention relationship that less connotes separate agents, like "wield" or "harmonize with" or something)

a unified whole that coherently seeks particular ends

Interesting, I'd have to think about it.

one random subset of the organs and tissues and cells in an animal

One could say that organs are in fact subagents, they have different goals. Like how different humans have different goals but they cooperate so you can say individual humans are subagents to the human collective as an agent. The difference between the two I guess is that organs are "not intelligent", whatever intelligence they do have is very narrow, and importantly they don't have an inner model for the rest of the world. 

Just wondering, could an AI have an inner model of the world independent from the human's inner model of the world, and yet exist in this hybrid state you mention? Or must they necessarily share a common model or significantly collaborate and ensure their models align at all times?

[the stuff in the human's mind that was there only at or after age 10] and [the stuff in the human's mind that was there only strictly before age 10]

Would this be similar to:

humans with sufficiently many metaphorical hands and eyes in the year 2200 could look superintelligent to humans in 2021, same as how our current reasoning capacity in math, cogsci, philosophy etc could look superhuman to cavemen.

Or do you mean something else?

All in all, I think you have convinced me it might be possible :)

Replies from: TekhneMakre
comment by TekhneMakre · 2021-11-16T07:59:02.545Z · LW(p) · GW(p)
One could say that organs are in fact subagents, they have different goals.

I wouldn't want to say that too much. I'd rather say that an organ serves a purpose. It's part of a design, part of something that's been optimized, but it isn't mainly optimizing, or as you say, it's not intelligent. More "pieces which can be assembled into an optimizer", less "a bunch of little optimizers", and maybe it would be good if the human were doing the main portion of the assembling, whatever that could mean.

humans with sufficiently many metaphorical hands and eyes in the year 2200 could look superintelligent to humans in 2021, same as how our current reasoning capacity in math, cogsci, philosophy etc could look superhuman to cavemen

Hm. This feels like a bit of a different dimension from the developmental analogy? Well, IDK how the metaphor of hands and eyes is meant. Having more "hands and eyes", in the sense of the bad poetry of "something you can wield or perceive via", feels less radical than, say, what happens when a 10-year-old meets someone they can have arguments with and learns to argue-think.

Just wondering, could an AI have an inner model of the world independent from the human's inner model of the world, and yet exist in this hybrid state you mention? Or must they necessarily share a common model or significantly collaborate and ensure their models align at all times?

IDK, it's a good question. I mean, we know the AI has to be doing a bunch of stuff that we can't do, or else there's no point in having an AI. But it might not have to quite look like "having its own model", but more like "having the rest of the model that the human's model is trying to be". IDK. Also could replace "model" with "value" or "agency" (which goes to show how vague this reasoning is).

Replies from: samuel-shadrach
comment by acylhalide (samuel-shadrach) · 2021-11-16T11:06:35.517Z · LW(p) · GW(p)

Got it. Second para makes a lot of sense.

First and last para feel like intentionally deflecting from trying to pin down specifics. I mean your responses are great but still. My responses seem to be moving towards trying to pin some specific things down, yours go a bit in the opposite direction. Do you feel pinning down specifics is a) worth doing? b) possible to do? c) something you wish to do in this conversation?

(I totally understand that defining specifics too rigidly in one way shouldn't blind us to all the other ways we could have done things, but that doesn't by itself mean we shouldn't ever try to define them in different ways and think each of those through.)

Replies from: TekhneMakre
comment by TekhneMakre · 2021-11-16T12:09:54.348Z · LW(p) · GW(p)

a) worth doing?

Extremely so; you only ever get good non-specifics as the result of having iteratively built up good specifics.

b) possible to do?

In general, yes. In this case? Fairly likely not; it's bad poetry, the senses that generated it are high variance: likely nonsense, with some chance of some sense. And alignment is hard and understanding minds is hard.

c) something you wish to do in this conversation?

Not so much, I guess. I mean, I think some of the metaphors I gave, e.g. the one about the 10 year old, are quite specific in themselves, in the sense that there's some real thing that happens when a human grows up which someone could go and think about in a well-defined way, since it's a real thing in the world; I don't know how to make more specific what, if anything, is supposed to be abstracted from that as an idea for understanding minds, and more-specific-ing seems hard enough that I'd rather rest it.

Thanks for noting explicitly. (Though, your thing about "deflecting" seems, IDK what, like you're mad that I'm not doing something, or something, and I'd rather you figure out on your own what it is you're expecting from people explicitly and explicitly update your expectations, so that you don't accidentally incorrectly take me (or whoever you're talking to) to have implicitly agreed to do something (maybe I'm wrong that's what happened). It's connotatively false to say I'm "intentionally deflecting" just because I'm not doing the thing you wanted / expected. Specific-ing isn't the only good conversational move and some good conversational moves go in the opposite direction.)

Replies from: samuel-shadrach
comment by acylhalide (samuel-shadrach) · 2021-11-16T13:38:49.560Z · LW(p) · GW(p)

re c): Cool, no worries. I agree it's a little specific.

re last para, you're right that "deflecting" may not have been the best word. Basically I meant you're intentionally moving the conversation away from trying to nail down specifics, which is opposite to the direction I was trying to move it because that's where I felt it would be most useful. I agree that your conversational move may have been useful, I was just wondering if it would be more useful to now start moving in the direction I wanted.

By the end of this conversation I have gotten a vague mental picture that there could possibly exist (in theory) a collective mind that has both man and machine inside it. Which answers my original question and is useful. So my probability of "can aligned AI exist" updated a small amount.

But I haven't gotten many specifics on what parts are man and what parts are machine conceptually, or any specifics on how this thing looks or is built physically, or any promising direction to pursue to get there, or even a way to judge if we have in fact gotten there, et cetera et cetera. Hence my probability update is small, not large. But my uncertainty is higher.

I agree all this is hard, no worries if we don't discuss it here.

Replies from: TekhneMakre
comment by TekhneMakre · 2021-11-16T18:52:45.522Z · LW(p) · GW(p)

Mostly, all good. (I'm mainly making this comment about process because it's a thing that crops up a lot and seems sort of important to interactions in general, not because it particularly matters in this case.) Just, "I meant you're intentionally moving the conversation away from trying to nail down specifics"; so, it's true that (1) I was intentionally doing X, and (2) X entails not particularly going toward nailing down specifics, and (3) relative to trying to nail down specifics, (2) entails systematically less nailing down of specifics. But it's not the case that I intended to avoid nailing down specifics; I just was doing something else. I'm not just saying that I wasn't *deliberately* avoiding specifics, I'm saying I was behaving differently from someone who has a goal or subgoal of avoiding specifics. Someone with such a goal might say some things that have the sole effect of moving the conversation away from specifics. For example, they might provide fake specifics to distract you from the fact they're not nailing down specifics; they might mock you or otherwise punish you for asking for specifics; they might ask you / tell you not to ask questions because they call for specifics; they might criticize questions for calling for specifics; etc. In general there's a potentially adversarial dynamic here, where someone intends Y but pretends not to intend Y, and does this by acting as though they intend X which entails pushing against Y; and this muddies the waters for people just intending X, not Y, because third parties can't distinguish them. Anyway, I just don't like the general cultural milieu of treating it as an ironclad inference that if someone's actions systematically result in Y, they're intending Y. It's really not a valid inference in theory or practice. The situation is sometimes muddied, such that it's appropriate to treat such people *as though* they're intending Y, but distinguishing this from a high-confidence proposition that they are in fact intending Y (even non-deliberately!) is important IMO.

Replies from: samuel-shadrach
comment by acylhalide (samuel-shadrach) · 2021-12-09T19:36:48.496Z · LW(p) · GW(p)

Sorry, I missed the notification for this reply.

Thank you for the detailed response, makes a lot of sense. Also I didn't think avoiding nailing down specifics was the main goal you had, just that it was a side-effect.

Communication is hard lol, idk if:

you're intentionally moving the conversation away from trying to nail down specifics

Means "you're intentionally moving the conversation away from trying to nail down specifics with the primary intention of moving the conversation away from specifics"

I guess I'll try being more precise next time, hope I didn't come across as adversarial.

answer by Jon Garcia · 2021-11-15T18:01:56.017Z · LW(p) · GW(p)

The algorithms of good epistemology

Can also equip axiology.

With free energy minimization

Of other-mind prediction,

You can route it to an AI's teleology.

comment by acylhalide (samuel-shadrach) · 2021-11-16T07:13:20.657Z · LW(p) · GW(p)

I tried reading about free energy minimisation on wikipedia, it went past my head. Is there any source or material you would recommend?

Replies from: Jon Garcia
comment by Jon Garcia · 2021-11-16T22:45:41.108Z · LW(p) · GW(p)

Yeah, Friston is a bit notorious for not explaining his ideas clearly enough for others to understand easily. It took me a while to wrap my head around what all his equations were up to and what exactly "active inference" entails, but the concepts are relatively straightforward once it all clicks.

You can think of "free energy" as the discrepancy between prediction and observation, like the potential energy of a spring stretched between them. Minimizing free energy is all about finding states with the highest probability and setting things up such that the highest-probability states are those where your model predictions match your observations. In statistical mechanics, the probability of a particle occupying a particular state is proportional to the exponential of the negative potential energy of that state. That's why air pressure exponentially drops off with altitude (to a first approximation, $p(h) \propto e^{-mgh/k_B T}$). For a normal distribution,

$$p(x) \propto \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right),$$

the energy is a parabola:

$$E(x) = -\log p(x) = \frac{(x - \mu)^2}{2\sigma^2} + \text{const.}$$

This is exactly the energy landscape you see for an ideal Newtonian spring with rest length $\mu$ and spring constant $1/\sigma^2$. Physical systems always seek the configuration with the lowest free energy (e.g., a stretched spring contracting towards its rest length). In the context of mind engineering, $x$ might represent an observation, $\mu$ the prediction of the agent's internal model of the world, and $1/\sigma^2$ the expected precision of that prediction. Of course, these are all high-dimensional vectors, so matrix math is involved (Friston always uses $\Pi$ for the precision matrix).

For rational agents, free energy minimization involves adjusting the hidden variables in an agent's internal predictive model (perception) or adjusting the environment itself (action) until "predictions" and "observations" align to within the desired/expected precision. (For actions, "prediction" is a bit of a misnomer; it's actually a goal or a homeostatic set point that the agent is trying to achieve. This is what "active inference" is all about, though, and has caused free energy people to talk about motor outputs from the brain as being "self-fulfilling prophecies".) The predictive models that the agent uses for perception are actually built hierarchically, with each level acting as a dynamic generative model making predictions about the level below. Higher levels send predictions down to compare with the "observations" (state) of the level below, and lower levels send prediction errors back up to the higher levels in order to adjust the hidden variables through something like online gradient descent. This process is called "predictive coding" and leads to the minimization of the free energy between all levels in the hierarchy.
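(To make the perception/action split above concrete, here is a minimal toy sketch in Python. It is my own illustration, not anything from the comment: one hidden variable, quadratic "spring" energies with unit precisions, and plain gradient descent standing in for the full hierarchical machinery. All names and constants are arbitrary.)

```python
# Toy free-energy minimization: one observation x, one belief mu, one prior.
# "Perception" moves the belief; "action" (active inference) moves the world.

def free_energy(x, mu, prior_mu, pi_obs=1.0, pi_prior=1.0):
    """Quadratic 'spring' energy: observation term plus prior term.
    pi_obs and pi_prior are precisions (inverse variances), i.e. spring constants."""
    return 0.5 * pi_obs * (x - mu) ** 2 + 0.5 * pi_prior * (mu - prior_mu) ** 2

def perceive(x, mu, prior_mu, lr=0.1, steps=100):
    """Perception: adjust the belief mu to reduce free energy, with x held fixed."""
    for _ in range(steps):
        grad = -(x - mu) + (mu - prior_mu)  # dF/dmu with unit precisions
        mu -= lr * grad
    return mu

def act(x, mu, lr=0.1, steps=100):
    """Action: change the world state x toward the prediction / set point mu."""
    for _ in range(steps):
        grad = x - mu  # dF/dx with unit precision
        x -= lr * grad
    return x

print(perceive(x=2.0, mu=0.0, prior_mu=0.0))  # ~1.0: belief settles between data and prior
print(act(x=2.0, mu=0.0))                     # ~0.0: the world is pulled to the set point
```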

My little limerick was alluding to the idea that you could build an AGI to include a generative model of human behavior, using predictive coding to find the goals, policies, instinctual drives, and homeostatic set points that best explain the human's observed behavior. Then you could route these goals and policies to the AGI's own teleological system. That is, make the human's goals and drives, whatever it determines them to be using its best epistemological techniques, into its own goals and drives. Whether this could solve AI alignment would take some research to figure out. (Or just point out the glaring flaws in my reasoning here.)
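(Again a toy sketch of my own rather than Jon Garcia's actual proposal: the "human" below is just a noisy controller pulling a one-dimensional state toward a hidden set point, and the agent fits that set point by gradient descent on its prediction error, then adopts it as its own goal. Every name and constant here is an arbitrary stand-in.)

```python
import random

def simulate_human(goal, steps=200, noise=0.1, seed=0):
    """Observed behaviour: a human nudging a 1-D state toward its (hidden) goal."""
    random.seed(seed)
    x, trajectory = 0.0, []
    for _ in range(steps):
        x += 0.1 * (goal - x) + random.gauss(0.0, noise)
        trajectory.append(x)
    return trajectory

def infer_goal(trajectory, lr=1.0, steps=500):
    """Fit the set point that best predicts the observed transitions
    (a crude stand-in for predictive coding over the human's behaviour)."""
    goal_hat = 0.0
    prev_states = [0.0] + trajectory[:-1]
    for _ in range(steps):
        grad = 0.0
        for x_prev, x_next in zip(prev_states, trajectory):
            predicted = x_prev + 0.1 * (goal_hat - x_prev)
            grad += -0.2 * (x_next - predicted)  # derivative of squared error w.r.t. goal_hat
        goal_hat -= lr * grad / len(trajectory)
    return goal_hat

observed = simulate_human(goal=3.0)
adopted = infer_goal(observed)  # the agent now treats this as its own set point
print(adopted)                  # roughly 3.0
```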

Replies from: samuel-shadrach
comment by acylhalide (samuel-shadrach) · 2021-11-17T05:39:42.452Z · LW(p) · GW(p)

Thanks for typing this out.

Sounds like uploading a human mind and then connecting it to an "intelligence module". (It's probably safer for us to first upload and then think about intelligence enhancement, rather than ask an AGI to figure out how to upload or model us.)

I personally tend to feel that even such a mind would quickly adopt behaviours that you and I find... alien, and its value system would change significantly. Do you feel that wouldn't happen, and if so, do you have any insight as to why?

answer by moridinamael · 2021-11-15T16:56:16.393Z · LW(p) · GW(p)

As I see it there are mainly two hard questions in alignment. 

One is, how do you map human preferences in such a way that you can ask a machine to satisfy them? I don't see any reason why this would be impossible for a superintelligent being to figure out. It is somewhat similar (though obviously not identical) to asking a human to figure out how to make fish happy.

The second is, how do you get a sufficiently intelligent machine to do anything whatsoever without doing a lot of terrible stuff you didn't want as a side effect? As Yudkowsky says:

The way I sometimes put it is that I think that almost all of the difficulty of the alignment problem is contained in aligning an AI on the task, “Make two strawberries identical down to the cellular (but not molecular) level.” Where I give this particular task because it is difficult enough to force the AI to invent new technology. It has to invent its own biotechnology to make two identical strawberries down to the cellular level. It has to be quite sophisticated biotechnology, but at the same time, very clearly something that’s physically possible.

This does not sound like a deep moral question. It does not sound like a trolley problem. It does not sound like it gets into deep issues of human flourishing. But I think that most of the difficulty is already contained in, “Put two identical strawberries on a plate without destroying the whole damned universe.” There’s already this whole list of ways that it is more convenient to build the technology for the strawberries if you build your own superintelligences in the environment, and you prevent yourself from being shut down, or you build giant fortresses around the strawberries, to drive the probability to as close to 1 as possible that the strawberries got on the plate.

When I consider whether this implied desideratum is even possible, I just note that I and many others continue to not inject heroin. In fact, I almost never seem to act in ways that look much like driving the probability of any particular outcome as close to 1 as possible. So clearly it's possible to embed some kind of motivational wiring into an intelligent being, such that the intelligent being achieves all sorts of interesting things without doing too many terrible things as a side effect. If I had to guess, I would say that the way we go about this is something like: wanting a bunch of different, largely incommensurable things at the same time, some of which are very abstract, some of which are mutually contradictory, and somehow all these different preferences keep the whole system mostly in balance most of the time. In other words, it's inelegant and messy and not obvious how you would translate it into code, but it is there, and it seems to basically work. Or, at least, I think it works as well as we can expect, and serves as a limiting case.
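(A toy illustration of my own, not moridinamael's: one way to cash out "many incommensurable preferences keeping the system in balance" is maximin over several bounded drives, so that no single drive gets pushed toward probability ~1 at the expense of the rest. The drive names and numbers below are made up.)

```python
# Pick the action whose *worst-satisfied* drive is best, rather than
# maximizing any single drive to the hilt.

def choose_action(actions):
    """actions maps action name -> {drive name: satisfaction in [0, 1]}."""
    return max(actions, key=lambda a: min(actions[a].values()))

actions = {
    "inject_heroin":        {"pleasure": 1.00, "health": 0.05, "curiosity": 0.10},
    "build_strawberry_lab": {"pleasure": 0.60, "health": 0.70, "curiosity": 0.90},
    "do_nothing":           {"pleasure": 0.40, "health": 0.80, "curiosity": 0.20},
}

print(choose_action(actions))
# -> "build_strawberry_lab": balanced across drives, even though other actions
#    max out one drive each.
```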

comment by acylhalide (samuel-shadrach) · 2021-11-15T17:27:30.998Z · LW(p) · GW(p)

Thanks for replying. I'll try my best to reply.

Firstly, I'm not convinced that "human preferences" as a coherent concept even exists for an AGI reasoning about it. Basically, we cluster all our inclinations at various points in time into a cluster in thingspace [LW · GW] called "human preferences" when we reason about ourselves. An AGI that can very transparently see that we as humans are optimising for no one thing in particular may use an entirely different model to reason about human behaviour than the ones humans use to reason about human behaviour. This model may not contain a variable called "human preferences", the same way we don't often look at Turing machines and think of "preferences of Turing machine #43". If you can literally run the Turing machine for every iteration till its output, you don't find the need to invent embedded variables and concepts to approximate its output.

Given that, I'm not sure about an alignment strategy that cleanly separates a) defining our preferences as some logically ordered, consistent-over-time thing from b) feeding that into an AI.

somehow all these different preferences keep the whole system mostly in balance 

It might be possible that an AGI too has subagents keeping each other in balance; it just isn't clear to me why it can have the exact same subagents that humans have, without one of them completely overruling the other.

But tbh maybe you have given me a mental model where the problem looks "not impossible". Idk. 

without doing too many terrible things as a side effect

Just to confirm we agree, "terrible" as a concept exists only inside of the human perspective, things are not moral or immoral outside of it. An AI can easily perceive making paperclips as not terrible and doing anything else as terrible.

answer by lsusr · 2021-11-15T10:20:46.938Z · LW(p) · GW(p)

Human brains are a priori aligned with human values. Human brains are proof positive that a general intelligence can be aligned with human values. Wetware is an awful computational substrate. Silicon ought to work better.

comment by cousin_it · 2021-11-15T14:49:57.025Z · LW(p) · GW(p)

Arguments by definition don't work. If by "human values" you mean "whatever humans end up maximizing", then sure, but we are unstable and can be manipulated, which isn't what we want in an AI. And if you mean "what humans deeply want or need", then human actions don't seem very aligned with that, so we're back at square one.

comment by Evenflair (Raven) · 2021-11-16T05:44:43.615Z · LW(p) · GW(p)

Humans aren't aligned once you break the abstraction of "humans" down. There's nobody I would trust to be a singleton with absolute power over me (though if I had to take my chances, I'd rather have a human than a random AI).

comment by Alexander (alexander-1) · 2022-01-26T07:43:45.830Z · LW(p) · GW(p)

I like your perspective here, but I don't think it's a given that human brains are necessarily aligned with human 'values'. It entirely depends on how we define human values. Let's suppose that long-term existential risk reduction is a human value (it ranks highly in nearly all moral theories). Because of cognitive limitations and biases, most human brains aren't aligned with this value.

comment by acylhalide (samuel-shadrach) · 2021-11-15T10:32:33.027Z · LW(p) · GW(p)

I see but isn't this reversed? "Human values" are defined by whatever vague cluster of things human brains are pointing at.

Replies from: lsusr
comment by lsusr · 2021-11-15T10:39:58.267Z · LW(p) · GW(p)

Definition implies equality. Equality is commutative. If "human values" equals "whatever vague cluster of things human brains are pointing at" then "whatever vague cluster of things human brains are pointing at" equals "human values".

Replies from: samuel-shadrach
comment by acylhalide (samuel-shadrach) · 2021-11-15T10:45:56.770Z · LW(p) · GW(p)

Agreed but that doesn't help. If you tell me that A aligns with B and B is defined as the thing that A aligns to, these statements are consistent but give zero information. And more specifically, zero information about whether some C in Set S can also align with B.

answer by M. Y. Zuo · 2021-11-15T16:30:05.524Z · LW(p) · GW(p)

I believe it's possible for AI values to align as much as the least possibly aligned human individuals are aligned with each other. And in my books, if this could be guaranteed, it would already constitute a heroic achievement, perhaps the greatest accomplishment of mankind up until that point.

Any greater alignment would be a pleasant fantasy, hopefully realizable if AGIs were to come into existence, but doesn’t seem to have any solid justification, at least not any more than many other pleasant fantasies.

comment by cousin_it · 2021-11-15T16:56:19.937Z · LW(p) · GW(p)

I think a lot of human "alignment" isn't encoded in our brains, it's encoded only interpersonally, in the fact that we need to negotiate with other humans of similar power. Once a human gets a lot of power, often the brakes come off. To the extent that's true, alignment inspired by typical human architecture won't work well for a stronger-than-human AI, and some other approach is needed.

Replies from: M. Y. Zuo
comment by M. Y. Zuo · 2021-11-15T17:03:34.507Z · LW(p) · GW(p)

I didn’t mean to suggest that any future approach has to rely on ‘typical human architecture’. I also believe the least possibly aligned humans are less aligned than the least possibly aligned dolphins, elephants, whales, etc…,  are with each other. Treating AGI as a new species, at least as distant to us as dolphins for example, would be a good starting point.

comment by acylhalide (samuel-shadrach) · 2021-11-15T17:12:33.425Z · LW(p) · GW(p)

I see. Keen on your thoughts on following:

Would a human with slightly superhuman intellect be aligned with other humans? What about a human whose intelligence is as unrecognisable to us as we are to monkeys? Would they still be "human"? Would their values still be aligned? 

Replies from: M. Y. Zuo
comment by M. Y. Zuo · 2021-11-15T18:00:23.663Z · LW(p) · GW(p)

Well I would answer but the answers would be recursive. I cannot know the true values and alignment of such a superhuman intellect without being one myself. And if I were, I wouldn’t be able to communicate such thoughts with their full strength, without you also being at least equally superhuman to understand. And if we both were, then you would know already.

And if neither of us are, then we can at best speculate with some half baked ideas that might sound convincing to us but unconvincing to said superhuman intellects. At best we can hope that any seeming alignment of values, perceived to the best of our abilities, is actual. Additionally, said supers may consider themselves ’humans’ or not, on criteria possibly also beyond our understanding.

Alternatively, if we could raise ourselves to that level, then super-super affairs would become the basis, thus leading us to speculate on hyper-superhuman topics on super-LessWrong, ad infinitum.

Replies from: samuel-shadrach
comment by acylhalide (samuel-shadrach) · 2021-11-16T07:06:49.304Z · LW(p) · GW(p)

Got it. That's a possible stance.

But I do believe there exist arguments (/chains of reasoning/etc) that can be understood by and convincing to both smart and dumb agents, even if the class of arguments that a smarter agent can recognise is wider. I would personally hope one such argument can answer the question "can alignment be done?", either as yes or no. There's a lot of things about the superhuman intellect that we don't need to be able to understand in order for such an argument to exist. Same as how we don't need to understand the details of monkey language or their conception of self or any number of other things, to realise that humans and monkeys are not fully aligned. (We care more about our survival than theirs, they care more about their survival than ours.)

Are you still certain that no such argument exists? If so, why?

Replies from: M. Y. Zuo
comment by M. Y. Zuo · 2021-11-16T13:31:11.148Z · LW(p) · GW(p)

In this case we would be the monkeys gazing at the strange, awkwardly tall and hairless monkeys pondering about them in terms of monkey affairs. Maybe I would understand alignment in terms of whose territory is whose, who is the alpha and omega among the human tribe(s), which banana trees are the best, where is the nearest clean water source, what kind of sticks and stones make the best weapons, etc.

I probably won’t understand why human tribe(s) commit such vast efforts to creating and securing and moving around those funny-looking giant metal cylinders with lots of gizmos at the top, bigger than any tree I’ve seen. Why every mention of them elicits dread, why only a few of the biggest human tribes are allowed to have them, why they need to be kept on constant alert, why multiple of them need to be put in even bigger metal cylinders to roam around underwater, etc., surely nothing can be that important, right?

If the AGI is moderately above us, then we could probably find such arguments convincing to both, but we would never be certain of them.

If the AGI becomes as far above us as humans are to monkeys, then I believe the chances are about as likely as us finding arguments that could convince monkeys about the necessity of ballistic missile submarines.

Replies from: samuel-shadrach
comment by acylhalide (samuel-shadrach) · 2021-11-16T13:50:04.418Z · LW(p) · GW(p)

Okay but the analogue isn't that we need to convince monkeys ballistic missiles are important. It's that we need to convince monkeys that we care about exactly the same things they do. That we're one of them.

(That's what I meant by - there's a lot of things we don't need to understand, if we only want to understand that we are aligned.)

Replies from: M. Y. Zuo
comment by M. Y. Zuo · 2021-11-16T15:07:14.810Z · LW(p) · GW(p)

Are you pondering what arguments a future AGI will need to convince humans? That’s well covered on LW. 

Otherwise my point is that we will almost certainly not convince monkeys that ‘we’re one of them’ if they can use their eyes and see that, instead of spending resources on bananas, etc., we’re spending them on ballistic missiles, etc.

Unless you mean if we can by deception, such as denying we spend resources along those lines, etc… in that case I’m not sure how that relates to a future AGI/human scenarios. 

Replies from: samuel-shadrach
comment by acylhalide (samuel-shadrach) · 2021-11-16T15:24:10.221Z · LW(p) · GW(p)

they can use their eyes and see that, instead of spending resources on bananas, etc., we’re spending them on ballistic missiles, etc.

That certainly acts as a point against us being aligned, in the brain of the monkey. (Assuming they could even understand it's us who are building the missiles in front of them.) Maybe you can counteract it with other points in favour. It isn't immediately clear to me why that has to be deceptive (if we were in fact aligned with monkeys). Keen on your thoughts.

P.S. Minor point but you can even deliberately hide the missiles from the monkeys, if necessary. I'm not sure if willful omission counts as deception.

answer by AnthonyC · 2021-11-21T13:47:05.858Z · LW(p) · GW(p)

Which kind of impossible-to-solve do you think alignment is, and why?

Do you mean that there literally isn't any one of the countably infinite set of bit strings that could run as a program on any mathematically possible piece of computing hardware that would "count" as both superintelligent and aligned? That... just seems like a mathematically implausible prior. Even if any particular program is aligned with probability zero, there could still be infinitely many aligned superintelligences "out there" in mind design space. 

Note: if you're saying the concept of "aligned" is itself confused to the point of impossibility, well, I'd agree that I'm at least sure my current concept of alignment is that confused if I push it far enough, but it does not seem to be the case that there are no physically possible futures I could care about and consider successful outcomes for humanity, so it should be possible to repair said concept.

Do you mean there is no way to physically instantiate such a device? Like, it would require types of matter that don't exist, or numbers of atoms so large they'd collapse into a black hole, or so much power that no Kardashev I or II civ could operate it? Again, I find that implausible on the grounds that all the humans combined are made of normal atoms, weigh on the order of a billion tons, and consume on the order of a terawatt of chemical energy in the form of food, but I'd be interested in any discussions of this question.

Do you mean it's just highly unlikely that humans will successfully find and implement any of the possible safe designs? Then assuming impossibility would seem to make this even more likely, self-fulfilling-prophecy style, no? Isn't trying to fix this problem the whole point of alignment research?

comment by acylhalide (samuel-shadrach) · 2021-11-21T18:47:58.410Z · LW(p) · GW(p)

Thanks for replying.

I think my intuitions are a mix of your 3rd and 5th one.

Do you mean it's just highly unlikely that humans will successfully find and implement any of the possible safe designs? Then assuming impossibility would seem to make this even more likely, self-fulfilling-prophecy style, no? Isn't trying to fix this problem the whole point of alignment research?

If the likelihood is sufficiently low, no reasonable amount of work might get you there. Say the odds of aligned AI being built this century are 10^-10 if you do nothing versus 10^-5 if thousands of people devote their lives to it. Versus say 10^-2 odds of eventually unleashing an unaligned AI in the same timeframe. (I'm not saying this is actually the case, just a hypothetical.)

If you're saying the concept of "aligned" is itself confused to the point of impossibility, well, I'd agree that I'm at least sure my current concept of alignment is that confused if I push it far enough, but it does not seem to be the case that there are no physically possible futures I could care about and consider successful outcomes for humanity, so it should be possible to repair said concept.

Maybe I just have intuitions stronger than others regarding this. Basically yes, concepts like "values" and "aligned" themselves seem confused to me. In the absence of such intuitions, I can totally see your viewpoint of nothing being impossible.

P.S. I started writing up my intuitions but realised they're still not in a readable form, you can view them here [LW(p) · GW(p)] if you're really keen.

answer by AprilSR · 2021-11-15T15:44:10.294Z · LW(p) · GW(p)

I believe it is not literally impossible because… my priors say it is the kind of thing that is not literally impossible? There is no theorem or law of physics which would be violated, as far as I know.

Do I think AI Alignment is easy enough that we’ll actually manage to do it? Well… I really hope it is, but I’m not very certain.

comment by acylhalide (samuel-shadrach) · 2021-11-15T17:06:17.834Z · LW(p) · GW(p)

Got it. I have some priors towards it being near-impossible after reading content about it and adjacent philosophical issues. I can totally see why someone who doesn't have that doesn't need a justification to set a non-zero non-trivial prior.

Replies from: AprilSR
comment by AprilSR · 2021-11-18T22:35:42.968Z · LW(p) · GW(p)

Until I actually see any sort of plausible impossibility argument most of my probability mass is going to be on "very hard" over "literally impossible."

I mean, I guess there's a trivial sense in which alignment is impossible because humans as a whole do not have one singular utility function, but that's splitting hairs and isn't a proof that a paperclip maximizer is the best we can do or anything like that.

Replies from: samuel-shadrach
comment by acylhalide (samuel-shadrach) · 2021-11-19T04:05:55.026Z · LW(p) · GW(p)

Humans today all have roughly the same intelligence and training history though. It isn't obvious (to me at least) that a human with an extra "intelligence module" will remain aligned with other humans. I would personally be afraid of any human being intelligent enough to unilaterally execute a totalitarian power grab over the world, no matter how good of a person they seem to be.

Replies from: AprilSR
comment by AprilSR · 2021-11-19T18:23:27.355Z · LW(p) · GW(p)

I'm not sure either way on giving actual human beings superintelligence somehow, but I don't think that not working would imply there aren't other possible-but-hard approaches.

Replies from: samuel-shadrach
comment by acylhalide (samuel-shadrach) · 2021-11-20T03:14:39.518Z · LW(p) · GW(p)

Fair, but it acts as a prior against it. If you can't even align humans with each other in the face of an intelligence differential, why will you be able to align an alien with all humans?

Or are the two problems fundamentally different in some way?

Replies from: AprilSR
comment by AprilSR · 2021-11-21T02:16:33.132Z · LW(p) · GW(p)

I mean, I agree it'd be evidence that alignment is hard in general, but "impossible" is just... a really high bar? The space of possible minds is very large, and it seems unlikely that the quality "not satisfactorily close to being aligned with humans" is something that describes every superintelligence.

It's not that the two problems are fundamentally different it's just that... I don't see any particularly compelling reason to believe that superintelligent humans are the most aligned possible superintelligences?

Replies from: samuel-shadrach
answer by cousin_it · 2021-11-15T23:10:38.575Z · LW(p) · GW(p)
  1. An AI that consistently follows a utility function seems possible. I can't think of a law of nature prohibiting that.

  2. A utility function is a preference ordering over possible worlds (actually over probability distributions on possible worlds, but that doesn't change the point).

  3. It seems like at least some possible worlds would be nice for humans. So there exists an ordering that puts these worlds on top.

  4. It's plausible that some such worlds, or the diff between them and our world, have a reasonably short description.

  5. Conclusion: an AI leading to worlds nice for humans should be possible and have a reasonably short description.

The big difficulty of course is in step 4. "So what's the short description?" "Uhh..."
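(A tiny sketch of my own to make steps 2-3 concrete: a utility function scores possible worlds, and sorting by that score is the preference ordering, with some orderings putting the human-friendly worlds on top. The stub `nice_for_humans` below is exactly the part that has no known short description; the world names are placeholders.)

```python
possible_worlds = ["paperclip_maximized", "humans_flourishing", "status_quo", "extinction"]

def nice_for_humans(world: str) -> float:
    """Stub utility function; the whole difficulty of step 4 hides in here."""
    scores = {"humans_flourishing": 1.0, "status_quo": 0.5,
              "extinction": 0.0, "paperclip_maximized": 0.0}
    return scores[world]

# The utility function induces a preference ordering: best worlds first.
ordering = sorted(possible_worlds, key=nice_for_humans, reverse=True)
print(ordering)  # ['humans_flourishing', 'status_quo', ...]
```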

45 comments

Comments sorted by top scores.

comment by shminux · 2021-11-15T18:11:41.375Z · LW(p) · GW(p)

I don't believe alignment is possible. Humans are not aligned with other humans, and the only thing that prevents an immediate apocalypse is the lack of recursive self-improvement on short timescales. Certainly groups of humans happily destroy other groups of humans, and often destroy themselves in the process of maximizing something like the number of statues. The best we can hope for is that whatever takes over the planet after meatbags are gone has some of the same goals that the more enlightened meatbags had, where "enlightened" is a very individual definition. Maybe it is a thriving and diverse Galactic civilization, maybe it is the word of God spread to the stars, maybe it is living quietly on this planet in harmony with nature. There is no single or even shared vision of the future that can be described as "aligned" by most humans.