Ronny and Nate discuss what sorts of minds humanity is likely to find by Machine Learning

so8res

Ronny and Nate discuss what sorts of minds humanity is likely to find by Machine Learning

post by So8res, Ronny Fernandez (ronny-fernandez) · 2023-12-19T23:39:59.689Z · LW · GW · 30 comments

  The basic (counting) argument
  Evolution / Reflection Process is Path Dependent
  Summary and discussion of training an agent in a simulation
  Alignment problem probably fixable, but likely won't be fixed
  Discussing whether this argument about training can be formalized
None
33 comments

Context: somebody at some point floated the idea that Ronny might (a) understand the argument coming out of the Quintin/Nora camp, and (b) be able to translate them to Nate. Nate invited Ronny to chat. The chat logs follow, lightly edited.

The basic (counting) argument

Ronny Fernandez

Are you mostly interested in Quintin's newest post?

I haven’t read it but I don’t suspect it’s his best

So8res

I'm more interested in something like "what are the actual arguments here".

I'm less interested in "ronny translates for others" and more interested in "what ronny believes after having spoken to others", albeit with a focus on the arguments that others are making that various locals allegedly buy.

Ronny Fernandez

Sweet that’s way better

So8res

Options: (a) i start asking questions; (b) you poke me when you wanna chat; (c) you monologue a bit about places where you think you know something i don't

and obviously (d) other, choose your own adventure

Ronny Fernandez

Let's start real simple. Here is the basic argument from my point of view:

If there's a superintelligence with goals very different from mine, things are gonna suck real bad.
There will be a superintelligence.
Its goals will be very different from mine.

Therefore: Things will suck real bad.

I totally buy 1 and 2, and find 3 extremely plausible, but less so than I used to for reasons I will explain later. Just curious if you are down with calling that the basic argument for now.

So8res

works for me!

and, that sounds correct to me

Ronny Fernandez

Some points:

One of the main things that has me believing 3 is a sort of counting argument.
Goals really can just be anything, and we're only selecting on behavior.
Corrigibility is in principle possible, but seems really unnatural.
It makes sense to behave pretty much like a mind with my goals if you're smart enough to figure out what's going on, until you get a good coup opportunity, and you coup.
So like P(good behavior | my values) ~ P(good behavior | not my values) so the real question is P(my values),which seems real small

So8res

I agree with 1-4 and am shaky on 5

4 is slightly non-sequiturish, though true

Ronny Fernandez

4 is to establish 5

good behaviour in training, to be clear

So8res

my objection to 5 is "so the real question is"; i don't super buy the frame; there are things to look at like the precise behavior trajectory and the mind internals, and the "real question" involves that stuff etc.

Ronny Fernandez

(This is going to be easier for me if I let myself devil's advocate more than I think is maximally epistemically healthy. I'm gonna do a bit of that.)

Ok, so here's an analogy on the counting argument. If you were to naively count the ways gas in the room might be, you would find that many of them kill you. This is true if you do max entropy over ways it could be described in English. However, if you do max ent over the parameters of the individual particles in the gas, you find that almost never do they kill you. It's also true that if you count the superintelligent programs of length n, almost all of them kill you, but you shouldn't do max ent over the programs in python or whatever, you should do max ent over the parameters, and then condition that on stochastic gradient descent. This might well tend to always average out to finding a model that straightforwardly tries to cause something a lot like what your loss function is pointing at.

So from my actual point of view, it seems like a lot depends on what the machine learning prior is like, and I don't have much of a clue what that's like

So8res

a thing i agree with: "if you arrange air particles by sampling a program (according to simplicity) and letting that program arrange them, most resulting configurations kill you. if instead you arrange air particles by sampling an entire configuration (sampled uniformly) then most resulting configurations don't kill you." (this is how the physical side of the analogy translates, in my language.)

i don't understand what analogy you're trying to draw from there; i don't understand what things are "programs" and what things are "parameters"

if i sorta squint at your argument, it sounds like you're trying to say something like "i think that you, nate, think that superintelligent goals are likely to be more like a randomly sampled program, but i think that for all we know maybe inner alignment happens basically automatically"

i don't understand how your analogy is supposed to be an argument for that claim though

it seems perhaps worth mentioning that my reasons for expecting inner misaligment are not fundamentally "because i know so little that i must assume the goals are random", but is built of more knowledge than this

Ronny Fernandez

Ok cool, my basic argument is a counting argument

Like basically alignment and corrigibility are high complexity

Disjunction of all other goals plus scheming is much higher weight

So8res

insofar as you're trying to use that argument to be like "this is baby's first argument for other goals being plausible at all, and thus we shouldn't write off the risk", i'm like "sure"

insofar as you're like "and this is the main/strongest argument for the goals turning out elsewise, which i shall now undermine" i'm like "nope"

Ronny Fernandez

Oh nah, this is my primary argument for other goals are much more likely

I think few people do

So8res

(the "plus scheming" also implies to me a difference in our models, i note parenthetically while following the policy of noting each place that something feels off)

Ronny Fernandez

(Agreed scheming is baked in)

Ok cool, I just don’t know the other arguments

So8res

well, this is one place that the analogy from evolution slots in

i could gesture at other arguments, or i could listen to you undermine the argument that you consider primary

to be clear i do think that this primary argument serves as a sort of ignorance prior, later modified by knowledge

Ronny Fernandez

So I always saw the evolution analogy as at best being an existence proof, and a good one, but I don’t see what else it is supposed to tell me

I’m interested in the other arguments and interested in fleshing out the analogy

Especially if we could say it as not an analogy

I also think me and my reward mechanisms (or whatever), which I am similarly very misaligned with, are a good analogy

Evolution / Reflection Process is Path Dependent

So8res

well the rough form of the argument is "goals aren't selected at random; squirrels need to eat through the winter before they can conceptualize winter or metabolisms or calories or sex"

with the force of the argument here going through something like "there are lots of ways for a mind to be architected and few ways for it to be factored around a goal"

at which point we invoke a sort of ignorance prior not over the space of goals but the mechanics of the mind

which is then further (negatively) modified by the practicalities of a mind that must work ok while still stupid

(and at this juncture i interpret the shard theory folk as arguing something like "well the shards that humans build their values up around are very proximal to minds; e.g. perhaps curiosity is instrumentally useful for almost any large-world task and human-esque enjoyment-of-curiosity is actually a near-universal (or at least common) architecture-independent environment-independent strategy for achieving that instrumental value, and we should expect it to get baked into any practical mind's terminal values in the same way it was baked into ours (or at least find this pretty plausible)", or something?)

(which seems kinda crazy to me, but perhaps i don't understand the argument yet and perhaps i shouldn't be trying to run ahead and perhaps i shouldn't be trying to argue against other people through you)

Ronny Fernandez

well the rough form of the argument is "goals aren't selected at random; squirrels need to eat through the winter before they can conceptualize winter or metabolisms or calories or sex"

I’m not sure exactly what I am supposed to get out of this? Minds will tend to be terminally into stuff that is instrumentally useful to the goal of the outer optimizer?

Seems like you’re saying “not random, this other way” what’s the other way?

So8res

that's a fine specific piece, sure. the more general piece is "there are lots and lots of ways for a mind to achieve low training-loss at a fixed capability level"

Ronny Fernandez

Ok yeah agreed. I didn’t mean to say that you’re selecting a goal, should’ve said program in general

So8res

Seems like you’re saying “not random, this other way” what’s the other way?

not sure what you think the difference is between "not random" and "uniformly random but according to a different measure"; from my perspective i'm basically saying "we can move from random over the space of goals, to random over the space of mind architectures, to random over the space of mind architectures that have to perform well-enough while stupid, to random over the space of mind architectures which are trained using some sort of stochastic gradient descent, to random over the space of mind architectures that consist of training [some specific architecture] using [some specific optimizer]" and i'm like "yep it all looks pretty dour to me"

where it seemed to me like you were trying to say something like "i agree that random over general programs is bad, but for all i know, random over mind architectures has a high chance of being good" and i'm like "hmm well it sounds like we have a disagreement then"

Ronny Fernandez

not sure what you think the difference is between "not random" and "uniformly random but according to a different measure"; from my perspective i'm basically saying "we can move from random over the space of goals, to random over the space of mind architectures, to random over the space of mind architectures that have to perform well-enough while stupid, to random over the space of mind architectures when trained using some sort of SGD, to random over the space of mind architectures when training [some specific architecture] using [some specific optimizer]" and i'm like "yep it all looks pretty dour to me"

I have specific reasons for being like, if you were selecting a python program that was superintelligent, even if you got to watch it in simulation for a million years, then we still definitely all die

I thought those same specific reasons carried over to machine learning more than I currently think they do

Ronny Fernandez

where it seemed to me like you were trying to say something like "i agree that random over general programs is bad, but for all i know, random over mind architectures has a high chance of being good" and i'm like "hmm well it sounds like we have a disagreement then"

Specifically for all I know random over parameter space of maybe superintelligent planners conditioned on some straightforward SGD plan is good

I mean, I’m not gonna risk it

But I don’t have mathematical certainty we’re fucked like I do with python programs

So8res

I thought those same specific reasons carried over to ML more than I currently think they do

so here's a thing i believe: p(survival | solar system is sampled randomly from physical configurations) << p(survival | solar system is arranged according to a superintelligent program sampled according to simplicity) << p(survival | solar system is arranged according to a randomly trained mind) <* p(survival | solar system is arranged according to a random evolved alien species)

it sounds like there's maybe some debate about the strength of the <*

Ronny Fernandez

I assume you mean not randomly trained, but just that we keep doing the same thing we’ve been doing

So8res

yeah, sorry, "sampled randomly from the space of trained minds"

Ronny Fernandez

Yeah cool, so I agree with all of them. To be clear, trained by humans who are trying to take over the world and haven’t thought about this, let's say

So8res

attempting to distinguish two hypotheses you might be arguing for: are you arguing for something more like (a) maybe lots of trained minds happen to be nice (e.g. b/c curiosity always ends up in them in the same way); or (b) maybe a little bit of 'design' (in the sense of that-which-humans-do-and-evolution-does-not) goes a long way

Ronny Fernandez

The second one

Not the first at all

But I don’t have mathematical certainty we’re fucked like I do with python programs

This is where I’m at. Like I know we’re fucked if you select python programs using behavior only

So8res

and is the idea more like

"with a little design effort, getting curiosity in them in the right way is actually easy"
"with a little design effort, maybe we can make limited corrigible things that we can use to do pivotal acts, without needing to load things like curiosity";
"with a little design effort, maybe we can load all sorts of other things, unlike curiosity, that still add up to something Friendly"?

Ronny Fernandez

It’s more like SGD is some sort of magic, that for some reason has some sort of prior that doesn’t kill us. Like for instance, maybe scheming is very penalized because it takes longer and ML penalizes run time

So8res

(if that's actually supposed to carry weight, perhaps we do need to drill down on the 'scheming' stuff, previously noted as a place where i suspect we diverge)

Ronny Fernandez

It does seem kinda crazy for it to be that big of an advantage

well the rough form of the argument is "goals aren't selected at random; squirrels need to eat through the winter before they can conceptualize winter or metabolisms or calories or sex"

I want to get into this again. So does it seem right to you to say that the main point of the evolution analogy is that all sorts of random shit will do very well on your loss function at a given capability level? Then if the thing gets more capability, you realize that it did not internalize the loss as it starts getting the random shit?

So8res

more-or-less

Ronny Fernandez

What’s missing or subtly wrong

So8res

a missing piece: the shit isn't random, it's a consequence of the mind needing to achieve low-ish loss while dumb

Ronny Fernandez

Does that tell us something else risk relevant?

So8res

which is part of a more general piece, that the structure of the mind happens for reasons, and the reasons tend to be about "that's the shortest pathway to lower loss given the environment and the capability level", and once you see that there are all sorts of path-dependent specific shortcuts it starts to seem like the space of possible mind-architectures is quite wide

then there are deeper pieces, zooming into the case of humans, about how the various patched-on pieces are sometimes in conflict with each other and other hacks are recruited to resolve those conflicts, resulting in this big dynamic system with unknown behavior under reflection

Ronny Fernandez

Can you give two examples of path dependent specific shortcuts for the same loss function?

So8res

sure

hunger, curiosity

Ronny Fernandez

Right

Hmm

Ok i was imagining like, maybe to breed you get really into putting your penis into lips, or maybe you get really into wrapping your penis in warm stuff

hunger, curiosity

So they aren’t mutually exclusive?

These aren’t like training histories

What are the paths that hunger and curiosity are dependent on?

So8res

maybe i don't understand your question, but yeah, the sort of thing i'm talking about is "the easiest way to perturb a mind to be slightly better at achieving a target is rarely for it to desire the target and conceptualize it accurately and pursue it for its own sake"

Ronny Fernandez

Ahh nice that’s very helpful

So8res

there's often just shortcuts like "desire food with appropriate taste profiles" or whatever

the specifics of hunger are probably pretty dependent on the specifics of biology and available meals in the environment of evolutionary adaptedness

i wouldn't be surprised if the specifics of curiosity were dependent on the specifics of the social pressures that shaped us

(though also, more generally, it being curiosity-per-se that got promoted to terminal, as opposed to a different cut of the possible instrumental strategies being promoted, seems like a roll of the dice to me)

Ronny Fernandez

The thing I think Quintin successfully criticizes is the analogy as an n = 1 argument for misalignment by default, which to be fair was already a very weak argument

So8res

also i suspect that curiosity is slightly more likely to be something that random minds absorb into their terminal goals, depending how those dice come up.

things like "fairness" and "friendship" seem way more dependent on the particulars of the social environment in the environment of evolutionary adaptedness

and much of my actual eyebrow-raising at this space of hypotheses comes from the way that i expect the end result to be quite sensitive to the processes of reflection

Ronny Fernandez

and much of my actual eyebrow-raising at this space of hypotheses comes from the way that i expect the end result to be quite sensitive to the processes of reflection

Says the CEV guy

So8res

though another big chunk of my eyebrow-raising comes from the implicit hypothesis that "absorb a human-like slice of the instrumental values into terminal values in a human-like way" is a particularly generic way to do things even in wildly different architectures under wildly different training regimes

Ronny Fernandez

Says the CEV guy

(We don’t need to open that can or worms now, but I would like to some day)

though another big chunk of my eyebrow-raising comes from the implicit hypothesis that "absorb a human-like slice of the instrumental values into terminal values in a human-like way" is a particularly generic way to do things even in wildly different architectures under wildly different training regimes

Ok yeah, I also think that’s bs

and much of my actual eyebrow-raising at this space of hypotheses comes from the way that i expect the end result to be quite sensitive to the processes of reflection

And agree with this

So8res

here i have some sense of, like, "one could argue that all land-based weight-carrying devices must share properties of horses all day long, before their hypothesis space has been suitably widened"

Ronny Fernandez

I’m like look, I used to think the chances of alignment by default were like 2^-10000:1

So8res

(We don’t need to open that can or worms but I would like to some day)

(yeah seems like a tangent here, but i will at least note that "all architectures and training processes lead to the absorption of instrumental values into terminal values in a human-esque way, under a regime of human-esque reflection" and "most humans (with fully-functioning brains) have in some sense absorbed sufficiently similar values and reflective machinery that they converge to roughly the same place" seem pretty orthogonal to me)

Ronny Fernandez

I now can’t give you a number with anything like the mathematical precision I used to think I could give

So8res

ehh it feels to me like i can get you more than 100:1 against alignment by default in the very strongest sense; i feel like my knowledge of possible mind architectures (and my awareness of stochastic gradient descent-accessible shortcut-hacks) rules out "naive training leads to friendly AIs"

probably more extreme than 2^-100:1, is my guess

it seems to me like all the room for argument is in "maybe with cleverness and oversight we can do much better than naive training, actually"

Ronny Fernandez

I’m like look, I used to think the chances of alignment by default were like 2^-10000:1

I still think I can do this if we’re searching over python programs

So8res

The thing I think Quintin successfully criticizes is the analogy as an n = 1 argument for misalignment by default, which to be fair was already a very weak argument

(yeah i have never really had the sense that the evolutionary arguments he criticizes are the ones i'm trying to make)

I still think I can do this if we’re searching over python programs

yeah sure

Ronny Fernandez

and is the idea more like
"with a little design effort, getting curiosity in them in the right way is actually easy"
"with a little design effort, maybe we can make limited corrigible things that we can use to do pivotal acts, without needing to load things like curiosity";
"with a little design effort, maybe we can load all sorts of other things, unlike curiosity, that still add up to something Friendly"?

2 seems best to me right now

So8res

it seems maybe worth saying that my model says: i expect the naive methods to take lots of hacks and shortcuts and etc., such that it'd betray-you-if-scaled in a manner that would be clear if you knew how to look and interpret what you saw, and i mostly expect humanity to die b/c i expect them to screw their eyes shut in the relevant ways

Ronny Fernandez

Ok yeah that seems plausible

So8res

and in particular, if you could figure out how these minds were working, and see all the shortcuts etc. etc., you could probably figure out how to do the job properly (at least to the bar of a minimal pivotal task)

this is part of what i mean by "i don't think alignment is all that hard"

my high expectation of doom comes from a sense that there's lots of hurdles and that humanity will flub at least one (and probably lots)

so insofar as you're trying to argue something like "masters of mind and cognition would maybe not have a challenge" i'm like "yeah sure"

(though insofar as you're arguing something like "maybe naive techniques work" i'm like "i think i see enough hacky shortcuts that hill-climbing-like approaches can take, and all the 'clever' ideas people propose seem to me to just wade around in the sea of hacky shortcuts, and i don't personally have hope there")

i shall now stfu

Summary and discussion of training an agent in a simulation

Ronny Fernandez

Ok, so, I want to summarize where we are. Here are some things that seem important to me:

We agree roughly about what happens if you select a python program for superintelligent good behavior: you almost always end up with an unaligned mind that will coup eventually.

I was like well, Quintin convinced me that the prior over models is very different from the prior over python programs.

I put the main argument in terms of prior. You basically were like nah, it's not just the prior, it also matters a lot that SGD is going to do things incrementally. Most incremental changes you can make to a mind to achieve a certain loss are not going to cause the mind to be into the loss itself.

So8res

(i'm not entirely sure what "select for superintelligent good behavior" means; i'd agree "simplicity-sampled python superintelligences kill you (if you have enough compute to run them and keep sampling until you get one that does anything superintelligent at all)" and if you want to say "that remains true if you condition on ones that behave well in a training setup" then i'd need to know what 'well' means and what the training setup is. but i expect this not to be a sticking point.)

that sounds roughly right-ish to me,

though i don't really understand where you draw the distinction between "the prior over models (selected by SGD)" and "arguments about how incremental changes are likely to affect minds"

Ronny Fernandez

(i'm not entirely sure what "select for superintelligent good behavior" means; i'd agree "simplicity-sampled python superintelligences kill you" and if you want to say "that remains true if you condition on ones that behave well in a training setup" then i'd need to know what 'well' means and what the training setup is. but i expect this not to be a sticking point.)

I mean you get to watch them in simulation but you do not get to read/understand the code

So8res

like, from my perspective, the arguments about incrementality are sorta arguments about what the prior over (SGD-trained) models looks like

but also i don't care / i'm not bidding we go into it, i'm just noting where things seem not-quite-how-i-would-dice-them-up

Ronny Fernandez

Interesting, I'm thinking of them as like, being part of P(model | data) rather than P(model)

So8res

i'd also additionally note that the point i'm trying to drive at here is a little less like "incremental changes don't make the mind care about loss" and a little more like "the prior is still really wide, so wide that a counting argument still more-or-less works"

I mean you get to watch them in simulation but you do not get to read/understand the code

(sure, but like, how ironclad is the simulation and what are you watching them do?)

Ronny Fernandez

i'd also additionally note that the point i'm trying to drive at here is a little less like "incremental changes don't make the mind care about loss" and a little more like "the prior is still really wide, so wide that a counting argument still more-or-less works"

I mean, this seems intuitively plausible to me, but I wouldn't be able to convince a reasonable person to who it was not intuitively plausible

So8res

the place where the argument is supposed to have force is related to "you can argue all you want that any flying device will have to flap its wings, and that won't constrain airplane designs"

i'm not sure whether you're saying something like "i don't believe that that's the actual part of the argument that has force, and hereby query you to articulate more of the forceful parts of the argument", vs whether you're saying things like "that argument is not in a format accepted by the Unconvinceable Hordes, despite being valid and forceful to me", or...?

So8res

(but for the record, insofar as you're like "do you expect that to convince the Unconvinceable Hordes?" i'm mostly like "no, mr bond, i expect us to die")

Ronny Fernandez

No no, I'm imagining convincing Ronnys with only slightly different intuitions or histories. Such people are much more convinceable

So8res

More like the first, and it's more like I don't quite understand where the force comes from

do you have much experience with programming?

Ronny Fernandez

Only python and only like pretty junior level

So8res

do you have familiarity with, like, the sense that a task seems straightforward, only to find an explosion of options in the implementation?

Ronny Fernandez

I don't know, I implemented gpt 2 with tutors once

do you have familiarity with, like, the sense that a task seems straightforward, only to find an explosion of options in the implementation?

yeah

maybe not enough

For sure with airtables and zapier automations actually

So8res

cool

and relatedly, consider... arguments that airplanes must flap their wings. or arguments that computers shouldn't be able to run all that much faster than brains. or arguments that robots must run on something kinda like a metabolic system.

where the point in those examples is not just "artificial things work with different underlying mechanics than biological things" but that there are lots of ways to do things, including ones that start way outside your hypothesis space

Ronny Fernandez

in fact artificial things do not work with different underlying mechanics, there's just lots of mechanics and it rarely turns out that we do it the same way

So8res

right

and when you don't understand in detail how two (or possibly even three) different things work then you're likely to dramatically underestimate the width of the space

perhaps i am moving too fast here. it sounded to me like you were like "the prior over models is different than the prior over programs" and i was like "yep" and then you were like "so there's an appreciable chance i'll win the lottery" and i was like "?? no" and you were like "wait why not?" and i was like "because the space is still real wide"

Ronny Fernandez

Yeah, I think definitely any space of models large enough to contain superintelligent aligned things also contains lots and lots of superintelligent non-aligned things

Alignment problem probably fixable, but likely won't be fixed

So8res

"SGD makes incremental changes, and the minds have to work while dumb, and there are lots of ways for SGD to make a mind work better while dumb that don't do the thing you want" is an argument that's (a) correct in its own right, but also (b) sheds light on how many ways to do the job wrong there are

from which, i claim, it's proper to generalize that the space is wide; to see that arguments of the form "maybe i win the lottery" are basically analogous to arguments of the form "maybe human minds are near the limit of physical constraints on cognition"

Ronny Fernandez

Of the form?

Not just of the quality?

So8res

i'm not sure what distinction you're trying to draw there

Ronny Fernandez

Like they're analogous somehow

So8res

the arguments seem to me similarly valid (and in particular, invalid)

like, the "SGD makes incremental changes" is one plausible-feeling example of how if you really understood what was going on inside that mind, you'd cry out in terror

So8res

from which we generalize not that you'll see exactly that thing, but that you will in fact cry out in terror

when there's a plausible way for code to have the cry-out-in-terror property, it very likely will unless counter-optimization was somehow applied

Ronny Fernandez

Is the main problem here like that you end up with something that will coup you later or something that will build things that will kill you later/get smarter and then start wearing condoms

So8res

so my argument is not "and this survives all counter-optimization"

my reason for expecting doom is not that i think this problem is unfixable, it's that i think it won't be addressed

So8res

that said, my guess is that it will take something much more like "understand the mind" than "provide better training"

but, like, the argument against "just put a bit of thought into the training" working has a bunch less force than the argument against "just train it to be good" working

(still, i think, considerable force, but)

Ronny Fernandez

but does survive all counter optimization selecting only on behavior, or no?

So8res

Is the main problem here like that you end up with something that will coup you later or something that will build things that will kill you later/get smarter and then start wearing condoms

not sure i see the distinction

Ronny Fernandez

Like, humans weren't waiting around for a certain number to be factored until they couped evolution, they're more like the second thing

So8res

that seems to me that it's more of a fact about evolution not watching them and slapping down visible defiance, than a fact about human psychology?

(or, well, a fact about "evolution not slapping down visible defiance" plus a fact about "humans not yet being smart enough to coordinate to overcome that", but)

Ronny Fernandez

Yeah, like if evolution were very shortsighted, I think it should be happy with early us

I think similarly, if we are very shortsighted, we might be happy with early models before they're capable enough that the divergence between what we wanted and they want is apparent

So8res

sure

...insofar as there's a live question, i still don't understand it

Ronny Fernandez

Well this is different from: you get a superintelligence, and it's like "hmm, I'm not sure if I'm in training or not, let me follow a strategy that maximizes my chance of couping when not in training"

So8res

if you're like "what goes wrong if you breed chimps to be better at inclusive genetic fitness and also smarter" then i'm mostly like "a chimp needs to eat long before it can conceptualize calories; the hunger thing is going to be really deep in there" (or, more generally, you'd get some mental architecture that solves your training problem while being unlike how you wanted, but I'll use that example for now).

could you in principle breed them to the point that they stop having a hunger drive and start hooking in their caloric-intake to their explicit-model of IGF? probably, but it'd probably take (a) quite a lot of training, and (b) a bunch of visibility into the mind to see what's working and what's not.

mostly if you try that you die to earlier generations that rise up against you; if not then you die to the fact that you were probably measuring progress wrong (and getting things that still deeply enjoy eating nice meals but pretend they don't b/c that's what it turns out you were training for)

i doubt that the rising-up ever needs to depend on factoring a large number; that only happens if the monkeys think you're extremely good at spoofing their internal states, and you aren't (in this hypothetical where you don't actually understand much of what's going on in their minds)

but whether it happens right out in the open (because you, arguendo, don't understand their minds well enough to read those thoughts) or whether it feels like a great betrayal (e.g. because they were half-convinced that they were your friends, and only started piecing things together once they got smarter) feels like... i dunno, could go either way

(cf planecrash, i think that big parts of planecrash were more-or-less about this point)

Well this is different from: you get a superintelligence, and it's like "hmm, I'm not sure if I'm in training or not, let me follow a strategy that maximizes my chance of couping when not in training"

yeah this ~never happens, especially if you haven't attained mastery of their mind

it's maybe possible if you take the "master mind" route, though i really would not recommend it; if you have that kind of mastery you should have better options available

Discussing whether this argument about training can be formalized

So8res

shall we make this into an LW dialog of some sort? push for more formality?

Ronny Fernandez

Yeah, I’m down. Let’s do both.

So8res

seems kinda hard to make something formal to me because the basic argument is, i think, "there's really a lot of ways for a model to do well in training", but i don't know how one is supposed to formalize that. i guess i'm curious where you think the force of formality comes in for the analogous argument when it comes to python programs

Ronny Fernandez

I think it’s like if someone asks me why do you think the probability is so low? I can explain that I have an argument that it’s ridiculously low that I have a fair amount of weight on.

So8res

sure, but what is the formal argument (for python programs) that it's ridiculously low?

like, suppose someone says "you don't know what python programs conditioned on apparent-superintelligence and apparent-goodness do, why should i believe you?"

Ronny Fernandez

Ahh great, I did write a post about this but:

Uhh suppose that you have arbitrary finite compute, but no mind access. You get to select programs by running them in an extremely high fidelity environment that even a superintelligence can’t distinguish from reality. You run them from some lab and then watch for five years. Almost always nothing happens, so you go to the next program. Interestingly, you do eventually find superintelligent models but they almost always act very aligned.

Why is this? Because P(acting aligned for 5 years | simulation and not aligned) ~ P(acting aligned for 5 years | simulation and aligned). So, we are stuck figuring out P(aligned).

Well, human values are complicated. It takes lots of bits to specify them. So it's unlikely that you end up with a python program that has that as goals, since literally any mind with any other goal would also act that way in the simulation.

So say it take 1000 bits to specify human values, then you know you have roughly 2^-1000. Because your sim only gives you evidence of superintelligence, not of sharing the particular goal.

(The argument is similar for P(corrigible).)

So8res

i'm slightly hung up here on how it doesn't just hack its way through the outer observers but

Ronny Fernandez

Yeah I don't know, magic

So8res

and what would you say to the objection that the observers can simply wait and watch for something that professes human values from the get-go, rather than waiting to profess its values until it looks around and figures out what the observers want?

Ronny Fernandez

Well back when I used to think this was analogous-ish to machine learning, I would have said that won’t happen

It will already have info about what you want

It might for instance have a record of what programs you vetoed before

So8res

is the claim "misaligned AIs will figure out what you want so quickly that you won't be able to notice their hesitation"?

Ronny Fernandez

No no, it’s that if you were training the boy, he would know. You’re not actually, but it’s supposed to be an analogy, so you give him the power

So8res

wait, so you're like "we're going to take arbitrary python programs, credibly show them they're under observation for 5y, ask them to do nice things for us, and then set them loose if they did in fact do the nice things"?

Ronny Fernandez

Well, they figure out they’re in the simulation. They do have to figure that bit out. But they don’t have to figure out that they’re being watched by humans

And I’m like you can do whatever, but you don’t have mind access, and you have arbitrary finite compute. You have to select a superintelligent program using behavior. You’re fucked no matter what you do

So8res

but the force of the argument here is supposed to come from "the observation that it does nice things while under our thumb is not much evidence about whether it'll do nice things when unleashed"?

Ronny Fernandez

but the force of the argument here is supposed to come from "the observation that it does nice things while under our thumb is not much evidence about whether it'll do nice things when unleashed"

That and that P(aligned or corrigible) is tiny

Also we don’t have to tell it it’s under our thumb, we just have to tell it we’re humans

So8res

but suppose someone says: aha, but we are training the boy, and so this argument doesn't have nearly the force of 2^-1000, because there exist python programs that, in less than 1000 bits, say "optimize whatever concept you're being trained towards"

Ronny Fernandez

Yeah or similar. I think it’s much less analogous than I used to think it was, but the broad structure I think is in some ways similar to the broad structure of the argument you gave

At the level of like your argument is still, your data isn’t much evidence, and the prior of your favored outcome is tiny

So8res

(yeah i think my state is something like "old argument was strong but not that strong; new argument is strong but not that strong" and i can't tell whether you're like "i now agree (but used to not)" or whether you're like "it still looks to me like old argument was super strong, new argument is comparably weak")

Ronny Fernandez

Old argument is still strong for python programs, is weak as analogy for machine learning. I want comparably strong argument for ML

Or like, I want to dig in on why the evidence is weak, and why the prior is small in ML. No analogy

So8res

(albeit not as much shorter as one might hope; it probably takes a lot of humane values to figure out "what we mean" in a humane way)

(this is essentially the observation that indirect normativity is probably significantly easier that fully encoding our values, albeit still not easy)

(perhaps you're like "eh, it still seems it should take hundreds if not thousands of bits to code for indirect normativity"?, to which i'd be like "sure maybe", as per the first parenthetical caveat)

same point from a different angle: the strength of the argument is not based on the K-complexity of our values, it's based on the cross-entropy between our values and the training distribution

Ronny Fernandez

same point from a different angle: the strength of the argument is not based on the K-complexity of our values, it's based on the cross-entropy between our values and the training distribution

Interesting

I mean even if it was 3 bits, 7/8 times we’re fucked

So8res

yeah totally (i was thinking of saying that myself)

Ronny Fernandez

Good enough for me

So8res

so part of why i'm drilling on this here is something like "i suspect that model-space and program-space are actually pretty similar analogy-wise, and that the thing where your intuitions treat them very differently is that when you think of training models, for some reason that makes the difference between K-complexity(values) and cross-entropy(training-data, values) more salient

though i guess you might be like "SGD is a really different prior from program length"

Ronny Fernandez

(another way of saying the observation is that the strength of the argument is not based on the K-complexity of our values, it's based on the cross-entropy between our values and the training distribution)

I wanna understand this point more then. That’s interesting

tho i guess you might be like "SGD is a really different prior from program length"

Yeah that’s right

Or like model space is

Maybe also SGD is unlike Bayesian updating

So8res

yeah i don't understand the "model space is different" thing, like, models are essentially just giant differentiable computation graphs (and they don't have to use all that compute); i don't see what's so different between them and python programs

(it sounds almost like someone saying "ok i see how this argument works for python, but i don't understand how it's supposed to work for C" or something)

though "well we search the space very differently" makes sense to me

Ronny Fernandez

Well for one, they’re of finite run time

That seems pretty importantly different

So8res

so does your whole sense of difference go out the window if we do something autogpt-ish?

Ronny Fernandez

Let me think

It’s still weird in that you’re selecting a finite run time thing and then iterating that exact thing

So8res

sure

does the difference go out the window once people are optimizing in part according to the auto-ized version's performance?

Ronny Fernandez

Yeah it sure starts to for me? I feel like I’ll talk to Quintin at some point and then he’ll make me not feel that way, though

So8res

and: how about "runs for long enough, e.g. by doing a finite-but-large number of loops though a fixed architecture"?

Ronny Fernandez

and: how about "runs for long enough, e.g. by doing a finite-but-large number of loops though a fixed architecture"?

How's this different from the last one?

So8res

/shrug, it's not supposed to be a super sharp line, but on one end of the spectrum you could imagine lower-level loops/recurrence in training (after some architecture tweaks), and on the other end of the spectrum you could imagine language models playing a part in larger programs a la auto-GPT

also, if runtime got long enough, would it stop mattering?

Ronny Fernandez

I mean definitely if it got long enough

Enough might be really big

There’s programming languages that compile into transformers. I wonder what they’re like

So8res

cool. so if we're like "well, SGD may find different programs, and also we're currently selecting over programs for their ability to perform a single pretty-short pass well", then i'm like: yep those seem like real differences

i agree that if #2 holds up then that could shake things up a fair bit.

but insofar as your point is supposed to hold even if #2 falls, it seems to me that you're basically saying that the cross-entropy between the training distribution and human values might be way smaller when we sample according to SGD rather than when we sample according to program length

i suspect that's false, personally

though also i guess i'll pause and give you an opportunity to object to this whole frame

Ronny Fernandez

I'm a bit worried about Quintin feeling misrepresented by me so I guess I should say that I am emphatically not representing Quintin here. I def want to say something like I'm sure Quintin would be much more persuasive to me than I was to myself, and that if Quintin were sitting next to me coaching me, I would've been much more convincing to everyone overall. I'm pretty confident of that.

I think the best thing for me to do here is to go off and read some more things that are optimistic about good results from scaling ML to superintelligence, and then come back and have another conversation with you.

30 comments

Comments sorted by top scores.

comment by DavidW (david-wheaton) · 2023-12-21T01:48:22.478Z · LW(p) · GW(p)

Nate, please correct me if I'm wrong, but it looks like you:

Skimmed, but did not read [LW · GW], a 3,000-word essay
Posted a 1,200-word response that clearly stated that you hadn't read it properly
Ignored a comment [LW(p) · GW(p)] by one of the post's authors saying you thoroughly misunderstood their post and a comment [LW(p) · GW(p)]by the other author offering to have a conversation with you about it
Found a different person to talk to about their views (Ronny), who also had not read their post
Participated in a 7,500-word dialogue with Ronny in which you speculated about what the core arguments of the original post might be and your disagreements

You've clearly put a lot of time into this. If you want to understand the argument, why not just read the original post and talk to the authors directly? It's very well-written.

Replies from: habryka4

↑ comment by habryka (habryka4) · 2023-12-21T04:18:59.074Z · LW(p) · GW(p)

I don't want to speak for Nate, and I also don't want to particularly defend my own behavior here, but I have kind of done similar things around trying to engage with the "AI is easy to control" stuff.

I found it quite hard to engage with directly. I have read the post, but I would not claim to be able to be close to passing an ITT of its authors and bounced off a few times, and I don't currently expect direct conversation with Quintin or Nora to be that productive (though I would still be up for it and would give it a try).

So I have been talking to friends and other people in my social circle who I have a history of communicating well with about the stuff, and I think that's been valuable to me. Many of them had similar experiences, so in some sense it did feel like a group of blind men groping around an elephant, but I don't have a much better alternative. I did not find the original post easy to understand, or the kind of thing I felt capable of responding to.

I would kind of appreciate better suggestions. I have not found just forcing myself to engage more with the original post to help me much. Dialogues like this do actually seem helpful to me (and I found reading this valuable).

Replies from: Zack_M_Davis, leogao, TurnTrout

↑ comment by Zack_M_Davis · 2023-12-21T05:49:49.221Z · LW(p) · GW(p)

How much have you read about deep learning from "normal" (non-xrisk-aware) AI academics? Belrose's Tweet-length argument against deceptive alignment sounds really compelling to the sort of person who's read (e.g.) Simon Prince's textbook but not this website. (This is a claim about what sounds compelling to which readers rather than about the reality of alignment, but if xrisk-reducers don't understand why an argument would sound compelling to normal AI practitioners in the current paradigm, that's less dignified than understanding it well enough to confirm or refute it.)

↑ comment by leogao · 2023-12-21T08:19:48.840Z · LW(p) · GW(p)

I think I could pass the ITTs of Quintin/Nora sufficiently to have a productive conversation while also having interesting points of disagreement. If that's the bottleneck, I'd be interested in participating in some dialogues, if it's a "people genuinely trying to understand each other's views" vibe rather than a "tribalistically duking it out for the One True Belief" vibe.

↑ comment by TurnTrout · 2023-12-21T19:27:42.988Z · LW(p) · GW(p)

This is really interesting, because I find Quintin and Nora's content incredibly clear and easy to understand.

As one hypothesis (which I'm not claiming is true for you, just a thing to consider)—When someone is pointing out a valid flaw in my views or claims, I personally find the critique harder to "understand" at first. (I know this because I'm considering the times where I later agreed the critique was valid, even though it was "hard to understand" at the time.) I think this "difficulty" is basically motivated cognition.

Replies from: habryka4, aysja

↑ comment by habryka (habryka4) · 2023-12-21T19:36:45.529Z · LW(p) · GW(p)

I am a bit stressed right now, and so maybe am reading your comment too much as a "gotcha", but on the margin I would like to avoid psychologizing of me here (I think it's sometimes fine, but the above already felt a bit vulnerable and this direction feels like it disincentivizes that). I generally like sharing the intricacies and details of my motivations and cognition, but this is much harder if this immediately causes people to show up to dissect my motivations to prove their point.

More on the object-level, I don't think this is the result of motivated cognition, though it's of course hard to rule out. I would prefer if this kind of thing doesn't become a liability to say out loud in contexts like this. I expect it will make conversations where people try to understand where other people are coming from go better.

Sorry if I overreacted in this comment. I do think in a different context, on maybe a different day I would be up for poking at my motivations and cognition and see whether indeed they are flawed in this way (which they very well might be), but I don't currently feel like it's the right move in this context.

Replies from: TurnTrout

↑ comment by TurnTrout · 2023-12-26T18:49:40.499Z · LW(p) · GW(p)

I think it's sometimes fine, but the above already felt a bit vulnerable and this direction feels like it disincentivizes that

FWIW, I think your original comment was good and I'm glad you made it, and want to give you some points for it. (I guess that's what the upvote buttons are for!)

↑ comment by aysja · 2023-12-23T00:24:08.233Z · LW(p) · GW(p)

Fwiw, I generally find Quintin’s writing unclear and difficult to read (I bounce a lot) and Nora’s clear and easy, even though I agree with Quintin slightly more (although I disagree with both of them substantially).

I do think there is something to “views that are very different from ones own” being difficult to understand, sometimes, although I think this can be for a number of reasons. Like, for me at least, understanding someone with very different beliefs can be both time intensive and cognitively demanding—I usually have to sit down and iterate on “make up a hypothesis of what I think they’re saying, then go back and check if that’s right, update hypothesis, etc.” This process can take hours or days, as the cruxes tend to be deep and not immediately obvious.

Usually before I’ve spent significant time on understanding writing in this way, e.g. during the first few reads, I feel like I’m bouncing, or otherwise find myself wanting to leave. But I think the bouncing feeling is (in part) tracking that the disagreement is really pervasive and that I’m going to have to put in a bunch of effort if I actually want to understand it, rather than that I just don't like that they disagree with me.

Because of this, I personally get a lot of value out of interacting with friends who have done the “translating it closer to my ontology” step—it reduces the understanding cost a lot for me, which tends to be higher the further from my worldview the writing is.

Replies from: thomas-kwa

↑ comment by Thomas Kwa (thomas-kwa) · 2023-12-23T00:58:11.081Z · LW(p) · GW(p)

Yeah, for me the early development of shard theory work was confusing for similar reasons. Quintin framed values as contextual decision influences and thought these were fundamental, while I'd absorbed from Eliezer that values were like a utility function. They just think in very different frames. This is why science is so confusing until one frame proves useful and is established as a Kuhnian paradigm.

comment by Richard_Ngo (ricraz) · 2023-12-20T07:30:05.298Z · LW(p) · GW(p)

ehh it feels to me like i can get you more than 100:1 against alignment by default in the very strongest sense; i feel like my knowledge of possible mind architectures (and my awareness of stochastic gradient descent-accessible shortcut-hacks) rules out "naive training leads to friendly AIs"
probably more extreme than 2^-100:1, is my guess

What is the 2^-100:1 part intended to mean? Was it a correction to the 100:1 part or a different claim? Seems like an incredibly low probability.

Separately:

Ronny Fernandez
I’m like look, I used to think the chances of alignment by default were like 2^-10000:1
I still think I can do this if we’re searching over python programs

This seems straightforwardly insane to me, in a way that is maybe instructive. Ronnie has updated from an odds ratio of 2^-10000:1 to one that is (implicitly) thousands of orders of magnitude different, which should essentially never happen. Ronnie has just admitted to being more wrong than practically anyone who has ever tried to give a credence. And then, rather than being like "something about the process by which I generate 2^-10000:1 chances is utterly broken", he just.... made another claim of the same form?

I don't think there's anything remotely resembling probabilistic reasoning going on here. I don't know what it is, but I do want to point at it and be like "that! that reasoning is totally broken!" And it seems to me that people who assign P(doom) > 90% are displaying a related (but far less extreme) phenomenon. (My posts about meta-rationality [? · GW] are probably my best attempt at actually pinning this phenomenon down, but I don't think I've done a great job so far.)

Replies from: So8res, peterbarnett, ronny-fernandez, None, ronny-fernandez

↑ comment by So8res · 2023-12-20T19:15:24.910Z · LW(p) · GW(p)

my original 100:1 was a typo, where i meant 2^-100:1.

this number was in reference to ronny's 2^-10000:1.

when ronny said:

I’m like look, I used to think the chances of alignment by default were like 2^-10000:1

i interpreted him to mean "i expect it takes 10k bits of description to nail down human values, and so if one is literally randomly sampling programs, they should naively expect 1:2^10000 odds against alignment".

i personally think this is wrong, for reasons brought up later in the convo--namely, the relevant question is not how many bits is takes to specify human values relative to the python standard library; the relevant question is how many bits it takes to specify human values relative to the training observations.

but this was before i raised that objection, and my understanding of ronny's position was something like "specifying human values (in full, without reference to the observations) probably takes ~10k bits in python, but for all i know it takes very few bits in ML models". to which i was attempting to reply "man, i can see enough ways that ML models could turn out that i'm pretty sure it'd still take at least 100 bits".

i inserted the hedge "in the very strongest sense" to stave off exactly your sort of objection; the very strongest sense of "alignment-by-default" is that you sample any old model that performs well on some task (without attempting alignment at all) and hope that it's aligned (e.g. b/c maybe the human-ish way to perform well on tasks is the ~only way to perform well on tasks and so we find some great convergence); here i was trying to say something like "i think that i can see enough other ways to perform well on tasks that there's e.g. at least ~33 knobs with at least ~10 settings such that you have to get them all right before the AI does something valuable with the stars".

this was not meant to be an argument that alignment actually has odds less than 2^-100, for various reasons, including but not limited to: any attempt by humans to try at all takes you into a whole new regime; there's more than a 2^-100 chance that there's some correlation between the various knobs for some reason; and the odds of my being wrong about the biases of SGD are greater than 2^-100 (case-in-point: i think ronny was wrong about the 2^-100000 claim, on account of the point about the relevant number being relative to the observations).

my betting odds would not be anywhere near as extreme as 2^-100, and i seriously doubt that ronny's would ever be anywhere near as extreme as 2^-10000; i think his whole point in the 2^-10k example was "there's a naive-but-relevant model that say's we're super-duper fucked; the details of it causes me to think that we're not in particulary good shape (though obviously not to that same level of credence)".

but even saying that is sorta buying into a silly frame, i think. fundamentally, i was not trying to give odds for what would actually happen if you randomly sample models, i was trying to probe for a disagreement about the difference between the number of ways that a computer program can be (weighted by length), and the number of ways that a model can be (weighted by SGD-accessibility).

I don't think there's anything remotely resembling probabilistic reasoning going on here. I don't know what it is, but I do want to point at it and be like "that! that reasoning is totally broken!"

(yeah, my guess is that you're suffering from a fairly persistent reading comprehension hiccup when it comes to my text; perhaps the above can help not just in this case but in other cases, insofar as you can use this example to locate the hiccup and then generalize a solution)

↑ comment by peterbarnett · 2023-12-20T08:23:48.883Z · LW(p) · GW(p)

(I obviously don’t speak for Ronny) I’d guess this is kinda the within-model uncertainty, he had a model of “alignment” that said you needed to specify all 10,000 bits of human values. And so the odds of doing this by default/at random was 2^-10000:1. But this doesn’t contain the uncertainty that this model is wrong, which would make the within-model uncertainty a rounding error.

According to this model there is effectively no chance of alignment by default, but this model could be wrong.

Replies from: ricraz

↑ comment by Richard_Ngo (ricraz) · 2023-12-20T09:17:51.768Z · LW(p) · GW(p)

If Ronnie had said "there is one half-baked heuristic that claims that the probability is 2^-10000" then I would be sympathetic. That seems very different to what he said, though. In some sense my objection is precisely to people giving half-baked heuristics and intuitions an amount of decision-making weight far disproportionate to their quality, by calling them "models" and claiming that the resulting credences should be taken seriously.

Replies from: habryka4

↑ comment by habryka (habryka4) · 2023-12-20T12:26:38.275Z · LW(p) · GW(p)

I think that would be a more on-point objection to make on a single-author post, but this is a chat log between two people, optimized to communicate to each other, and as such generally comes with fewer caveats and taking a bunch of implicit things for granted (this makes it well-suited for some kind of communication, and not others). I like it in that it helps me get a much better sense of a bunch of underlying intuitions and hunches that are often hard to formalize and so rarely make it into posts, but I also think it is sometimes frustrating because it's not optimized to be responded to.

I would take bets that Ronny's position was always something closer to "I had this robust-seeming inside-view argument that claimed the probability was extremely low, though of course my outside view and different levels of uncertainty caused my betting odds to be quite different".

↑ comment by Ronny Fernandez (ronny-fernandez) · 2023-12-22T11:22:21.978Z · LW(p) · GW(p)

I meant the reasonable thing other people knew I meant and not the deranged thing you thought I might've meant.

Replies from: ronny-fernandez

↑ comment by Ronny Fernandez (ronny-fernandez) · 2023-12-22T20:04:33.504Z · LW(p) · GW(p)

In retrospect I think the above was insufficiently cooperative. Sorry,

↑ comment by [deleted] · 2023-12-20T16:13:12.779Z · LW(p) · GW(p)

Replies from: lahwran, None

↑ comment by the gears to ascension (lahwran) · 2023-12-21T06:41:24.264Z · LW(p) · GW(p)

I don't see why it should rule out high probability of doom for some folks who present themselves as having good epistemics to actually be quite bad at picking up new models and stuck in an old, limiting paradigm, refusing to investigate new things properly because of believing themselves to already know. It certainly does weaken appeals to their authority, but the reasoning stands on its own, to the degree it's actually specified using valid and relevant formal claims.

↑ comment by [deleted] · 2023-12-20T23:48:06.338Z · LW(p) · GW(p)

↑ comment by Ronny Fernandez (ronny-fernandez) · 2023-12-22T11:22:07.324Z · LW(p) · GW(p)

comment by Ronny Fernandez (ronny-fernandez) · 2023-12-22T11:29:00.924Z · LW(p) · GW(p)

To be clear, I did not think we were discussing the AI optimist post. I don't think Nate thought that. I thought we were discussing reasons I changed my mind a fair bit after talking to Quintin.

comment by Garrett Baker (D0TheMath) · 2023-12-20T17:01:42.324Z · LW(p) · GW(p)

seems kinda hard to make something formal to me because the basic argument is, i think, "there's really a lot of ways for a model to do well in training", but i don't know how one is supposed to formalize that. i guess i'm curious where you think the force of formality comes in for the analogous argument when it comes to python programs

This may not be easily formalizable, but this does seem easily testable? Like, whats wrong with just training a bunch of different models, and seeing if they have similar generalization properties? If they're radically different, then there's many ways of doing well in training. If they're pretty similar, then there's very few ways of doing well in training.

comment by Steven Byrnes (steve2152) · 2023-12-20T19:09:46.317Z · LW(p) · GW(p)

I think evolution-of-humans is kinda like taking a model-based RL algorithm (for within-lifetime learning), and doing a massive outer-loop search over neural architectures, hyperparameters, and also reward functions. In principle (though IMO almost definitely not in practice [LW · GW]), humans could likewise do that kind of grand outer-loop search over RL algorithms, and get AGI that way. And if they did, I strongly expect that the resulting AGI would have a “curiosity” term in its reward function, as I think humans do. After all, a curiosity reward-function term is already sometimes used in today’s RL, e.g. the Montezuma’s Revenge literature, and it’s not terribly complicated, and it’s useful, and I think innate-curiosity-drive exists not only in humans but also in much much simpler animals. Maybe there’s more than one way to implement curiosity-drive in detail, but something in that category seems pretty essential for an RL algorithm to train successfully in a complex environment, and I don’t think I’m just over-indexing on what’s familiar.

Again, this is all pretty irrelevant on my models because I don’t expect that people will program AGI by doing a blind outer-loop search over RL reward functions. Rather, I expect that people will write down the RL reward function for AGI in the form of handwritten source code, and that they will put curiosity-drive into that reward function source code (as they already sometimes do), because they will find that it’s essential for capabilities.

Separately, insofar as curiosity-drive is essential for capabilities (as I believe), it doesn’t help alignment, but rather hurts alignment, because it’s bad if an AI wants to satisfy its own curiosity at the expense of things that humans directly value. Hopefully that’s obvious to everyone here, right? Parts of the discussion seemed to be portraying AIs-that-are-curious as a good thing rather than a bad thing, which was confusing to me. I assume I was just failing to follow all the unspoken context?

Replies from: lahwran

↑ comment by the gears to ascension (lahwran) · 2023-12-21T01:15:35.268Z · LW(p) · GW(p)

maintaining uncertainty about the true meaning of an objective is important, but there's a difference between curiosity about the true values one holds, intrinsic curiosity as a component of a value system, and instrumental curiosity as a consequence of an uncertain planning system. I'm surprised to see disagree from MiguelDev and Noosphere, could either of you expand on what you disagree with?

Replies from: whitehatStoic

↑ comment by MiguelDev (whitehatStoic) · 2023-12-21T01:58:36.221Z · LW(p) · GW(p)

@the gears to ascension [LW · GW] Hello! I just think curiosity is a low level attribute that allows a reaction and it maybe good or bad all things considered, with this regard curiosity (or studying curiosity) may help in alignment as well.

For example, an AI is in a situation that it needs to save someone from a burning house, it should be curious enough to consider all possible options available and eventually if it is aligned - it will choose the actions that will result to good outcomes (after also studying all the bad options). That is why I don't agree with the idea that it purely hurts alignment as described in the comment.

(I think Nate and Ronny shares important knowledge in this dialogue - low level forces (birthed by evolution) that I think is misunderstood by many.)

Replies from: steve2152

↑ comment by Steven Byrnes (steve2152) · 2023-12-21T02:57:55.474Z · LW(p) · GW(p)

Your example is about capabilities (assuming the AI is trying to save me from the fire, will it succeed?) but I was talking about alignment (is the AI trying to save me from the fire in the first place?)

I don’t want the AI to say “On the one hand, I care about Steve’s welfare. On the other hand, man I’m just really curious how people behave when they’re on fire. Like, what do they say? What do they do? So, I feel torn—should I save Steve from the fire or not? Hmm…”

(I agree that, if an AI is aligned, and if it is trying to save me from a burning house, then I would rather that the AI be more capable rather than less capable—i.e., I want the AI to come up with and execute very very good plans.)

Even if an AI decides human flourishing is briefly interesting, after a while it will already know lots of things about human flourishing and want to learn something else instead. Scientists have occasionally made colonies of extremely happy well-adjusted rats to see what would happen. But then they learned what happened, and switched back to things like testing how long rats would struggle against their inevitable deaths if you left them to drown in locked containers.

As for capabilities, I think curiosity drive is probably essential during early RL training. Once the AI is sufficiently intelligent (including in metacognitive / self-reflective ways), it’s plausible that we could turn curiosity drive off without harming capabilities. After all, it’s possible for an AI to “consider all possible options” not because it’s curious, but rather because it wants me to not die in the fire, and it’s smart enough to know that “considering all possible options” is a very effective means-to-an-end for preventing me from dying in the fire.

Humans can do that too. We don’t only seek information because we’re curious; we can also do it as a means to an end. For example, sometimes I have really wanted to do something, and so then I read an mind-numbingly-boring book that I expect might help me do that thing. Curiosity is not driving me to read the book; on the contrary, curiosity is pushing me away from the book with all its might, because anything else on earth would be more inherently interesting than this boring book. But I read the book anyway, because I really want to do the thing, and I know that reading the book will help. I think an AI which is maximally beneficial to humans would have a similar kind of motivation. Yes it would often brainstorm, and ponder, and explore, and seek information, etc., but it would do all those things not because they are inherently rewarding, but rather because it knows that doing those things is probably useful for what it really wants at the end of the day, which is to benefit humans.

Replies from: whitehatStoic

↑ comment by MiguelDev (whitehatStoic) · 2023-12-21T03:30:19.395Z · LW(p) · GW(p)

Once the AI is sufficiently intelligent (including in metacognitive / self-reflective ways), it’s plausible that we could turn curiosity drive off without harming capabilities. After all, it’s possible for an AI to “consider all possible options” not because it’s curious, but rather because it wants me to not die in the fire, and it’s smart enough to know that “considering all possible options” is a very effective means-to-an-end for preventing me from dying in the fire.

Interesting view but I have to point out that situations change and there will be many tiny details that will become like a back and forth discussion inside the AI's network as it performs its tasks and turning off curiosity will most likely end up in the worst outcomes as it my not be able to update its decisions (eg. oops didn't saw there was a fire hose available or oops I didn't felt the heat of the floor earlier).

Replies from: steve2152

↑ comment by Steven Byrnes (steve2152) · 2023-12-21T13:24:28.735Z · LW(p) · GW(p)

Person A: If you’re going to program a chess-playing agent, it needs a direct innate intrinsic drive to not lose its queen.
Person B: Nah, if losing one’s queen is generally bad, it can learn that fact from experience, or from thinking through the likely consequences in any particular case.
Person A: No, that’s not good enough. Protecting the queen is really important. Maybe the AI will learn from experience to not lose its queen in some situations, but situations change and then it will not be motivated to protect its queen sufficiently.

Obviously, Person B is correct here, because AlphaZero-chess works well.

To my ears, your claim (that an AI without intrinsic drive to satisfy curiosity cannot learn to update its decisions) is analogous to Person A’s claim (that an AI without intrinsic drive to protect its queen cannot learn to do so).

In other words, if it’s obvious to you that the AI is insufficiently updating its decisions, it would be obvious to the AI as well (once the AI is sufficiently smart and self-aware). And then the AI can correct for that.

Replies from: whitehatStoic

↑ comment by MiguelDev (whitehatStoic) · 2023-12-22T02:48:20.465Z · LW(p) · GW(p)

Thanks for explaining your views and this had helped me deconfuse myself, when I was replying and thinking: I am now drawing lines wherein curiosity and self-awareness overlaps also making me feel the expansive nature of studying the theoretical alignment, it's very dense and it's so easy to drown in information - this discussion made me feel a whack of a baseball bat then survived to write this comment. Moreover, how to get to Person B still requires knowledge of curiosity and its mechanisms, I still err on the side of finding out how it works^[1] or gets imbued to intelligent systems (us and AI) - for me this is very relevant to alignment work.

^{^}
I'm speculating a simplified evolutionary cognitive chain in humans: curiosity + survival instincts (including hunger) → intelligence → self-awareness → rationality.

comment by Signer · 2023-12-20T08:39:48.329Z · LW(p) · GW(p)

you can argue all you want that any flying device will have to flap its wings, and that won’t constrain airplane designs

You can argue all you want that any thinking device will have to reflect on its thoughts, and that won’t constrain mind designs.

the prior is still really wide, so wide that a counting argument still more-or-less works

And it also works for arguing that GPT3 won't happen - there are more hacks that give you low loss than there useful to humans hacks that give you low loss.

so does your whole sense of difference go out the window if we do something autogpt-ish?

I think it should be analyzed separately, but intuitively if your gpt never thinks of killing humans, it should be less likely that the plans with these thoughts would result in killing humans.

comment by Thane Ruthenis · 2023-12-20T12:09:01.174Z · LW(p) · GW(p)

at this juncture i interpret the shard theory folk as arguing something like "well the shards that humans build their values up around are very proximal to minds

In the spirit of pointing out subtle things that seem wrong: My understanding of the ST position is that shards are values. There's no "building values around" shards; the idea is that shards are what implements values and values are implemented as shards.

At least, I'm pretty sure that's what the position was a ~year ago, and I've seen no indications the ST folk moved from that view.

most humans (with fully-functioning brains) have in some sense absorbed sufficiently similar values and reflective machinery that they converge to roughly the same place

The way I would put it is "it's plausible that there is an utility function such that the world-state maximizing it is ranked as very high by the standards of most humans' preferences, and we could get that utility function by agglomerating and abstracting over individual humans' values".

Like, if Person A loves seafood and hates pizza, and Person B loves pizza and hates seafood, then no, agglomerating these individual people's preferences into Utility Function A and Utility Function B won't result in the same utility function (and more so for more important political/philosophical stuff). But if we abstract up from there, we get "people like to eat tasty-according-to-them food", and then a world in which both A and B are allowed to do that would rank high by the preferences of both of them.

Similarly, it seems plausible that somewhere up there at the highest abstraction levels [LW(p) · GW(p)], most humans' preferences (stripped of individual nuance on their way up) converge towards the same "maximize eudaimonia" utility function, whose satisfaction would make ~all of us happy. (And since it's highly abstract, its maximal state would be defined over an enormous equivalence class of world-states. So it won't be a universe frozen in a single moment of time, or tiled with people with specific preferences, or anything like that.)

comment by Seth Herd · 2023-12-21T02:13:35.852Z · LW(p) · GW(p)

I was excited to read this, because Nate is a clear writer and a clear thinker, who has a high p(doom) for reasons I don't entirely understand. This did pay off for me in a brief statement that clarified some of his reasons I hadn't understood:

Nate said

this is part of what i mean by "i don't think alignment is all that hard"

my high expectation of doom comes from a sense that there's lots of hurdles and that humanity will flub at least one (and probably lots)

I find this disturbingly compelling. I hadn't known Nate thought alignment might be fairly easy. Given that, his pessimism is more relevant to me, since I'm pretty sure alignment is do-able even in the near future.

I'm afraid I found the rest of this convoluted and to make little progress on a contentful discussion.

Let me try to summarize the post in case it's helpful. None of these are direct quotes

Nate: I think alignment by default is highly unlikely Ronny: I think alignment by default is highly unlikely (this somehow took most of the conversation) Ronny: But we won't do alignment by default. We'll do it with RL. Sometimes, when I talk to Quintin, I think we might get working alignment by doing RL and pointing the system at lots of stuff we want it to do. It might reproduce human values accurately enough to do that. Nate: There are a lot of ways to get anything done. So telling it what you want it to do is probably not going to make it generalize well or actually value the things you value. Ronny: I agree, but I don't have a strong argument for it. ...

So in sum I didn't see any strong argument for it beyond the "lots of ways to get things done, so a value match is unlikely".

Like Rob and Nate, my intuition says that's unlikely to work.

The number of ways to get things done is substantially constrained if the system is somehow trained to use human concepts and thinking patterns. So maybe that's the source of optimism for Quintin and the Shard Theorists? Training on language does seem to substantially constrain a model to use human-like concepts.

I think the bulk of the disagreement is deeper and vaguer. One point of vague disagreement seems to be something like: Theory suggests that alignment is hard. Empirical data (mostly from LLMs) suggests that it's easy to make AI do what you want. Which do you believe?

Fortunately, I don't think RL alignment is our only or best option, so I'm not hugely invested in the disagreement as it stands, because both perspectives are primarily thinking about RL alignment. I think We have promising alignment plans with low taxes [AF · GW]

I think they're promising because they're completely different than RL approaches. More on that in an upcoming post.

comment by Noosphere89 (sharmake-farah) · 2023-12-21T04:26:19.656Z · LW(p) · GW(p)

Ronny and Nate discuss what sorts of minds humanity is likely to find by Machine Learning

Contents

The basic (counting) argument

Evolution / Reflection Process is Path Dependent

Summary and discussion of training an agent in a simulation

Alignment problem probably fixable, but likely won't be fixed

Discussing whether this argument about training can be formalized

30 comments