Posts

When is a mind me? 2024-04-17T05:56:38.482Z
AI Views Snapshots 2023-12-13T00:45:50.016Z
An artificially structured argument for expecting AGI ruin 2023-05-07T21:52:54.421Z
AGI ruin mostly rests on strong claims about alignment and deployment, not about society 2023-04-24T13:06:02.255Z
The basic reasons I expect AGI ruin 2023-04-18T03:37:01.496Z
Four mindset disagreements behind existential risk disagreements in ML 2023-04-11T04:53:48.427Z
Yudkowsky on AGI risk on the Bankless podcast 2023-03-13T00:42:22.694Z
Elements of Rationalist Discourse 2023-02-12T07:58:42.479Z
Thoughts on AGI organizations and capabilities work 2022-12-07T19:46:04.004Z
A challenge for AGI organizations, and a challenge for readers 2022-12-01T23:11:44.279Z
A common failure for foxes 2022-10-14T22:50:59.614Z
ITT-passing and civility are good; "charity" is bad; steelmanning is niche 2022-07-05T00:15:36.308Z
The inordinately slow spread of good AGI conversations in ML 2022-06-21T16:09:57.859Z
On saving one's world 2022-05-17T19:53:58.192Z
Late 2021 MIRI Conversations: AMA / Discussion 2022-02-28T20:03:05.318Z
Animal welfare EA and personal dietary options 2022-01-05T18:53:02.157Z
Some abstract, non-technical reasons to be non-maximally-pessimistic about AI alignment 2021-12-12T02:08:08.798Z
Conversation on technology forecasting and gradualism 2021-12-09T21:23:21.187Z
Leaving Orbit 2021-12-06T21:48:41.371Z
Discussion with Eliezer Yudkowsky on AGI interventions 2021-11-11T03:01:11.208Z
Excerpts from Veyne's "Did the Greeks Believe in Their Myths?" 2021-11-08T20:23:25.271Z
Transcript for Geoff Anders and Anna Salamon's Oct. 23 conversation 2021-11-08T02:19:04.189Z
2020 PhilPapers Survey Results 2021-11-02T05:00:13.859Z
Nate Soares on the Ultimate Newcomb's Problem 2021-10-31T19:42:01.353Z
Quick general thoughts on suffering and consciousness 2021-10-30T18:05:59.612Z
MIRI/OP exchange about decision theory 2021-08-25T22:44:10.389Z
COVID/Delta advice I'm currently giving to friends 2021-08-24T03:46:09.053Z
Outline of Galef's "Scout Mindset" 2021-08-10T00:16:59.050Z
Garrabrant and Shah on human modeling in AGI 2021-08-04T04:35:11.225Z
Finite Factored Sets: LW transcript with running commentary 2021-06-27T16:02:06.063Z
"Existential risk from AI" survey results 2021-06-01T20:02:05.688Z
Predict responses to the "existential risk from AI" survey 2021-05-28T01:32:18.059Z
Sabien on "work-life" balance 2021-05-20T18:33:37.981Z
MIRI location optimization (and related topics) discussion 2021-05-08T23:12:02.476Z
Scott Alexander 2021 predictions: calibration and updating exercise 2021-04-29T19:15:39.463Z
Thiel on secrets and indefiniteness 2021-04-20T21:59:35.792Z
2012 Robin Hanson comment on “Intelligence Explosion: Evidence and Import” 2021-04-02T16:26:51.725Z
Logan Strohl on exercise norms 2021-03-30T04:28:22.331Z
Julia Galef and Matt Yglesias on bioethics and "ethics expertise" 2021-03-30T03:06:07.323Z
Thirty-three randomly selected bioethics papers 2021-03-22T21:38:08.281Z
Politics is way too meta 2021-03-17T07:04:42.187Z
Deflationism isn't the solution to philosophy's woes 2021-03-10T00:20:07.357Z
What I'd change about different philosophy fields 2021-03-08T18:25:30.165Z
MIRI comments on Cotra's "Case for Aligning Narrowly Superhuman Models" 2021-03-05T23:43:54.186Z
Utilitarian doppelgangers vs. making everything smell like bananas 2021-02-20T23:57:34.724Z
MIRI: 2020 Updates and Strategy 2020-12-23T21:27:39.206Z
Cartesian Frames Definitions 2020-11-08T12:44:34.509Z
"Cartesian Frames" Talk #2 this Sunday at 2pm (PT) 2020-10-28T13:59:20.991Z
Updates and additions to "Embedded Agency" 2020-08-29T04:22:25.556Z
DontDoxScottAlexander.com - A Petition 2020-06-25T05:44:50.050Z

Comments

Comment by Rob Bensinger (RobbBB) on Paul Christiano named as US AI Safety Institute Head of AI Safety · 2024-04-19T00:00:37.122Z · LW · GW

Sounds like a lot of political alliances! (And "these two political actors are aligned" is arguably an even weaker condition than "these two political actors are allies".)

At the end of the day, of course, all of these analogies are going to be flawed. AI is genuinely a different beast.

Comment by Rob Bensinger (RobbBB) on Paul Christiano named as US AI Safety Institute Head of AI Safety · 2024-04-18T22:53:26.766Z · LW · GW

It's pretty sad to call all of these end states you describe "alignment", as "alignment" is an extremely natural word for "actually terminally has good intentions".

Aren't there a lot of clearer words for this? "Well-intentioned", "nice", "benevolent", etc.

(And a lot of terms, like "value loading" and "value learning", that are pointing at the research project of getting good intentions into the AI.)

To my ear, "aligned person" sounds less like "this person wishes the best for me", and more like "this person will behave in the right ways".

If I hear that Russia and China are "aligned", I do assume that their intentions play a big role in that, but I also assume that their circumstances, capabilities, etc. matter too. Alignment in geopolitics can be temporary or situational, and it almost never means that Russia cares about China as much as China cares about itself, or vice versa.

And if we step back from the human realm, an engineered system can be "aligned" in contexts that have nothing to do with goal-oriented behavior, but are just about ensuring components are in the right place relative to each other.

Cf. the history of the term "AI alignment". From my perspective, a big part of why MIRI coordinated with Stuart Russell to introduce the term "AI alignment" was that we wanted to switch away from "Friendly AI" to a term that sounded more neutral. "Friendly AI research" had always been intended to subsume the full technical problem of making powerful AI systems safe and aimable; but emphasizing "Friendliness" made it sound like the problem was purely about value loading, so a more generic and low-content word seemed desirable.

But Stuart Russell (and later Paul Christiano) had a different vision in mind for what they wanted "alignment" to be, and MIRI apparently failed to communicate and coordinate with Russell to avoid a namespace collision. So we ended up with a messy patchwork of different definitions.

I've basically given up on trying to achieve uniformity on what "AI alignment" is; the best we can do, I think, is clarify whether we're talking about "intent alignment" vs. "outcome alignment" when the distinction matters.

But I do want to push back against those who think outcome alignment is just an unhelpful concept — on the contrary, if we didn't have a word for this idea I think it would be very important to invent one. 

IMO it matters more that we keep our eye on the ball (i.e., think about the actual outcomes we want and keep researchers' focus on how to achieve those outcomes) than that we define an extremely crisp, easily-packaged technical concept (that is at best a loose proxy for what we actually want). Especially now that ASI seems nearer at hand (so the need for this "keep our eye on the ball" skill is becoming less and less theoretical), and especially now that ASI disaster concerns have hit the mainstream (so the need to "sell" AI risk has diminished somewhat, and the need to direct research talent at the most important problems has increased).

And I also want to push back against the idea that a priori, before we got stuck with the current terminology mess, it should have been obvious that "alignment" is about AI systems' goals and/or intentions, rather than about their behavior or overall designs. I think intent alignment took off because Stuart Russell and Paul Christiano advocated for that usage and encouraged others to use the term that way, not because this was the only option available to us.

Comment by Rob Bensinger (RobbBB) on When is a mind me? · 2024-04-18T21:51:23.962Z · LW · GW

"Should" in order to achieve a certain end? To meet some criterion? To boost a term in your utility function?

In the OP: "Should" in order to have more accurate beliefs/expectations. E.g., I should anticipate (with high probability) that the Sun will rise tomorrow in my part of the world, rather than it remaining night.

Comment by Rob Bensinger (RobbBB) on When is a mind me? · 2024-04-18T21:46:58.831Z · LW · GW

Why would the laws of physics conspire to vindicate a random human intuition that arose for unrelated reasons?

We do agree that the intuition arose for unrelated reasons, right? There's nothing in our evolutionary history, and no empirical observation, that causally connects the mechanism you're positing and the widespread human hunch "you can't copy me".

If the intuition is right, we agree that it's only right by coincidence. So why are we desperately searching for ways to try to make the intuition right?

It also doesn't force us to believe that a bunch of water pipes or gears functioning as a classical computer can ever have our own first person experience.

Why is this an advantage of a theory? Are you under the misapprehension that "hypothesis H allows humans to hold on to assumption A" is a Bayesian update in favor of H even when we already know that humans had no reason to believe A? This is another case where your theory seems to require that we only be coincidentally correct about A ("sufficiently complex arrangements of water pipes can't ever be conscious"), if we're correct about A at all.

One way to rescue this argument is by adding in an anthropic claim, like: "If water pipes could be conscious, then nearly all conscious minds would be instantiated in random dust clouds and the like, not in biological brains. So given that we're not Boltzmann brains briefly coalescing from space dust, we should update that giant clouds of space dust can't be conscious."

But is this argument actually correct? There's an awful lot of complex machinery in a human brain. (And the same anthropic argument seems to suggest that some of the human-specific machinery is essential, else we'd expect to be some far-more-numerous observer, like an insect.) Is it actually that common for a random brew of space dust to coalesce into exactly the right shape, even briefly?

Comment by Rob Bensinger (RobbBB) on When is a mind me? · 2024-04-18T16:10:59.017Z · LW · GW

Yeah, at some point we'll need a proper theory of consciousness regardless, since many humans will want to radically self-improve and it's important to know which cognitive enhancements preserve consciousness.

Comment by Rob Bensinger (RobbBB) on When is a mind me? · 2024-04-18T16:05:11.119Z · LW · GW

You can easily clear this confusion if you rephrase it as "You should anticipate having any of these experiences". Then it's immediately clear that we are talking about two separate screens.

This introduces some other ambiguities. E.g., "you should anticipate having any of these experiences" may make it sound like you have a choice as to which experience to rationally expect.

And it's also clear that our curiosity isn't actually satisfied. That the question "which one of these two will actually be the case" is still very much on the table.

... And the answer is "both of these will actually be the case (but not in a split-screen sort of way)".

Your rephrase hasn't shown that there was a question left unanswered in the original post; it's just shown that there isn't a super short way to crisply express what happens in English: you do actually have to add the clarification.

Still, as soon as we get Rob-y and Rob-z, they are not "metaphysically the same person". When Rob-y says "I" he is referring to Rob-y, not Rob-z, and vice versa. More specifically, Rob-y is referring to some causal curve through time and Rob-z is referring to another causal curve through time. These two curves are the same to some point, but then they are not.

Yep, I think this is a perfectly fine way to think about the thing.

Comment by Rob Bensinger (RobbBB) on When is a mind me? · 2024-04-18T02:42:54.501Z · LW · GW

My first issue with your post is that this initial ontological assumption is neither mentioned explicitly nor motivated. Nothing in your post can be used as proof of this initial assumption.

There are always going to be many different ways someone could object to a view. If you were a Christian, you'd perhaps be objecting that the existence of incorporeal God-given Souls is the real crux of the matter, and if I were intellectually honest I'd be devoting the first half of the post to arguing against the Christian Soul.

Rather than trying to anticipate these objections, I'd rather just hear them stated out loud by their proponents and then hash them out in the comments. This also makes the post less boring for the sorts of people who are most likely to be on LW: physicalists and their ilk.

Now, what would be the experience of getting copied, seen from a first-person, "internal", perspective? I am pretty sure it would be something like: you walk into the room, you sit there, you hear, say, the scanner working for some time, it stops, you walk out. From my agnostic perspective, if I were the one to be scanned, it seems like nothing special would have happened to me in this procedure. I didn't feel anything weird, I didn't feel my "consciousness split into two" or something.

Why do you assume that you wouldn't experience the copy's version of events?

The un-copied version of you experiences walking into the room, sitting there, hearing the scanner working, and hearing it stop; then that version of you experiences walking out. It seems like nothing special happened in this procedure; this version of you doesn't feel anything weird, and doesn't feel like their "consciousness split into two" or anything.

The copied version of you experiences walking into the room, sitting there, hearing the scanner working, and then an instantaneous experience of (let's say) feeling like you've been teleported into another room -- you're now inside the simulation. Assuming the simulation feels like a normal room, it could well seem like nothing special happened in this procedure -- it may feel like blinking and seeing the room suddenly change during the blink, while you yourself remain unchanged. This version of you doesn't necessarily feel anything weird either, and they don't feel like their "consciousness split into two" or anything.

It's a bit weird that there are two futures, here, but only one past -- that the first part of the story is the same for both versions of you. But so it goes; that just comes with the territory of copying people.

If you disagree with anything I've said above, what do you disagree with? And, again, what do you mean by saying you're "pretty sure" that you would experience the future of the non-copied version?

Namely, if I consider this procedure as an empirical experiment, from my first person perspective I don't get any new / unexpected observation compared to, say, just sitting in an ordinary room. Even if I were to go and find my copy, my experience would again be like meeting a different person who just happens to look like me and who claims to have similar memories up to the point when I entered the copying room. There would be no way to verify or to view things from their first person perspective.

Sure. But is any of this Bayesian evidence against the view I've outlined above? What would it feel like, if the copy were another version of yourself? Would you expect that you could telepathically communicate with your copy and see things from both perspectives at once, if your copies were equally "you"? If so, why?

On the contrary, I would be wary to, say, kill myself or to be destroyed after the copying procedure, since no change will have occurred to my first person perspective, and it would thus seem less likely that my "experience" would somehow survive because of my copy.

Shall we make a million copies and then take a vote? :)

I agree that "I made a non-destructive software copy of myself and then experienced the future of my physical self rather than the future of my digital copy" is nonzero Bayesian evidence that physical brains have a Cartesian Soul that is responsible for the brain's phenomenal consciousness; the Cartesian Soul hypothesis does predict that data. But the prior probability of Cartesian Souls is low enough that I don't think it should matter.

You need some prior reason to believe in this Soul in the first place; the same as if you flipped a coin, it came up heads, and you said "aha, this is perfectly predicted by the existence of an invisible leprechaun who wanted that coin to come up heads!". Losing a coinflip isn't a surprising enough outcome to overcome the prior against invisible leprechauns.

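As a minimal sketch of the arithmetic here (the specific numbers below are invented purely for illustration, not estimates of anything), a hypothesis that "predicts" the observed data with certainty can still end up with a negligible posterior if its prior is small enough:

```python
# Toy Bayes'-rule calculation. All numbers are illustrative assumptions.

def posterior(prior_h, p_data_given_h, p_data_given_not_h):
    """P(H | data) for a binary hypothesis H, via Bayes' rule."""
    p_data = prior_h * p_data_given_h + (1 - prior_h) * p_data_given_not_h
    return prior_h * p_data_given_h / p_data

# "An invisible leprechaun wanted the coin to come up heads": it predicts the
# observed heads with probability 1, but starts from a tiny prior.
print(posterior(prior_h=1e-12, p_data_given_h=1.0, p_data_given_not_h=0.5))
# ~2e-12 -- a perfect "prediction" of a 50/50 event barely moves the needle.
```
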
and it would also force me to accept that even a copy where the "circuit" is made of water pipes and pumps, or gears and levers, also has an actual, first person experience as "me", as long as the appropriate computations are being carried out.

Why wouldn't it? What do you have against water pipes?

Comment by Rob Bensinger (RobbBB) on When is a mind me? · 2024-04-18T02:04:44.332Z · LW · GW

Wouldn't it follow that in the same way you anticipate the future experiences of the brain that you "find yourself in" (i.e. the person reading this) you should anticipate all experiences, i.e. that all brain states occur with the same kind of me-ness/vivid immediacy?

What's the empirical or physical content of this belief?

I worry that this may be another case of the Cartesian Ghost rearing its ugly head. We notice that there's no physical thingie that makes the Ghost more connected to one experience or the other; so rather than exorcising the Ghost entirely, we imagine that the Ghost is connected to every experience simultaneously.

But in fact there is no Ghost. There's just a bunch of experience-moments implemented in brain-moments.

Some of those brain-moments resemble other brain-moments, either by coincidence or because of some (direct or indirect) causal link between the brain-moments. When we talk about Brain-1 "anticipating" or "becoming" a future brain-state Brain-2, we normally mean things like:

  • There's a lawful physical connection between Brain-1 and Brain-2, such that the choices and experiences of Brain-1 influence the state of Brain-2 in a bunch of specific ways.
  • Brain-2 retains ~all of the memories, personality traits, goals, etc. of Brain-1.
  • If Brain-2 is a direct successor to Brain-1, then typically Brain-2 can remember a bunch of things about the experience Brain-1 was undergoing.

These are all fuzzy, high-level properties, which admit of edge cases. But I'm not seeing what's gained by therefore concluding "I should anticipate every experience, even ones that have no causal connection to mine and no shared memories and no shared personality traits". Tables are a fuzzy and high-level concept, but that doesn't mean that every object in existence is a table. It doesn't even mean that every object is slightly table-ish. A photon isn't "slightly table-ish", it's just plain not a table.

Which just means, all brain states exist in the same vivid, for-me way, since there is nothing further to distinguish between them that makes them this vivid, i.e. they all exist HERE-NOW.

But they don't have the anticipation-related properties I listed above; so what hypotheses are we distinguishing by updating from "these experiences aren't mine" to "these experiences are mine"?

Maybe the update that's happening is something like: "Previously it felt to me like other people's experiences weren't fully real. I was unduly selfish and self-centered, because my experiences seemed to me like they were the center of the universe; I abstractly and theoretically knew that other people have their own point of view, but that fact didn't really hit home for me. Then something happened, and I had a sudden realization that no, it's all real."

If so, then that seems totally fine to me. But I worry that the view in question might instead be something tacitly Cartesian, insofar as it's trying to say "all experiences are for me" -- something that doesn't make a lot of sense to say if there are two brain states on opposite sides of the universe with nothing in common and nothing connecting them, but that does make sense if there's a Ghost the experiences are all "for".

Comment by Rob Bensinger (RobbBB) on When is a mind me? · 2024-04-18T01:07:27.876Z · LW · GW

As a test, I asked a non-philosopher friend of mine what their view is. Here's a transcript of our short conversation: https://docs.google.com/document/d/1s1HOhrWrcYQ5S187vmpfzZcBfolYFIbeTYgqeebNIA0/edit 

I was a bit annoyingly repetitive with trying to confirm and re-confirm what their view is, but I think it's clear from the exchange that my interpretation is correct at least for this person.

Comment by Rob Bensinger (RobbBB) on When is a mind me? · 2024-04-18T00:44:26.907Z · LW · GW

Is there even anybody claiming there is an experiential difference?

Yep! Ask someone with this view whether the current stream of consciousness continues from their pre-uploaded self to their post-uploaded self, like it continues when they pass through a doorway. The typical claim is some version of "this stream of consciousness will end, what comes next is only oblivion", not "oh sure, the stream of consciousness is going to continue in the same way it always does, but I prefer not to use the English word 'me' to refer to the later parts of that stream of consciousness".

This is why the disagreement here has policy implications: people with different views of personal identity have different beliefs about the desirability of mind uploading. They aren't just disagreeing about how to use words, and if they were, you'd be forced into the equally "uncharitable" perspective that someone here is very confused about how relevant word choice is to the desirability of uploading.

The alternative to this is that there is a disagreement about the appropriate semantic interpretation/analysis of the question. E.g. about what we mean when we say "I will (not) experience such and such". That seems more charitable than hypothesizing beliefs in "ghosts" or "magic".

I didn't say that the relevant people endorse a belief in ghosts or magic. (Some may do so, but many explicitly don't!)

It's a bit darkly funny that you've reached for a clearly false and super-uncharitable interpretation of what I said, in the same sentence you're chastising me for being uncharitable! But also, "charity" is a bad approach to trying to understand other people, and bad epistemology can get in the way of a lot of stuff.

Comment by Rob Bensinger (RobbBB) on When is a mind me? · 2024-04-17T20:28:54.036Z · LW · GW

The problem was that you first seemed to belittle questions about word meanings ("self") as being "just" about "definitions" that are "purely verbal".

I did no such thing!

Luckily now you concede that the question about the meaning of "I" isn't just about (arbitrary) "definitions"

Read the blog post at the top of this page! It's my attempt to answer the question of when a mind is "me", and you'll notice it's not talking about definitions.

But we already know all the empirical facts: Someone goes into the teleporter, a bit later someone comes out at the other end and sees something. So the issue can only be about the semantic interpretation of that question, about what we mean with expressions like "I will see x".

Nope!

There are two perspectives here:

  1. "I don't want to upload myself, because I wouldn't get to experience that uploads' experiences. When I die, this stream of consciousness will end, rather than continuing in another body. Physically dying and then being being copied elsewhere is not phenomenologically indistinguishable from stepping through a doorway."
  2. "I do want to upload myself, because I would get to experience that uploads' experiences. Physically dying and then being copied myself is phenomenologically indistinguishable from stepping through a doorway."

The disagreement between these two perspectives isn't about word definitions at all; a fear that "when my body dies, there will be nothing but oblivion" is a very real fear about anticipated experiences (and anticipated absences of experience), not a verbal quibble about how we ought to define a specific word.

But it's also a bit confusing to call the disagreement between these two perspectives "empirical", because "empirical" here is conflating "third-person empirical" with "first-person empirical".

The disagreement here is about whether a stream of consciousness can "continue" across temporal and spatial gaps, in the same way that it continues when there are no obvious gaps. It's about whether there's a subjective, experiential difference between stepping through a doorway and using a teleporter.

The thing I'm arguing in the OP is that there can't be an experiential difference here, because there's no physical difference that could be underlying the supposed experiential difference. So the disagreement about the first-person facts, I claim, stems from a cognitive error, which I characterize as "making predictions as though you believed yourself to be a Cartesian Ghost (even if you don't on-reflection endorse the claim that Cartesian Ghosts exist)". This is, again, a very different error from "defining a word in a nonstandard way".

Comment by Rob Bensinger (RobbBB) on When is a mind me? · 2024-04-17T20:09:21.050Z · LW · GW

You're also free to define "I" however you want in your values.

Sort of!

  • It's true that no law of nature will stop you from using "I" in a nonstandard way; your head will not explode if you redefine "table" to mean "penguin".
  • And it's true that there are possible minds in abstract mindspace that have all sorts of values, including strict preferences about whether they want their brain to be made of silicon vs. carbon.
  • But it's not true that humans alive today have full and complete control over their own preferences.
  • And it's not true that humans can never be mistaken in their beliefs about their own preferences.

In the case of teleportation, I think teleportation-phobic people are mostly making an implicit error of the form "mistakenly modeling situations as though you are a Cartesian Ghost who is observing experiences from outside the universe", not making a mistake about what their preferences are per se. (Though once you realize that you're not a Cartesian Ghost, that will have some implications for what experiences you expect to see next in some cases, and implications for what physical world-states you prefer relative to other world-states.)

Comment by Rob Bensinger (RobbBB) on Paul Christiano named as US AI Safety Institute Head of AI Safety · 2024-04-17T18:54:41.932Z · LW · GW

FWIW, I typically use "alignment research" to mean "AI research aimed at making it possible to safely do ambitious things with sufficiently-capable AI" (with an emphasis on "safely"). So I'd include things like Chris Olah's interpretability research, even if the proximate impact of this is just "we understand what's going on better, so we may be more able to predict and finely control future systems" and the proximate impact is not "the AI is now less inclined to kill you".

Some examples: I wouldn't necessarily think of "figure out how we want to airgap the AI" as "alignment research", since it's less about designing the AI, shaping its mind, predicting and controlling it, etc., and more about designing the environment around the AI.

But I would think of things like "figure out how to make this AI system too socially-dumb to come up with ideas like 'maybe I should deceive my operators', while keeping it superhumanly smart at nanotech research" as central examples of "alignment research", even though it's about controlling capabilities ('make the AI dumb in this particular way') rather than about instilling a particular goal into the AI.

And I'd also think of "we know this AI is trying to kill us; let's figure out how to constrain its capabilities so that it keeps wanting that, but is too dumb to find a way to succeed in killing us, thereby forcing it to work with us rather than against us in order to achieve more of what it wants" as a pretty central example of alignment research, albeit not the sort of alignment research I feel optimistic about. The way I think about the field, you don't have to specifically attack the "is it trying to kill you?" part of the system in order to be doing alignment research; there are other paths, and alignment researchers should consider all of them and focus on results rather than marrying a specific methodology.

Comment by Rob Bensinger (RobbBB) on When is a mind me? · 2024-04-17T16:48:43.345Z · LW · GW

But that isn't an experience. It's two experiences. You will not have an experience of having two experiences. Two experiences will experience having been one person.

Sure; from my perspective, you're saying the same thing as me.

Are you going to care about 1000 different copies equally?

How am I supposed to choose between them?

Comment by Rob Bensinger (RobbBB) on When is a mind me? · 2024-04-17T16:44:41.588Z · LW · GW

Why? If "I" is arbitrary definition, then “When I step through this doorway, will I have another experience?" depends on this arbitrary definition and so is also arbitrary.

Which things count as "I" isn't an arbitrary definition; it's just a fuzzy natural-language concept.

(I guess you can call that "arbitrary" if you want, but then all the other words in the sentence, like "doorway" and "step", are also "arbitrary".)

Analogy: When you're writing in your personal diary, you're free to define "table" however you want. But in ordinary English-language discourse, if you call all penguins "tables" you'll just be wrong. And this fact isn't changed at all by the fact that "table" lacks a perfectly formal physics-level definition.

The same holds for "Will Rob Bensinger's next experience be of sitting in his bedroom writing a LessWrong comment, or will it be of him grabbing some tomatoes in a supermarket in Beijing?"

Terms like 'Rob Bensinger' and 'I' aren't perfectly physically crisp — there may be cases where the answer is "ehh, maybe?" rather than a clear yes or no. And if we live in a Big Universe and we allow that there can be many Beijings out there in space, then we'll have to give a more nuanced quantitative answer, like "a lot more of Rob's immediate futures are in his bedroom than in Beijing".

But if we restrict our attention to this Beijing, then all that complexity goes away and we can pretty much rule out that anyone in Beijing will happen to momentarily exhibit exactly the right brain state to look like "Rob Bensinger plus one time step".

The nuances and wrinkles don't bleed over and make it a totally meaningless or arbitrary question; and indeed, if I thought I were likely to spontaneously teleport to Beijing in the next minute, I'd rightly be making very different life-choices! "Will I experience myself spontaneously teleporting to Beijing in the next second?" is a substantive (and easy) question, not a deep philosophical riddle.

So you always anticipate all possible experiences, because of the multiverse?

Not all possible experiences; just all experiences of brains that have the same kinds of structural similarities to your current brain as, e.g., "me after I step through a doorway" has to "me before I stepped through the doorway".

Comment by Rob Bensinger (RobbBB) on "AI Alignment" is a Dangerously Overloaded Term · 2024-03-19T20:32:24.795Z · LW · GW

The problem is another way to phrase this is a superintelligent weapon system - "ending a risk period" by "reliably, and efficiently doing a small number of specific concrete tasks" means using physical force to impose your will on others.

The pivotal acts I usually think about actually don't route through physically messing with anyone else. I'm usually thinking about using aligned AGI to bootstrap to fast human whole-brain emulation, then using the ems to bootstrap to fully aligned CEV AI.

If someone pushes a "destroy the world" button then the ems or CEV AI would need to stop the world from being destroyed, but that won't necessarily happen if the developers have enough of a lead, if they get the job done quickly enough, and if CEV AI is able to persuade the world to step back from the precipice voluntarily (using superhumanly good persuasion that isn't mind-control-y, deceptive, or otherwise consent-violating). It's a big ask, but not as big as CEV itself, I expect.

From my current perspective this is all somewhat of a moot point, however, because I don't think alignment is tractable enough that humanity should be trying to use aligned AI to prevent human extinction. I think we should instead hit the brakes on AI and shift efforts toward human enhancement, until some future generation is in a better position to handle the alignment problem.

If and only if that fails it may be appropriate to consider less consensual options.

It's not clear to me that we disagree in any action-relevant way, since I also don't think AI-enabled pivotal acts are the best path forward anymore. I think the path forward is via international agreements banning dangerous tech, and technical research to improve humanity's ability to wield such tech someday.

That said, it's not clear to me how your "if that fails, then try X instead" works in practice. How do you know when it's failed? Isn't it likely to be too late by the time we're sure that we've failed on that front? Indeed, it's plausibly already too late for humanity to seriously pivot to 'aligned AGI'. If I thought humanity's last best scrap of hope for survival lay in an AI-empowered pivotal act, I'd certainly want more details on when it's OK to start trying to figure out how to have humanity not die via this last desperate path.

Comment by Rob Bensinger (RobbBB) on MIRI 2024 Mission and Strategy Update · 2024-01-06T01:39:29.697Z · LW · GW

To pick out a couple of specific examples from your list, Wei Dai:

14. Human-controlled AIs causing ethical disasters (e.g., large scale suffering that can't be "balanced out" later) prior to reaching moral/philosophical maturity

This is a serious long-term concern if we don't kill ourselves first, but it's not something I see as a premise for "the priority is for governments around the world to form an international agreement to halt AI progress". If AI were easy to use for concrete tasks like "build nanotechnology" but hard to use for things like CEV, I'd instead see the priority as "use AI to prevent anyone else from destroying the world with AI", and I wouldn't want to trade off probability of that plan working in exchange for (e.g.) more probability of the US and the EU agreeing in advance to centralize and monitor large computing clusters.

After someone has done a pivotal act like that, you might then want to move more slowly insofar as you're worried about subtle moral errors creeping in to precursors to CEV.

30. AI systems end up controlled by a group of humans representing a small range of human values (ie. an ideological or religious group that imposes values on everyone else)

I currently assign very low probability to humans being able to control the first ASI systems, and redirecting governments' attention away from "rogue AI" and toward "rogue humans using AI" seems very risky to me, insofar as it causes governments to misunderstand the situation, and to specifically misunderstand it in a way that encourages racing.

If you think rogue actors can use ASI to achieve their ends, then you should probably also think that you could use ASI to achieve your own ends; misuse risk tends to go hand-in-hand with "we're the Good Guys, let's try to outrace the Bad Guys so AI ends up in the right hands". This could maybe be justified if it were true, but when it's not even true it strikes me as an especially bad argument to make.

Comment by Rob Bensinger (RobbBB) on MIRI 2024 Mission and Strategy Update · 2024-01-06T01:05:17.532Z · LW · GW

Yep, before I saw orthonormal's response I had a draft-reply written that says almost literally the same thing:

we just call 'em like we see 'em

[...]

insofar as we make bad predictions, we should get penalized for it. and insofar as we think alignment difficulty is the crux for 'why we need to shut it all down', we'd rather directly argue against illusory alignment progress (and directly acknowledge real major alignment progress as a real reason to be less confident of shutdown as a strategy) rather than redirect to something less cruxy

I'll also add: Nate (unlike Eliezer, AFAIK?) hasn't flatly said 'alignment is extremely difficult'. Quoting from Nate's "sharp left turn" post:

Many people wrongly believe that I'm pessimistic because I think the alignment problem is extraordinarily difficult on a purely technical level. That's flatly false, and is pretty high up there on my list of least favorite misconceptions of my views.

I think the problem is a normal problem of mastering some scientific field, as humanity has done many times before. Maybe it's somewhat trickier, on account of (e.g.) intelligence being more complicated than, say, physics; maybe it's somewhat easier on account of how we have more introspective access to a working mind than we have to the low-level physical fields; but on the whole, I doubt it's all that qualitatively different than the sorts of summits humanity has surmounted before.

It's made trickier by the fact that we probably have to attain mastery of general intelligence before we spend a bunch of time working with general intelligences (on account of how we seem likely to kill ourselves by accident within a few years, once we have AGIs on hand, if no pivotal act occurs), but that alone is not enough to undermine my hope.

What undermines my hope is that nobody seems to be working on the hard bits, and I don't currently expect most people to become convinced that they need to solve those hard bits until it's too late.

So it may be that Nate's models would be less surprised by alignment breakthroughs than Eliezer's models. And some other MIRI folks are much more optimistic than Nate, FWIW.

My own view is that I don't feel nervous leaning on "we won't crack open alignment in time" as a premise, and absent that premise I'd indeed be much less gung-ho about government intervention.

why put all your argumentative eggs in the "alignment is hard" basket? (If you're right, then policymakers can't tell that you're right.)

The short answer is "we don't put all our eggs in the basket" (e.g., Eliezer's TED talk and TIME article emphasize that alignment is an open problem, but they emphasize other things too, and they don't go into detail on exactly how hard Eliezer thinks the problem is), plus "we very much want at least some eggs in that basket because it's true, it's honest, it's cruxy for us, etc." And it's easier for policymakers to acquire strong Bayesian evidence for "the problem is currently unsolved" and "there's no consensus about how to solve it" and "most leaders in the field seem to think there's a serious chance we won't solve it in time" than to acquire strong Bayesian evidence for "we're very likely generations away from solving alignment", so the difficulty of communicating the latter isn't a strong reason to de-emphasize all the former points.

The longer answer is a lot more complicated. We're still figuring out how best to communicate our views to different audiences, and "it's hard for policymakers to evaluate all the local arguments or know whether Yann LeCun is making more sense than Yoshua Bengio" is a serious constraint. If there's a specific argument (or e.g. a specific three arguments) you think we should be emphasizing alongside "alignment is unsolved and looks hard", I'd be interested to hear your suggestion and your reasoning. https://www.lesswrong.com/posts/WXvt8bxYnwBYpy9oT/the-main-sources-of-ai-risk is a very long list and isn't optimized for policymakers, so I'm not sure what specific changes you have in mind here.

Comment by Rob Bensinger (RobbBB) on "AI Alignment" is a Dangerously Overloaded Term · 2023-12-17T00:31:56.694Z · LW · GW

I expect it makes it easier, but I don't think it's solved.

Comment by Rob Bensinger (RobbBB) on "AI Alignment" is a Dangerously Overloaded Term · 2023-12-16T23:01:28.581Z · LW · GW

Suppose you want to synthesize a lot of diamonds. Instead of giving an AI some lofty goal "maximize diamonds in an aligned way", why not a bunch of small, grounded ones?

  1. "Plan the factory layout of the diamond synthesis plant with these requirements".
  2. "Order the equipment needed, here's the payment credentials".
  3. "Supervise construction this workday comparing to original plans"
  4. "Given this step of the plan, do it"
  5. (Once the factory is built) "remove the output from diamond synthesis machine A53 and clean it".

That is how MIRI imagines a sane developer using just-barely-aligned AI to save the world. You don't build an open-ended maximizer and unleash it on the world to maximize some quantity that sounds good to you; that sounds insanely difficult. You carve out as many tasks as you can into concrete, verifiable chunks, and you build the weakest and most limited possible AI you can to complete each chunk, to minimize risk. (Though per faul_sname, you're likely to be pretty limited in how much you can carve up the task, given time will be a major constraint and there may be parts of the task you don't fully understand at the outset.)

Cf. The Rocket Alignment Problem. The point of solving the diamond maximizer problem isn't to go build the thing; it's that solving it is an indication that we've become less conceptually confused about real-world optimization and about aimable cognitive work. Being less conceptually confused about very basic aspects of problem-solving and goal-oriented reasoning means that you might be able to build some of your powerful AI systems out of building blocks that are relatively easy to analyze, test, design, predict, separate out into discrete modules, measure and limit the capabilities of, etc., etc.

That seems acceptable; industrial equipment causes accidents all the time, and the main thing is to limit the damage. Fences to limit the robots' operating area, timers that shut down control after a timeout, etc.

If everyone in the world chooses to permanently use very weak systems because they're scared of AI killing them, then yes, the impact of any given system failing will stay low. But that's not what's going to actually happen; many people will use more powerful systems, once they can, because they misunderstand the risks or have galaxy-brained their way into not caring about them (e.g. 'maybe humans don't deserve to live', 'if I don't do it someone else will anyway', 'if it's that easy to destroy the world then we're fucked anyway so I should just do the Modest thing of assuming nothing I do is that important'...).

The world needs some solution to the problem "if AI keeps advancing and more-powerful AI keeps proliferating, eventually someone will destroy the world with it". I don't know of a way to leverage AI to solve that problem without the AI being pretty dangerously powerful, so I don't think AI is going to get us out of this mess unless we make a shocking amount of progress on figuring out how to align more powerful systems. (Where "aligning" includes things like being able to predict in advance how pragmatically powerful your system is, and being able to carefully limit the ways in which it's powerful.)

Comment by Rob Bensinger (RobbBB) on "AI Alignment" is a Dangerously Overloaded Term · 2023-12-16T22:36:59.979Z · LW · GW

To be clear: The diamond maximizer problem is about getting specific intended content into the AI's goals ("diamonds" as opposed to some random physical structure it's maximizing), not just about building a stable maximizer.

Comment by Rob Bensinger (RobbBB) on "AI Alignment" is a Dangerously Overloaded Term · 2023-12-16T22:28:07.948Z · LW · GW

From briefly talking to Eliezer about this the other day, I think the story from MIRI's perspective is more like:

  • Back in 2001, we defined "Friendly AI" as "The field of study concerned with the production of human-benefiting, non-human-harming actions in Artificial Intelligence systems that have advanced to the point of making real-world plans in pursuit of goals."

We could have defined the goal more narrowly or generically than that, but that just seemed like an invitation to take your eye off the ball: if we aren't going to think about the question of how to get good long-run outcomes from powerful AI systems, who will?

And many of the technical and philosophical problems seemed particular to CEV, which seemed like an obvious sort of solution to shoot for: just find some way to leverage the AI's intelligence to solve the problem of extrapolating everyone's preferences in a reasonable way, and of aggregating those preferences fairly.

  • Come 2014, Stuart Russell and MIRI were both looking for a new term to replace "the Friendly AI problem", now that the field was starting to become a Real Thing. Both parties disliked Bostrom's "the control problem". In conversation, Russell proposed "the alignment problem", and MIRI liked it, so Russell and MIRI both started using the term in public.

Unfortunately, it gradually came to light that Russell and MIRI had understood "Friendly AI" to mean two moderately different things, and this disconnect now turned into a split between how MIRI used "(AI) alignment" and how Russell used "(value) alignment". (Which I think also influenced the split between Paul Christiano's "(intent) alignment" and MIRI's "(outcome) alignment".)

Russell's version of "friendliness/alignment" was about making the AI have good, human-deferential goals. But Creating Friendly AI 1.0 had been very explicit that "friendliness" was about good behavior, regardless of how that's achieved. MIRI's conception of "the alignment problem" (like Bostrom's "control problem") included tools like capability constraint and boxing, because the thing we wanted researchers to focus on was the goal of leveraging AI capabilities to get actually-good outcomes, whatever technical work that requires, not some proxy goal that might turn out to be surprisingly irrelevant.

Again, we wanted a field of people keeping their eye on the ball and looking for clever technical ways to get the job done, rather than a field that neglects some actually-useful technique because it doesn't fit their narrow definition of "alignment".

  • Meanwhile, developments like the rise of deep learning had updated MIRI that CEV was not going to be a realistic thing to shoot for with your first AI. We were still thinking of some version of CEV as the ultimate goal, but it now seemed clear that capabilities were progressing too quickly for humanity to have time to nail down all the details of CEV, and it was also clear that the approaches to AI that were winning out would be far harder to analyze, predict, and "aim" than 2001-Eliezer had expected. It seemed clear that if AI was going to help make the future go well, the first order of business would be to do the minimal thing to prevent other AIs from destroying the world six months later, with other parts of alignment/friendliness deferred to later.

I think considerations like this eventually trickled in to how MIRI used the term "alignment". Our first public writing reflecting the switch from "Friendly AI" to "alignment", our Dec. 2014 agent foundations research agenda, said:

We call a smarter-than-human system that reliably pursues beneficial goals “aligned with human interests” or simply “aligned.”

Whereas by July 2016, when we released a new research agenda that was more ML-focused, "aligned" was shorthand for "aligned with the interests of the operators".

In practice, we started using "aligned" to mean something more like "aimable" (where aimability includes things like corrigibility, limiting side-effects, monitoring and limiting capabilities, etc., not just "getting the AI to predictably tile the universe with smiley faces rather than paperclips"). Focusing on CEV-ish systems mostly seemed like a distraction, and an invitation to get caught up in moral philosophy and pie-in-the-sky abstractions, when "do a pivotal act" is legitimately a hugely more philosophically shallow topic than "implement CEV". Instead, we went out of our way to frame the challenge of alignment in a way that seemed almost comically simple and "un-philosophical", but that successfully captured all of the key obstacles: 'explain how to use an AI to cause there to exist two strawberries that are identical at the cellular level, without causing anything weird or disruptive to happen in the process'.

Since realistic pivotal acts still seemed pretty outside the Overton window (and since we were mostly focused on our own research at the time), we wrote up our basic thoughts about the topic on Arbital but didn't try to super-popularize it among rationalists or EAs. (Which unfortunately, I think, exacerbated a situation where the larger communities had very fuzzy models of the strategic situation, and fuzzy models of what the point even was of this "alignment research" thing; alignment research just became a thing-that-was-good-because-it-was-a-good, not a concrete part of a plan backchained from concrete real-world goals.)

I don't think MIRI wants to stop using "aligned" in the context of pivotal acts, and I also don't think MIRI wants to totally divorce the term from the original long-term goal of friendliness/alignment.

Turning "alignment" purely into a matter of "get the AI to do what a particular stakeholder wants" is good in some ways -- e.g., it clarifies that the level of alignment needed for pivotal acts could also be used to do bad things.

But from Eliezer's perspective, this move would also be sending a message to all the young Eliezers "Alignment Research is what you do if you're a serious sober person who thinks it's naive to care about Doing The Right Thing and is instead just trying to make AI Useful To Powerful People; if you want to aim for the obvious desideratum of making AI friendly and beneficial to the world, go join e/acc or something". Which does not seem ideal.

So I think my proposed solution would be to just acknowledge that 'the alignment problem' is ambiguous between three different (overlapping) efforts to figure out how to get good and/or intended outcomes from powerful AI systems:

  • intent alignment, which is about getting AIs to try to do what the AI thinks the user wants, and in practice seems to be most interested in 'how do we get AIs to be generically trying-to-be-helpful'.
  • "strawberry problem" alignment, which is about getting AIs to safely, reliably, and efficiently do a small number of specific concrete tasks that are very difficult, for the sake of ending the acute existential risk period.
  • CEV-style alignment, which is about getting AIs to fully figure out how to make the future good.

Plausibly it would help to have better names for the latter two things. The distinction is similar to "narrow value learning vs. ambitious value learning", but both problems (as MIRI thinks about them) are a lot more general than just "value learning", and there's a lot more content to the strawberry problem than to "narrow alignment", and more content to CEV than to "ambitious value learning" (e.g., CEV cares about aggregation across people, not just about extrapolation).

(Note: Take the above summary of MIRI's history with a grain of salt; I had Nate Soares look at this comment and he said "on a skim, it doesn't seem to quite line up with my recollections nor cut things along the joints I would currently cut them along, but maybe it's better than nothing".)

Comment by Rob Bensinger (RobbBB) on Rob B's Shortform Feed · 2023-12-14T20:58:40.013Z · LW · GW

In the context of a conversation with Balaji Srinivasan about my AI views snapshot, I asked Nate Soares what sorts of alignment results would impress him, and he said:

example thing that would be relatively impressive to me: specific, comprehensive understanding of models (with the caveat that that knowledge may lend itself more (and sooner) to capabilities before alignment). demonstrated e.g. by the ability to precisely predict the capabilities and quirks of the next generation (before running it)

i'd also still be impressed by simple theories of aimable cognition (i mostly don't expect that sort of thing to have time to play out any more, but if someone was able to come up with one after staring at LLMs for a while, i would at least be impressed)

fwiw i don't myself really know how to answer the question "technical research is more useful than policy research"; like that question sounds to me like it's generated from a place of "enough of either of these will save you" whereas my model is more like "you need both"

tho i'm more like "to get the requisite technical research, aim for uploads" at this juncture

if this was gonna be blasted outwards, i'd maybe also caveat that, while a bunch of this is a type of interpretability work, i also expect a bunch of interpretability work to strike me as fake, shallow, or far short of the bar i consider impressive/hopeful

(which is not itself supposed to be any kind of sideswipe; i applaud interpretability efforts even while thinking it's moving too slowly etc.)

Comment by Rob Bensinger (RobbBB) on Ability to solve long-horizon tasks correlates with wanting things in the behaviorist sense · 2023-11-25T00:09:34.014Z · LW · GW

I can come up with plans for destroying the world without wanting to do it, and other cognitive systems probably can too.

You're changing the topic to "can you do X without wanting Y?", when the original question was "can you do X without wanting anything at all?".

Nate's answer to nearly all questions of the form "can you do X without wanting Y?" is "yes", hence his second claim in the OP: "the wanting-like behavior required to pursue a particular training target X, does not need to involve the AI wanting X in particular".

I do need to answer that question using a goal-oriented search process. But my goal would be "answer Paul's question", not "destroy the world".

Your ultimate goal would be neither of those things; you're a human, and if you're answering Paul's question it's probably because you have other goals that are served by answering.

In the same way, an AI that's sufficiently good at answering sufficiently hard and varied questions would probably also have goals, and it's unlikely by default that "answer questions" will be the AI's primary goal.

Comment by Rob Bensinger (RobbBB) on AI as a science, and three obstacles to alignment strategies · 2023-10-27T19:03:33.789Z · LW · GW

The idea that an area of study is less scientific because the subject is inelegant is a blinkered view of what science is.

See my reply to Bogdan here. The issue isn't "inelegance"; we also lack an inelegant ability to predict or explain how particular ML systems do what they do.

Modern ML is less like modern chemistry, and more like ancient culinary arts and medicine. (Or "ancient culinary arts and medicine shortly after a cultural reboot", such that we have a relatively small number of recently-developed shallow heuristics and facts to draw on, rather than centuries of hard-earned experience.)

The opening sounds a lot like saying "aerodynamics used to be a science until people started building planes."

The reason this analogy doesn't land for me is that I don't think our epistemic position regarding LLMs is similar to, e.g., the Wright brothers' epistemic position regarding heavier-than-air flight.

The point Nate was trying to make with "ML is no longer a science" wasn't "boo current ML that actually works, yay GOFAI that didn't work". The point was exactly to draw a contrast between, e.g., our understanding of heavier-than-air flight and our understanding of how the human brain works. The invention of useful tech that interfaces with the brain doesn't entail that we understand the brain's workings in the way we've long understood flight; it depends on what the (actual or hypothetical) tech is.

Maybe a clearer way of phrasing it is "AI used to be failed science; now it's (mostly, outside of a few small oases) a not-even-attempted science". "Failed science" maybe makes it clearer that the point here isn't to praise the old approaches that didn't work; there's a more nuanced point being made.

Comment by Rob Bensinger (RobbBB) on AI as a science, and three obstacles to alignment strategies · 2023-10-27T18:39:02.308Z · LW · GW

Some of Nate’s quick thoughts (paraphrased), after chatting with him:

Nate isn’t trying to say that we have literally zero understanding of deep nets. What he’s trying to do is qualitatively point to the kind of high-level situation we’re in, in part because he thinks there is real interpretability progress, and when you’re working in the interpretability mines and seeing real advances it can be easy to miss the forest for the trees and forget how far we are from understanding what LLMs are doing. (Compared to, e.g., how well we can predict or post-facto-mechanistically-explain a typical system humans have engineered.)

Nobody's been able to call the specific capabilities of systems in advance. Nobody's been able to call the specific exploits in advance. Nobody's been able to build better cognitive algorithms by hand after understanding how the AI does things we can't yet code by hand. There is clearly some other level of understanding that is possible that we lack, and that we once sought, and that only the interpretability folks continue to seek.

E.g., think of that time Neel Nanda figured out how a small transformer does modular arithmetic (AXRP episode). If nobody had ever thought of that algorithm for an adder, we would have thereby learned a new algorithm for an adder. There are things that these AI systems are doing that aren’t just lots of stuff we know; there are levels of organization of understanding that give you the ability to predict how things work outside of the bands where we’ve observed them.

It seems trendy to declare that these deeper levels of understanding never existed in the first place and that seeking them is all ivory-tower stuff, but Nate thinks this point of view is missing a pretty important and central thread.

The missing thread isn’t trivial to put into words, but it includes things like: 

  • This sounds like the same sort of thing some people would say if they were staring at computer binary for the first time and didn't know about the code behind the scenes: "We have plenty of understanding beyond just how the CPU handles instructions; we understand how memory caching works and we have recognized patterns like the stack and the heap; talking as if there's some deeper level of organization is talking like a theorist when in fact this is an engineering problem." Those types of understanding aren't false, but they aren't the sort of understanding of someone who has comprehended the codebase they're looking at.
  • There are, predictably, things to learn here; the messiness and complexity of the real world doesn’t mean we already know the relevant principles. You don't need to understand everything about how a bird works in order to build an airplane; there are compressible principles behind how birds fly; if you understand what's going on you can build flying devices that have significantly more carrying capacity than a bird, and this holds true even if the practical engineering of an airplane requires a bunch of trial and error and messy engineering work.
  • A mind’s causal structure is allowed to be complicated; we can see the weights, but we don’t thereby have a mastery of the high-level patterns. In the case of humans, neuroscience hasn’t actually worked to give us a mastery of the high-level patterns the human brain is implementing.
  • Mystery is in the map, not in the territory; reductionism works. Not all sciences that can exist, already exist today.

Possibly the above pointers are only useful if you already grok the point we’re trying to make, and aren't so useful for communicating a new idea; but perhaps not.

Comment by Rob Bensinger (RobbBB) on Announcing MIRI’s new CEO and leadership team · 2023-10-17T20:16:25.228Z · LW · GW

I read and responded to some pieces of that post when it came out; I don't know whether Eliezer, Nate, etc. read it, and I'm guessing it didn't shift MIRI, except as one of many data points of the form "person X is now loudly in favor of a pause (and other people seem receptive), so maybe this is more politically tractable than we thought".

I'd say that Kerry Vaughan was the main person who started smashing this Overton window, and this started in April/May/June of 2022. By late December my recollection is that this public conversation was already fully in swing and MIRI had already added our voices to the "stop building toward AGI" chorus. (Though at that stage I think we were mostly doing this on general principle, for lack of any better ideas than "share our actual long-standing views and hope that helps somehow". Our increased optimism about policy solutions mostly came later, in 2023.)

That said, I bet Katja's post had tons of relevant positive effects even if it didn't directly shift MIRI's views.

Comment by Rob Bensinger (RobbBB) on Evaluating the historical value misspecification argument · 2023-10-08T06:19:28.692Z · LW · GW

Remember that MIRI was in the business of poking at theoretical toy problems and trying to get less conceptually confused about how you could in principle cleanly design a reliable, aimable reasoner. MIRI wasn't (and isn't) in the business of issuing challenges to capabilities researchers to build a working water-bucket-filler as soon as possible, and wasn't otherwise in the business of challenging people to race to AGI faster.

It wouldn't have occurred to me that someone might think 'can a deep net fill a bucket of water, in real life, without being dangerously capable' is a crucial question in this context; I'm not sure we ever even had the thought occur in our heads 'when might such-and-such DL technique successfully fill a bucket?'. It would seem just as strange to me as going to check the literature to make sure no GOFAI system ever filled a bucket of water.

(And while I think I understand why some see ChatGPT as a large positive update about alignment's difficulty, I hope it's also obvious why others, MIRI included, would not see it that way.)

Hacky approaches to alignment do count just as much as clean, scrutable, principled approaches -- the important thing is that the AGI transition goes well, not that it goes well and feels clean and tidy in the process. But in this case the messy empirical approach doesn't look to me like it actually lets you build a corrigible AI that can help with a pivotal act.

If general-ish DL methods were already empirically OK at filling water buckets in 2016, just as GOFAI already was in 2016, I suspect we still would have been happy to use the Fantasia example, because it's a simple well-known story that can help make the abstract talk of utility functions and off-switch buttons easier to mentally visualize and manipulate.

(Though now that I've seen the confusion the example causes, I'm more inclined to think that the strawberry problem is a better frame than the Fantasia example.)

Comment by Rob Bensinger (RobbBB) on Evaluating the historical value misspecification argument · 2023-10-08T06:15:00.052Z · LW · GW

I think the old school MIRI cauldron-filling problem pertained to pretty mundane, everyday tasks. No one said at the time that they didn’t really mean that it would be hard to get an AGI to do those things, that it was just an allegory for other stuff like the strawberry problem. They really seemed to believe, and said over and over again, that we didn’t know how to direct a general-purpose AI to do bounded, simple, everyday tasks without it wanting to take over the world. So this should be a big update to people who held that view, even if there are still arguably risks about OOD behavior.

As someone who worked closely with Eliezer and Nate at the time, including working with Eliezer and Nate on our main write-ups that used the cauldron example, I can say that this is definitely not what we were thinking at the time. Rather:

  • The point was to illustrate a weird gap in the expressiveness and coherence of our theories of rational agency: "fill a bucket of water" seems like a simple enough task, but it's bizarrely difficult to just write down a simple formal description of an optimization process that predictably does this (without any major side-effects, etc.). (For a sketch of the kind of naive formalization in question, see just after this list.)
    • (We can obviously stipulate "this thing is smart enough to do the thing we want, but too dumb to do anything dangerous", but the relevant notion of "smart enough" is not itself formal; we don't understand optimization well enough to formally define agents that have all the cognitive abilities we want and none of the abilities we don't want.)
  • The point of emphasizing "holy shit, this seems so easy and simple and yet we don't see a way to do it!" wasn't to issue a challenge to capabilities researchers to go cobble together a real-world AI that can fill a bucket of water without destroying the world. The point was to emphasize that corrigibility, low-impact problem-solving, 'real' satisficing behavior, etc. seem conceptually simple, and yet the concepts have no known formalism.
    • The hope was that someone would see the simple toy problems and go 'what, no way, this sounds easy', get annoyed/nerdsniped, run off to write some equations on a whiteboard, and come back a week or a year later with a formalism (maybe from some niche mathematical field) that works totally fine for this, and makes it easier to formalize lots of other alignment problems in simplified settings (e.g., with unbounded computation).
    • Or failing that, the hope was that someone might at least come up with a clever math hack that solves the immediate 'get the AI to fill the bucket and halt' problem and replaces this dumb-sounding theory question with a slightly deeper theory question.
  • By using a children's cartoon to illustrate the toy problem, we hoped to make it clearer that the genre here is "toy problem to illustrate a weird conceptual issue in trying to define certain alignment properties", not "robotics problem where we show a bunch of photos of factory robots and ask how we can build a good factory robot to refill water receptacles used in industrial applications".
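
To give a flavor of the difficulty (this is my own minimal sketch, not something from the original talks), the naive formalization people usually reach for first looks something like

\[
U(s) =
\begin{cases}
1 & \text{if the bucket is full in outcome } s,\\
0 & \text{otherwise.}
\end{cases}
\]

An expected-utility maximizer over this U picks whichever policy makes Pr(bucket full) as high as possible, so it never treats the task as "done": redundant buckets, grabbing extra resources, and resisting shutdown all nudge that probability a little closer to 1. And the obvious patches (a side-effect penalty term, a bonus for allowing shutdown) just hand the agent a new function to game, e.g., by causing or preventing the button press, rather than yielding an agent that fills the bucket and halts. The toy problem was to find any formalism that doesn't fall apart in one of these ways.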

Nate's version of the talk, which is mostly a more polished version of Eliezer's talk, is careful to liberally sprinkle in tons of qualifications like (emphasis added)

  • "... for systems that are sufficiently good at modeling their environment", 
  • 'if the system is smart enough to recognize that shutdown will lower its score',
  • "Relevant safety measures that don’t assume we can always outthink and outmaneuver the system...",

... to make it clearer that the general issue is powerful, strategic optimizers that have high levels of situational awareness, etc., not necessarily 'every system capable enough to fill a bucket of water' (or 'every DL system...').

Comment by Rob Bensinger (RobbBB) on Evaluating the historical value misspecification argument · 2023-10-06T07:25:58.420Z · LW · GW

I think this provides some support

??? What?? It's fine to say that this is a falsified prediction, but how does "Eliezer expected less NLP progress pre-ASI" provide support for "Eliezer thinks solving NLP is a major part of the alignment problem"?

I continue to be baffled at the way you're doing exegesis here, happily running with extremely tenuous evidence for P while dismissing contemporary evidence for not-P, and seeming unconcerned about the fact that Eliezer and Nate apparently managed to secretly believe P for many years without ever just saying it outright, and seeming equally unconcerned about the fact that Eliezer and Nate keep saying that your interpretation of what they said is wrong. (Which I also vouch for from having worked with them for ten years, separate from the giant list of specific arguments I've made. Good grief.)

At the very least, the two claims are consistent.

?? "Consistent" is very different from "supports"! Every off-topic claim by EY is "consistent" with Gallabytes' assertion.

Comment by Rob Bensinger (RobbBB) on Evaluating the historical value misspecification argument · 2023-10-06T02:09:03.334Z · LW · GW

using GOFAI methods

"Nope" to this part. I otherwise like this comment a lot!

Comment by Rob Bensinger (RobbBB) on Evaluating the historical value misspecification argument · 2023-10-06T02:03:48.672Z · LW · GW

The main thing I'm claiming is that MIRI said it would be hard to specify (for example, write into a computer) an explicit function that reflects the human value function with high fidelity, in the sense that judgements from this function about the value of outcomes fairly accurately reflect the judgements of ordinary humans. I think this is simply a distinct concept from the idea of getting an AI to understand human values. 

The key difference is the transparency and legibility of how the values are represented: if you solve the problem of value specification/value identification, that means you have an actual function that can tell you the value of any outcome. If you get an AI that merely understands human values, you can't necessarily use the AI to determine the value of any outcome, because, for example, the AI might lie to you, or simply stay silent.

Ah, this is helpful clarification! Thanks. :)

I don't think MIRI ever considered this an important part of the alignment problem, and I don't think we expect humanity to solve lots of the alignment problem as a result of having such a tool; but I think I better understand now why you think this is importantly different from "AI ever gets good at NLP at all".

don't know if your essay is the source of the phrase or whether you just titled it

I think I came up with that particular phrase (though not the idea, of course).

Comment by Rob Bensinger (RobbBB) on Evaluating the historical value misspecification argument · 2023-10-06T01:41:37.313Z · LW · GW
  • More "outer alignment"-like issues being given what seems/seemed to me like outsized focus compared to more "inner alignment"-like issues (although there has been a focus on both for as long as I can remember).

In retrospect I think we should have been more explicit about the importance of inner alignment; I think that we didn't do that in our introduction to corrigibility because it wasn't necessary for illustrating the problem and where we'd run into roadblocks.

Maybe a missing piece here is some explanation of why having a formal understanding of corrigibility might be helpful for actually training corrigibility into a system? (Helpful at all, even if it's not sufficient on its own.)

  • The attempts to think of "tricks" seeming to be focused on real-world optimization-targets to point at, rather than ways of extracting help with alignment somehow / trying to find techniques/paths/tricks for obtaining reliable oracles.

Aside from "concreteness can help make the example easier to think about when you're new to the topic", part of the explanation here might be "if the world is solved by AI, we do actually think it will probably be via doing some concrete action in the world (e.g., build nanotech), not via helping with alignment or building a system that only outputs English-language sentences".

  • Having utility functions so prominently/commonly be the layer of abstraction that is used[4].

I mean, I think utility functions are an extremely useful and basic abstraction. I think it's a lot harder to think about a lot of AI topics without invoking ideas like 'this AI thinks outcome X is better than outcome Y', or 'this AI's preferences come with different weights, which can't purely be reduced to what the AI believes'.

Comment by Rob Bensinger (RobbBB) on Evaluating the historical value misspecification argument · 2023-10-06T01:20:11.275Z · LW · GW

Suppose that I'm trying to build a smarter-than-human AI that has a bunch of capabilities (including, e.g., 'be good at Atari games'), and that has the goal 'maximize the amount of diamond in the universe'. It's true that current techniques let you provide greater than zero pressure in the direction of 'maximize the amount of diamond in the universe', but there are several important senses in which reality doesn't 'bite back' here:

  • If the AI acquires an unrelated goal (e.g., calculate as many digits of pi as possible), and acquires the belief 'I will better achieve my true goal if I maximize the amount of diamond' (e.g., because it infers that its programmer wants that, or just because an SGD-ish process nudged it in the direction of having such a belief), then there's no way in which reality punishes or selects against that AGI (relative to one that actually has the intended goal).
  • Things that make the AI better at some Atari games, will tend to make it better at other Atari games, but won't tend to make it care more about maximizing diamonds. More generally, things that make AI more capable tend to go together (especially once you get to higher levels of difficulty, generality, non-brittleness, etc.), whereas none of them go together with "terminally value a universe full of diamond".
  • If we succeed in partly instilling the goal into the AI (e.g., it now likes carbon atoms a lot), then this doesn't provide additional pressure for the AI to internalize the rest of the goal. There's no attractor basin where if you have half of human values, you're under more pressure to acquire the other half. In contrast, if you give AI high levels of capability in half the capabilities, it will tend to want all the rest of the capabilities too; and whatever keeps it from succeeding on general reasoning and problem-solving will also tend to keep it from succeeding on the narrow task you're trying to get it to perform. (More so to the extent the task is hard.)

(There are also separate issues, like 'we can't provide a training signal where we thumbs-down the AI destroying the world, because we die in those worlds'.)

Comment by Rob Bensinger (RobbBB) on Evaluating the historical value misspecification argument · 2023-10-06T00:54:14.296Z · LW · GW

Nate and Eliezer have already made some of the high-level points I wanted to make, but they haven't replied to a lot of the specific examples and claims in the OP, and I see some extra value in doing that. (Like, if you think Eliezer and Nate are being revisionist in their claims about what past-MIRI thought, then them re-asserting "no really, we used to believe X!" is less convincing than my responding in detail to the specific quotes Matt thinks support his interpretation, while providing examples of us saying the opposite.)

However, I distinctly recall MIRI people making a big deal about the value identification problem (AKA the value specification problem)

The Arbital page for "value identification problem" is a three-sentence stub, I'm not exactly sure what the term means on that stub (e.g., whether "pinpointing valuable outcomes to an advanced agent" is about pinpointing them in the agent's beliefs or in its goals), and the MIRI website gives me no hits for "value identification".

As for "value specification", the main resource where MIRI talks about that is https://intelligence.org/files/TechnicalAgenda.pdf, where we introduce the problem by saying:

A highly-reliable, error-tolerant agent design does not guarantee a positive impact; the effects of the system still depend upon whether it is pursuing appropriate goals.

A superintelligent system may find clever, unintended ways to achieve the specific goals that it is given. Imagine a superintelligent system designed to cure cancer which does so by stealing resources, proliferating robotic laboratories at the expense of the biosphere, and kidnapping test subjects: the intended goal may have been “cure cancer without doing anything bad,” but such a goal is rooted in cultural context and shared human knowledge.

It is not sufficient to construct systems that are smart enough to figure out the intended goals. Human beings, upon learning that natural selection “intended” sex to be pleasurable only for purposes of reproduction, do not suddenly decide that contraceptives are abhorrent. While one should not anthropomorphize natural selection, humans are capable of understanding the process which created them while being completely unmotivated to alter their preferences. For similar reasons, when developing AI systems, it is not sufficient to develop a system intelligent enough to figure out the intended goals; the system must also somehow be deliberately constructed to pursue them (Bostrom 2014, chap. 8).

So I don't think we've ever said that an important subproblem of AI alignment is "make AI smart enough to figure out what goals humans want"?

for example in this 2016 talk from Yudkowsky.

[footnote:] More specifically, in the talk, at one point Yudkowsky asks "Why expect that [alignment] is hard?" and goes on to tell a fable about programmers misspecifying a utility function, which then gets optimized by an AI with disastrous consequences. My best interpretation of this part of the talk is that he's saying the value identification problem is one of the primary reasons why alignment is hard. However, I encourage you to read the transcript yourself if you are skeptical of my interpretation.

I don't see him saying anywhere "the issue is that the AI doesn't understand human goals". In fact, the fable explicitly treats the AGI as being smart enough to understand English and have reasonable English-language conversations with the programmers:

With that said: What if programmers build an artificial general intelligence to optimize for smiles? Smiles are good, right? Smiles happen when good things happen.

During the development phase of this artificial general intelligence, the only options available to the AI might be that it can produce smiles by making people around it happy and satisfied. The AI appears to be producing beneficial effects upon the world, and it is producing beneficial effects upon the world so far.

Now the programmers upgrade the code. They add some hardware. The artificial general intelligence gets smarter. It can now evaluate a wider space of policy options—not necessarily because it has new motors, new actuators, but because it is now smart enough to forecast the effects of more subtle policies. It says, “I thought of a great way of producing smiles! Can I inject heroin into people?” And the programmers say, “No! We will add a penalty term to your utility function for administering drugs to people.” And now the AGI appears to be working great again.

They further improve the AGI. The AGI realizes that, OK, it doesn’t want to add heroin anymore, but it still wants to tamper with your brain so that it expresses extremely high levels of endogenous opiates. That’s not heroin, right?

It is now also smart enough to model the psychology of the programmers, at least in a very crude fashion, and realize that this is not what the programmers want. If I start taking initial actions that look like it’s heading toward genetically engineering brains to express endogenous opiates, my programmers will edit my utility function. If they edit the utility function of my future self, I will get less of my current utility. (That’s one of the convergent instrumental strategies, unless otherwise averted: protect your utility function.) So it keeps its outward behavior reassuring. Maybe the programmers are really excited, because the AGI seems to be getting lots of new moral problems right—whatever they’re doing, it’s working great!

I think the point of the smiles example here isn't "NLP is hard, so we'd use the proxy of smiles instead, and all the issues of alignment are downstream of this"; rather, it's that as a rule, superficially nice-seeming goals that work fine when the AI is optimizing weakly (whether or not it's good at NLP at the time) break down when those same goals are optimized very hard. The smiley example makes this obvious because the goal is simple enough that it's easy for us to see what its implications are; far more complex goals also tend to break down when optimized hard enough, but this is harder to notice because the implications are harder to see. (Which is why "smiley" is used here.)

MIRI people frequently claimed that solving the value identification problem would be hard, or at least non-trivial.[6] For instance, Nate Soares wrote in his 2016 paper on value learning, that "Human preferences are complex, multi-faceted, and often contradictory. Safely extracting preferences from a model of a human would be no easy task."

That link is broken; the paper is https://intelligence.org/files/ValueLearningProblem.pdf. The full paragraph here is:

Human preferences are complex, multi-faceted, and often contradictory. Safely extracting preferences from a model of a human would be no easy task. Problems of ontology identification recur here: the framework for extracting preferences and affecting outcome ratings needs to be robust to drastic changes in the learner’s model of the operator. The special-case identification of the “operator model” must survive as the system goes from modeling the operator as a simple reward function to modeling the operator as a fuzzy, ever-changing part of reality built out of biological cells—which are made of atoms, which arise from quantum fields.

Revisiting the Ontology Identification section helps clarify what Nate means by "safely extracting preferences from a model of a human": IIUC, he's talking about a programmer looking at an AI's brain, identifying the part of the AI's brain that is modeling the human, identifying the part of the AI's brain that is "the human's preferences" within that model of a human, and then manually editing the AI's brain to "hook up" the model-of-a-human-preference to the AI's goals/motivations, in such a way that the AI optimizes for what it models the humans as wanting. (Or some other, less-toy process that amounts to the same thing -- e.g., one assisted by automated interpretability tools.)

In this toy example, we can assume that the programmers look at the structure of the initial world-model and hard-code a tool for identifying the atoms within. What happens, then, if the system develops a nuclear model of physics, in which the ontology of the universe now contains primitive protons, neutrons, and electrons instead of primitive atoms? The system might fail to identify any carbon atoms in the new world-model, making the system indifferent between all outcomes in the dominant hypothesis. Its actions would then be dominated by any tiny remaining probabilities that it is in a universe where fundamental carbon atoms are hiding somewhere.

[...]

To design a system that classifies potential outcomes according to how much diamond is in them, some mechanism is needed for identifying the intended ontology of the training data within the potential outcomes as currently modeled by the AI. This is the ontology identification problem introduced by de Blanc [2011] and further discussed by Soares [2015].

This problem is not a traditional focus of machine learning work. When our only concern is that systems form better world-models, then an argument can be made that the nuts and bolts are less important. As long as the system’s new world-model better predicts the data than its old world-model, the question of whether diamonds or atoms are “really represented” in either model isn’t obviously significant. When the system needs to consistently pursue certain outcomes, however, it matters that the system’s internal dynamics preserve (or improve) its representation of which outcomes are desirable, independent of how helpful its representations are for prediction. The problem of making correct choices is not reducible to the problem of making accurate predictions.

Inductive value learning requires the construction of an outcome-classifier from value-labeled training data, but it also requires some method for identifying, inside the states or potential states described in its world-model, the referents of the labels in the training data.

As Nate and I noted in other comments, the paper repeatedly clarifies that the core issue isn't about whether the AI is good at NLP. Quoting the paper's abstract:

Even a machine intelligent enough to understand its designers’ intentions would not necessarily act as intended. 

And the lede section:

The novelty here is not that programs can exhibit incorrect or counter-intuitive behavior, but that software agents smart enough to understand natural language may still base their decisions on misrepresentations of their programmers’ intent. The idea of superintelligent agents monomaniacally pursuing “dumb”-seeming goals may sound odd, but it follows from the observation of Bostrom and Yudkowsky [2014, chap. 7] that AI capabilities and goals are logically independent.[1] Humans can fully comprehend that their “designer” (evolution) had a particular “goal” (reproduction) in mind for sex, without thereby feeling compelled to forsake contraception. Instilling one’s tastes or moral values into an heir isn’t impossible, but it also doesn’t happen automatically.

Back to your post:

And to be clear, I don't mean that GPT-4 merely passively "understands" human values. I mean that asking GPT-4 to distinguish valuable and non-valuable outcomes works pretty well at approximating the human value function in practice

I don't think I understand what difference you have in mind here, or why you think it's important. Doesn't "this AI understands X" more-or-less imply "this AI can successfully distinguish X from not-X in practice"?

This fact is key to what I'm saying because it means that, in the near future, we can literally just query multimodal GPT-N about whether an outcome is bad or good, and use that as an adequate "human value function". That wouldn't solve the problem of getting an AI to care about maximizing the human value function, but it would arguably solve the problem of creating an adequate function that we can put into a machine to begin with.

But we could already query the human value function by having the AI system query an actual human. What specific problem is meant to be solved by swapping out "query a human" for "query an AI"?

I interpret this passage as saying that 'the problem' is extracting all the judgements that "you would make", and putting that into a wish. I think he's implying that these judgements are essentially fully contained in your brain. I don't think it's credible to insist he was referring to a hypothetical ideal human value function that ordinary humans only have limited access to, at least in this essay.

Absolutely. But as Eliezer clarified in his reply, the issue he was worried about was getting specific complex content into the agent's goals, not getting specific complex content into the agent's beliefs. Which is maybe clearer in the 2011 paper where he gave the same example and explicitly said that the issue was the agent's "utility function".

For example, a straightforward reading of Nate Soares' 2017 talk supports this interpretation. In the talk, Soares provides a fictional portrayal of value misalignment, drawing from the movie Fantasia. In the story, Mickey Mouse attempts to instruct a magical broom to fill a cauldron, but the broom follows the instructions literally rather than following what Mickey Mouse intended, and floods the room. Soares comments: "I claim that as fictional depictions of AI go, this is pretty realistic."

As I said in another comment:

"Fill the cauldron" examples are examples where the cauldron-filler has the wrong utility function, not examples where it has the wrong beliefs. E.g., this is explicit in https://intelligence.org/2016/12/28/ai-alignment-why-its-hard-and-where-to-start/ 

The idea of the "fill the cauldron" examples isn't "the AI is bad at NLP and therefore doesn't understand what we mean when we say 'fill', 'cauldron', etc." It's "even simple small-scale tasks are unnatural, in the sense that it's hard to define a coherent preference ordering over world-states such that maximizing it completes the task and has no serious negative impact; and there isn't an obvious patch that overcomes the unnaturalness or otherwise makes it predictably easier to aim AI systems at a bounded low-impact task like this". (Including easier to aim via training.)

It's true that 'value is relatively complex' is part of why it's hard to get the right goal into an AGI; but it doesn't follow from this that 'AI is able to develop pretty accurate beliefs about our values' helps get those complex values into the AGI's goals. (It does provide nonzero evidence about how complex value is, but I don't see you arguing that value is very simple in any absolute sense, just that it's simple enough for GPT-4 to learn decently well. Which is not reassuring, because GPT-4 is able to learn a lot of very complicated things, so this doesn't do much to bound the complexity of human value.)

In any case, I take this confusion as evidence that the fill-the-cauldron example might not be very useful. Or maybe all these examples just need to explicitly specify, going forward, that the AI is par-human at understanding English.

Perhaps more important to my point, Soares presented a clean separation between the part where we specify an AI's objectives, and the part where the AI tries to maximize those objectives. He draws two arrows, indicating that MIRI is concerned about both parts.

Your image isn't displaying for me, but I assume it's this one?

[image: vl-argmax.png]

I don't know what you mean by "specify an AI's objectives" here, but the specific term Nate uses here is "value learning" (not "value specification" or "value identification"). And Nate's Value Learning Problem paper, as I noted above, explicitly disclaims that 'get the AI to be smart enough to output reasonable-sounding moral judgments' is a core part of the problem.

He states, "The serious question with smarter-than-human AI is how we can ensure that the objectives we’ve specified are correct, and how we can minimize costly accidents and unintended consequences in cases of misspecification." I believe this quote refers directly to the value identification problem, rather than the problem of getting an AI to care about following the goals we've given it.

The way you quoted this makes it sound like a gloss on the image, but it's actually a quote from the very start of the talk:

The notion of AI systems “breaking free” of the shackles of their source code or spontaneously developing human-like desires is just confused. The AI system is its source code, and its actions will only ever follow from the execution of the instructions that we initiate. The CPU just keeps on executing the next instruction in the program register. We could write a program that manipulates its own code, including coded objectives. Even then, though, the manipulations that it makes are made as a result of executing the original code that we wrote; they do not stem from some kind of ghost in the machine.

The serious question with smarter-than-human AI is how we can ensure that the objectives we’ve specified are correct, and how we can minimize costly accidents and unintended consequences in cases of misspecification. As Stuart Russell (co-author of Artificial Intelligence: A Modern Approach) puts it:

The primary concern is not spooky emergent consciousness but simply the ability to make high-quality decisions. Here, quality refers to the expected outcome utility of actions taken, where the utility function is, presumably, specified by the human designer. Now we have a problem:

1. The utility function may not be perfectly aligned with the values of the human race, which are (at best) very difficult to pin down.

2. Any sufficiently capable intelligent system will prefer to ensure its own continued existence and to acquire physical and computational resources – not for their own sake, but to succeed in its assigned task. [...]

I wouldn't read too much into the word choice here, since I think it's just trying to introduce the Russell quote, which is (again) explicitly about getting content into the AI's goals, not about getting content into the AI's beliefs.

(In general, I think the phrase "value specification" is sort of confusingly vague. I'm not sure what the best replacement is for it -- maybe just "value loading", following Bostrom? -- but I suspect MIRI's usage of it has been needlessly confusing. Back in 2014, we reluctantly settled on it as jargon for "the part of the alignment problem that isn't subsumed in getting the AI to reliably maximize diamonds", because this struck us as a smallish but nontrivial part of the problem; but I think it's easy to read the term as referring to something a lot more narrow.)

The point of "the genie knows but doesn't care" wasn't that the AI would take your instructions, know what you want, and yet disobey the instructions because it doesn't care about what you asked for. If you read Rob Bensinger's essay carefully, you'll find that he's actually warning that the AI will care too much about the utility function you gave it, and maximize it exactly, against your intentions[10].

Yep -- I think I'd have endorsed claims like "by default, a baby AGI won't share your values even if it understands them" at the time, but IIRC the essay doesn't make that point explicitly, and some of the points it does make seem either false (wait, we're going to be able to hand AGI a hand-written utility function? that's somehow tractable?) or confusingly written. (Like, if my point was 'even if you could hand-write a utility function, this fails at point X', I should have made that 'even if' louder.)

Some MIRI staff liked that essay at the time, so I don't think it's useless, but it's not the best evidence: I wrote it not long after I first started learning about this whole 'superintelligence risk' thing, and I posted it before I'd ever worked at MIRI.

Comment by Rob Bensinger (RobbBB) on Evaluating the historical value misspecification argument · 2023-10-05T23:04:07.138Z · LW · GW

"Fill the cauldron" examples are examples where the cauldron-filler has the wrong utility function, not examples where it has the wrong beliefs. E.g., this is explicit in https://intelligence.org/2016/12/28/ai-alignment-why-its-hard-and-where-to-start/ 

The idea of the "fill the cauldron" examples isn't "the AI is bad at NLP and therefore doesn't understand what we mean when we say 'fill', 'cauldron', etc." It's "even simple small-scale tasks are unnatural, in the sense that it's hard to define a coherent preference ordering over world-states such that maximizing it completes the task and has no serious negative impact; and there isn't an obvious patch that overcomes the unnaturalness or otherwise makes it predictably easier to aim AI systems at a bounded low-impact task like this". (Including easier to aim via training.)

Comment by Rob Bensinger (RobbBB) on Evaluating the historical value misspecification argument · 2023-10-05T22:46:11.842Z · LW · GW

Straw-EY: Complexity of value means you can't just get the make-AI-care part to happen by chance; it's a small target.

Straw-MB: Ok but now we have a very short message pointing to roughly human values: just have a piece of code that says "and now call GPT and ask it what's good". So now it's a very small number of bits.

To which I say: "dial a random phone number and ask the person who answers what's good" can also be implemented with a small number of bits. In order for GPT-4 to be a major optimistic update about alignment, we need some specific way to leverage GPT-4 to crack open part of the alignment problem, even though we presumably agree that phone-a-friend doesn't crack open part of the alignment problem. (Nor does phone-your-neighborhood-moral-philosopher, or phone-Paul-Christiano.)
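
To spell out the "small number of bits" point, here's a toy illustration (mine, not anyone's actual proposal; both helper functions are hypothetical stubs):

```python
# Both "value oracles" below are a few lines long, so the shortness of the
# pointer can't by itself be what makes the GPT version an optimistic update.

def ask_language_model(question: str) -> str:
    """Stand-in for a call to some chat-model API (hypothetical stub)."""
    return "stub answer from a language model"

def ask_random_stranger(question: str) -> str:
    """Stand-in for dialing a random phone number and asking whoever answers."""
    return "stub answer from a random human"

def value_oracle_via_gpt(outcome: str) -> str:
    # Short pointer #1: defer the judgment to a language model.
    return ask_language_model(f"Is this outcome good or bad? {outcome}")

def value_oracle_via_phone(outcome: str) -> str:
    # Short pointer #2: defer the judgment to a random human.
    return ask_random_stranger(f"Is this outcome good or bad? {outcome}")

if __name__ == "__main__":
    print(value_oracle_via_gpt("a universe tiled with tiny diamonds"))
    print(value_oracle_via_phone("a universe tiled with tiny diamonds"))
```

In both cases the wrapper is trivially short; the open question is how to get an agent to reliably care about the output of either oracle, which is the part neither wrapper addresses.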

Comment by Rob Bensinger (RobbBB) on Evaluating the historical value misspecification argument · 2023-10-05T22:39:49.031Z · LW · GW

Why would we expect the first thing to be so hard compared to the second thing?

In large part because reality "bites back" when an AI has false beliefs, whereas it doesn't bite back when an AI has the wrong preferences. Deeply understanding human psychology (including our morality), astrophysics, biochemistry, economics, etc. requires reasoning well, and if you have a defect of reasoning that makes it hard for you to learn about one of those domains from the data, then it's likely that you'll have large defects of reasoning in other domains as well.

The same isn't true for terminally valuing human welfare; being less moral doesn't necessarily mean that you'll be any worse at making astrophysics predictions, or economics predictions, etc. So preferences need to be specified "directly", in a targeted way, rather than coming for free with sufficiently good performance on any of a wide variety of simple metrics.

If getting a model to understand preferences is not difficult, then the issue doesn't have to do with the complexity of values.

This definitely doesn't follow. This shows that complexity alone isn't the issue, which it's not; but given that reality bites back for beliefs but not for preferences, the complexity of value serves as a multiplier on the difficulty of instilling the right preferences.

Another way of putting the point: in order to get a maximally good model of the world's macroeconomic state into an AGI, you don't just hand the AGI a long list of macroeconomic facts and then try to get it to regurgitate those same facts. Rather, you try to give it some ability to draw good inferences, seek out new information, make predictions, etc.

You try to get something relatively low-complexity into the AI (something like "good reasoning heuristics" plus "enough basic knowledge to get started"), and then let it figure out the higher-complexity thing ("the world's macroeconomic state"). Similar to how human brains don't work via "evolution built all the facts we'd need to know into our brain at birth".

If you were instead trying to get the AI to value some complex macroeconomic state, then you wouldn't be able to use the shortcut "just make it good at reasoning and teach it a few basic facts", because that doesn't actually suffice for terminally valuing any particular thing.

It would have to be because our values are inferior to the set of values it wishes to have instead, from its own perspective. 

This is true for preference orderings in general. If agent A and agent B have two different preference orderings, then as a rule A will think B's preference ordering is worse than A's. (And vice versa.)

("Worse" in the sense that, e.g., A would not take a pill to self-modify to have B's preferences, and A would want B to have A's preferences. This is not true for all preference orderings -- e.g., A might have self-referential preferences like "I eat all the jelly beans", or other-referential preferences like "B gets to keep its values unchanged", or self-undermining preferences like "A changes its preferences to better match B's preferences". But it's true as a rule.)

This is kind of similar to moral realism, but in which morality is understood better by superintelligent agents than we do, and that super-morality appears to dictate things that appear to be extremely wrong from our current perspective (like killing us all). 

Nope, you don't need to endorse any version of moral realism in order to get the "preference orderings tend to endorse themselves and disendorse other preference orderings" consequence. The idea isn't that ASI would develop an "inherently better" or "inherently smarter" set of preferences, compared to human preferences. It's just that the ASI would (as a strong default, because getting a complex preference into an ASI is hard) end up with different preferences than a human, and different preferences than we'd likely want.

In a nutshell, if we really seem to want certain values, then those values probably have strong "proofs" for why those are "good" or more probable values for an agent to have and-or eventually acquire on their own, it just may be the case that we haven't yet discovered the proofs for those values. 

Why do you think this? To my eye, the world looks as you'd expect if human values were a happenstance product of evolution operating on specific populations in a specific environment.

I don't observe the fact that I like vanilla ice cream and infer that all sufficiently-advanced alien species will converge on liking vanilla ice cream too.

Comment by Rob Bensinger (RobbBB) on Evaluating the historical value misspecification argument · 2023-10-05T22:03:52.428Z · LW · GW

I appreciate the example!

Are you claiming that this example solves "a major part of the problem" of alignment? Or that, e.g., this plus four other easy ideas solve a major part of the problem of alignment?

Examples like the Visible Thoughts Project show that MIRI has been interested in research directions that leverage recent NLP progress to try to make inroads on alignment. But Matthew's claim seems to be 'systems like GPT-4 are grounds for being a lot more optimistic about alignment', and your claim is that systems like these solve "a major part of the problem". Which is different from thinking 'NLP opens up some new directions for research that have a nontrivial chance of being at least a tiny bit useful, but doesn't crack open the problem in any major way'.

It's not a coincidence that MIRI has historically worked on problems related to AGI analyzability / understandability / interpretability, rather than working on NLP or machine ethics. We've pretty consistently said that:

  • The main problems lie in 'we can safely and reliably aim ASI at a specific goal at all'.
  • The problem of going from 'we can aim the AI at a goal at all' to 'we can aim the AI at the right goal (e.g., corrigibly inventing nanotech)' is a smaller but nontrivial additional step.

... Whereas I don't think we've ever suggested that good NLP AI would take a major bite out of either of those problems. The latter problem isn't equivalent to (or an obvious result of) 'get the AI to understand corrigibility and nanotech', or for that matter 'get the AI to understand human preferences in general'.

Comment by Rob Bensinger (RobbBB) on Evaluating the historical value misspecification argument · 2023-10-05T21:22:36.427Z · LW · GW

Historically you very clearly thought that a major part of the problem is that AIs would not understand human concepts and preferences until after or possibly very slightly before achieving superintelligence. This is not how it seems to have gone.

"You very clearly thought that was a major part of the problem" implies that if you could go to Eliezer-2008 and convince him "we're going to solve a lot of NLP a bunch of years before we get to ASI", he would respond with some version of "oh great, that solves a major part of the problem!". Which I'm pretty sure is false.

In order for GPT-4 (or GPT-2) to be a major optimistic update about alignment, there needs to be a way to leverage "really good NLP" to help with alignment. I think the crux of disagreement is that you think really-good-NLP is obviously super helpful for alignment and should be a big positive update, and Eliezer and Nate and I disagree.

Maybe a good starting point would be for you to give examples of concrete ways you expect really good NLP to put humanity in a better position to wield superintelligence, e.g., if superintelligence is 8 years away?

(Or say some other update we should be making on the basis of "really good NLP today", like "therefore we'll probably unlock this other capability X well before ASI, and X likely makes alignment a lot easier via concrete pathway Y".)

Comment by Rob Bensinger (RobbBB) on Evaluating the historical value misspecification argument · 2023-10-05T21:13:27.801Z · LW · GW

But if you had asked us back then if a superintelligence would automatically be very good at predicting human text outputs, I guarantee we would have said yes. [...] I wish that all of these past conversations were archived to a common place, so that I could search and show you many pieces of text which would talk about this critical divide between prediction and preference (as I would now term it) and how I did in fact expect superintelligences to be able to predict things!

Quoting myself in April:

"MIRI's argument for AI risk depended on AIs being bad at natural language" is a weirdly common misunderstanding, given how often we said the opposite going back 15+ years.

E.g., Nate Soares in 2016: https://intelligence.org/files/ValueLearningProblem.pdf

[image: screenshot of the relevant passage from the linked paper]

Or Eliezer Yudkowsky in 2008, critiquing his own circa-1997 view "sufficiently smart AI will understand morality, and therefore will be moral": https://www.lesswrong.com/s/SXurf2mWFw8LX2mkG/p/CcBe9aCKDgT5FSoty 

[image: screenshot of the relevant passage from the linked post]

(The response being, in short: "Understanding morality doesn't mean that you're motivated to follow it.")

It was claimed by @perrymetzger that https://www.lesswrong.com/posts/4ARaTpNX62uaL86j6/the-hidden-complexity-of-wishes makes a load-bearing "AI is bad at NLP" assumption.

But the same example in https://intelligence.org/files/ComplexValues.pdf (2011) explicitly says that the challenge is to get the right content into a utility function, not into a world-model:

[image: screenshot of the relevant passage from the linked paper]

The example does build in the assumption "this outcome pump is bad at NLP", but this isn't a load-bearing assumption. If the outcome pump were instead a good conversationalist (or hooked up to one), you would still need to get the right content into its goals.

It's true that Eliezer and I didn't predict AI would achieve GPT-3 or GPT-4 levels of NLP ability so early (e.g., before it can match humans in general science ability), so this is an update to some of our models of AI.

But the specific update "AI is good at NLP, therefore alignment is easy" requires that there be an old belief like "a big part of why alignment looks hard is that we're so bad at NLP".

It should be easy to find someone at MIRI like Eliezer or Nate saying that in the last 20 years if that was ever a belief here. Absent that, an obvious explanation for why we never just said that is that we didn't believe it!

Found another example: MIRI's first technical research agenda, in 2014, went out of its way to clarify that the problem isn't "AI is bad at NLP".

[image: screenshot of the relevant passage from the 2014 research agenda]

Comment by Rob Bensinger (RobbBB) on Evaluating the historical value misspecification argument · 2023-10-05T20:53:08.929Z · LW · GW

That makes sense, but I say in the post that I think we will likely have a solution to the value identification problem that's "about as good as human judgement" in the near future.

We already have humans who are smart enough to do par-human moral reasoning. For "AI can do par-human moral reasoning" to help solve the alignment problem, there needs to be some additional benefit to having AI systems that can match a human (e.g., some benefit to our being able to produce enormous numbers of novel moral judgments without relying on an existing text corpus or hiring thousands of humans to produce them). Do you have some benefit in mind?

Comment by Rob Bensinger (RobbBB) on Evaluating the historical value misspecification argument · 2023-10-05T20:50:14.164Z · LW · GW

Basically, I think your later section--"Maybe you think"--is pointing in the right direction, and requiring a much higher standard than human-level at moral judgment is reasonable and consistent with the explicit standard set by essays by Yudkowsky and other MIRI people. CEV was about this; talk about philosophical competence or metaphilosophy was about this. "Philosophy with a deadline" would be a weird way to put it if you thought contemporary philosophy was good enough.

I don't think this is the crux. E.g., I'd wager the number of bits you need to get into an ASI's goals in order to make it corrigible is quite a bit smaller than the number of bits required to make an ASI behave like a trustworthy human, which in turn is way way smaller than the number of bits required to make an ASI implement CEV.

The issue is that (a) the absolute number of bits for each of these things is still very large, (b) insofar as we're training for deep competence and efficiency we're training against corrigibility (which makes it hard to hit both targets at once), and (c) we can't safely or efficiently provide good training data for a lot of the things we care about (e.g., 'if you're a superintelligence operating in a realistic-looking environment, don't do any of the things that destroy the world').

None of these points require that we (or the AI) solve novel moral philosophy problems. I'd be satisfied with an AI that corrigibly built scanning tech and efficient computing hardware for whole-brain emulation, then shut itself down; the AI plausibly doesn't even need to think about any of the world outside of a particular room, much less solve tricky questions of population ethics or whatever.

Comment by Rob Bensinger (RobbBB) on EA Vegan Advocacy is not truthseeking, and it’s everyone’s problem · 2023-10-02T00:43:33.797Z · LW · GW

I agreed based on how AI safety Twitter looked to me a year ago vs. today, not based on discussion here.

Comment by Rob Bensinger (RobbBB) on Sharing Information About Nonlinear · 2023-09-13T02:31:08.903Z · LW · GW

If @Rob Bensinger does in fact cross-post Linda's comment, I request he cross-posts this, too.

I was going to ask if I could!

I understand if people don't want to talk about it, but I do feel sad that there isn't some kind of public accounting of what happened there.

(Well, I don't concretely understand why people don't want to talk about it, but I can think of possibilities!)

Comment by Rob Bensinger (RobbBB) on Sharing Information About Nonlinear · 2023-09-13T01:15:08.227Z · LW · GW

This is formally correct.

(Though one of those updates might be a lot smaller than the other, if you've e.g. already thought about one of those topics a lot and reached a confident conclusion.)

Comment by Rob Bensinger (RobbBB) on Sharing Information About Nonlinear · 2023-09-13T00:54:00.188Z · LW · GW

Can I cross-post this to the EA Forum? (Or you can do it, if you prefer; but I think this is a really useful comment.)

Comment by Rob Bensinger (RobbBB) on Sharing Information About Nonlinear · 2023-09-08T17:37:42.050Z · LW · GW

(But insofar as you continue to be unsure about Ben, yes, you should be open to the possibility that Emerson has hidden information that justifies Emerson thinking Ben is being super dishonest. My confidence re "no hidden information like that" is downstream of my beliefs about Ben's character.)

Comment by Rob Bensinger (RobbBB) on Sharing Information About Nonlinear · 2023-09-08T17:33:31.365Z · LW · GW

Why do you think that's obvious?

I know Ben, I've conversed with him a number of times in the past and seen lots of his LW comments, and I have a very strong and confident sense of his priorities and values. I also read the post, which "shows its work" to such a degree that Ben would need to be unusually evil and deceptive in order for this post to be an act of deception.

I don't have any private knowledge about Nonlinear or about Ben's investigation, but I'm happy to vouch for Ben, such that if he turns out to have been lying, I ought to take a credibility hit too.

He's just a guy who hasn't been trained as an investigative journalist

If he were a random non-LW investigative journalist, I'd be a lot less confident in the post's honesty.

Number of hours invested in research does not necessarily correlate with objectivity of research

"Number of hours invested" doesn't prove Ben isn't a lying sociopath (heck, if you think that you can just posit that he's lying about the hours spent), but if he isn't a lying sociopath, it's strong evidence against negligence.

So, until we know a lot more about this case, I'll withhold judgment about who might or might not be deliberately asserting falsehoods.

That's totally fine, since as you say, you'd never heard of Ben until yesterday. (FWIW, I think he's one of the best rationalists out there, and he's a well-established Berkeley-rat community member who co-runs LessWrong and who tons of other veteran LWers can vouch for.)

My claim isn't "Geoffrey should be confident that Ben is being honest" (that maybe depends on how much stock you put in my vouching and meta-vouching here), but rather:

  1. I'm pretty sure Emerson doesn't have strong reason to think Ben isn't being honest here.
  2. If Emerson lacks strong reason to think Ben is being dishonest, then he definitely shouldn't have threatened to sue Ben.

E.g., I'm claiming here that you shouldn't sue someone for libel if you feel highly uncertain about whether they're being honest or dishonest. It's ethically necessary (though IMO not sufficient) that you feel pretty sure the other person is being super dishonest. And I'd be very surprised if Emerson has rationally reached that epistemic state (because I know Ben, and I expect he conducted himself in his interactions with Nonlinear the same way he normally conducts himself).

Comment by Rob Bensinger (RobbBB) on Sharing Information About Nonlinear · 2023-09-08T17:20:09.073Z · LW · GW

Actually, I do know of an example of y'all offering money to someone for defending an org you disliked and were suspicious of. @habryka, did that money get accepted?

(The incentive effects are basically the same whether it was accepted or not, as long as it's public knowledge that the money was offered; so it seems good to make this public if possible.)