Late 2021 MIRI Conversations: AMA / Discussion

post by Rob Bensinger (RobbBB) · 2022-02-28T20:03:05.318Z · LW · GW · 199 comments

With the release of Rohin Shah and Eliezer Yudkowsky's conversation [LW · GW], the Late 2021 MIRI Conversations sequence is now complete.

This post is intended as a generalized comment section for discussing the whole sequence, now that it's finished. Feel free to:

In particular, Eliezer Yudkowsky, Richard Ngo, Paul Christiano, Nate Soares, and Rohin Shah expressed active interest in receiving follow-up questions here. The Schelling time when they're likeliest to be answering questions is Wednesday March 2, though they may participate on other days too.


comment by Vaniver · 2022-03-01T20:48:52.603Z · LW(p) · GW(p)

This is mostly in response to stuff written by Richard, but I'm interested in everyone's read of the situation.

While I don't find Eliezer's core intuitions about intelligence too implausible, they don't seem compelling enough to do as much work as Eliezer argues they do. As in the Foom debate, I think that our object-level discussions were constrained by our different underlying attitudes towards high-level abstractions, which are hard to pin down (let alone resolve).

Given this, I think that the most productive mode of intellectual engagement with Eliezer's worldview going forward is probably not to continue debating it (since that would likely hit those same underlying disagreements), but rather to try to inhabit it deeply enough to rederive his conclusions and find new explanations of them which then lead to clearer object-level cruxes.

I'm not sure yet how to word this as a question without some introductory paragraphs. When I read Eliezer, I often feel like he has a coherent worldview that sees lots of deep connections and explains lots of things, and that he's actively trying to be coherent / explain everything. [This is what I think you're pointing to with his 'attitude towards high-level abstractions'.]

When I read other people, I often feel like they're operating in a 'narrower segment of their model', or not trying to fit the whole world at once, or something. They often seem to emit sentences that are 'not absurd', instead of 'on their mainline', because they're mostly trying to generate sentences that pass some shallow checks instead of 'coming from their complete mental universe.'

Why is this?

  • Just a difference in articulation or cultural style? (Like, people have complete mental models, they just aren't as good at or less interested in exposing the pieces as Eliezer is.)
  • A real difference in functioning? (Certainly there are sentences that I emit which are not 'on my mainline', because I'm trying to achieve some end besides the 'predict the world accurately' end, and while I think my mental universe has lots of detail and models I don't have the sense that it's as coherent as Eliezer's mental universe.)
  • The thing I think is happening with Eliezer is illusory? (In fact he's operating narrow models like everyone else, he just has more confidence that those models apply broadly.)

I notice I'm still a little stuck on this comment [LW(p) · GW(p)] from earlier, where I think Richard had a reasonable response to my complaint on the object-level (indeed, strong forces opposed to technological progress make sense, as does their not necessarily being rational or succeeding in every instance), but there's still some meta-level mismatch. Like, it feels to me like Eliezer was generating sentences on his mainline, and Richard was responding with 'since you're being overly pessimistic, I will be overly optimistic to balance', with no attempt to have his response match his own mainline. And then when Eliezer responded with:

But there's a really really basic lesson here about the different style of "sentences found in political history books" rather than "sentences produced by people imagining ways future politics could handle an issue successfully".

the subject got changed.

But I'm still deeply interested in the really really basic lesson, and how deeply it's been grokked by everyone involved!

Replies from: paulfchristiano, rohinmshah, Vaniver, ricraz, dxu
comment by paulfchristiano · 2022-03-02T15:20:27.793Z · LW(p) · GW(p)

I feel like I have a broad distribution over worlds and usually answer questions with probability distributions, that I have a complete mental universe (which feels to me like it outputs answers to a much broader set of questions than Eliezer's, albeit probabilistic ones, rather than bailing with "the future is hard to predict").  At a high level I don't think "mainline" is a great concept for describing probability distributions over the future except in certain exceptional cases (though I may not understand what "mainline" means), and I think that neat stories that fit everything usually don't work well (unless, or often even if, generated in hindsight).

In answer to your "why is this," I think it's a combination of moderate differences in functioning and large differences in communication style. I think Eliezer has a way of thinking about the future that is quite different from mine and I'm somewhat skeptical of and feel like Eliezer is overselling (which is what got me into this discussion), but that's probably smaller than a large difference in communication style (driven partly by different skills, different aesthetics, and different ideas about what kinds of standards discourse should aspire to).

I think I may not understand well the basic lesson / broader point, so will probably be more helpful on object level points and will mostly go answer those in the time I have.

Replies from: Vaniver
comment by Vaniver · 2022-03-03T01:57:40.612Z · LW(p) · GW(p)

I feel like I have a broad distribution over worlds and usually answer questions with probability distributions, that I have a complete mental universe (which feels to me like it outputs answers to a much broader set of questions than Eliezer's, albeit probabilistic ones, rather than bailing with "the future is hard to predict").

Sometimes I'll be tracking a finite number of "concrete hypotheses", where every hypothesis is 'fully fleshed out', and be doing a particle-filtering style updating process, where sometimes hypotheses gain or lose weight, sometimes they get ruled out or need to split, or so on. In those cases, I'm moderately confident that every 'hypothesis' corresponds to a 'real world', constrained by how well I can get my imagination to correspond to reality. [A 'finite number' depends on the situation, but I think it's normally something like 2-5, unless it's an area I've built up a lot of cache about.]
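
To make the analogy concrete, here is a minimal sketch in Python of that kind of particle-filter-style bookkeeping; the hypotheses, evidence, and likelihood numbers are invented purely for illustration.

```python
# Minimal sketch: a handful of fully-fleshed-out hypotheses carry weights,
# evidence reweights them, and hypotheses whose weight collapses get ruled out.

hypotheses = {
    # name -> [weight, P(evidence | hypothesis)]
    "slow takeoff":   [0.40, lambda e: 0.7 if e == "gradual capability gains" else 0.3],
    "fast takeoff":   [0.35, lambda e: 0.2 if e == "gradual capability gains" else 0.8],
    "no AGI by 2100": [0.25, lambda e: 0.5],  # mostly indifferent to this evidence
}

def update(hypotheses, evidence, drop_below=0.01):
    """Reweight each hypothesis by how well it predicted the evidence,
    renormalize, and rule out hypotheses whose weight collapses."""
    for entry in hypotheses.values():
        weight, likelihood = entry
        entry[0] = weight * likelihood(evidence)
    total = sum(weight for weight, _ in hypotheses.values())
    for name in list(hypotheses):
        hypotheses[name][0] /= total
        if hypotheses[name][0] < drop_below:
            del hypotheses[name]  # ruled out; a richer version might split instead
    return hypotheses

update(hypotheses, "gradual capability gains")
for name, (weight, _) in hypotheses.items():
    print(f"{name}: {weight:.2f}")
# slow takeoff: 0.59, fast takeoff: 0.15, no AGI by 2100: 0.26 (roughly)
```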

Sometimes I'll be tracking a bunch of "surface-level features", where the distributions on the features don't always imply coherent underlying worlds, either on their own or in combination with other features. (For example, I might have guesses about the probability that a random number is odd and a different guess about the probability that a random number is divisible by 3 and, until I deliberately consider the joint probability distribution, not have any guarantee that it'll be coherent.)
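
A toy version of that coherence check, taking the "random number" to be uniform on 1..12 (my own example, not anything from the dialogue): each marginal guess looks fine on its own, but a careless joint guess can contradict them.

```python
from fractions import Fraction

p_odd  = Fraction(1, 2)        # guess for P(number is odd)
p_div3 = Fraction(1, 3)        # guess for P(number is divisible by 3)
p_both_guess = Fraction(2, 5)  # careless guess for P(odd AND divisible by 3)

# A joint probability can never exceed either of its marginals.
coherent = p_both_guess <= min(p_odd, p_div3)
print(f"joint guess {p_both_guess} consistent with the marginals? {coherent}")  # False

# Deliberately considering the joint distribution instead:
numbers = range(1, 13)
p_both = Fraction(sum(1 for n in numbers if n % 2 == 1 and n % 3 == 0), 12)
print(f"actual P(odd and divisible by 3) = {p_both}")  # 1/6 (only 3 and 9 qualify)
```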

Normally I'm doing something more like a mixture of those, which I think of as particles of incomplete world models, with some features pinned down and others mostly 'surface-level features'. I can often simultaneously consider many more of these; like, when I'm playing Go, I might be tracking a dozen different 'lines of attack', which have something like 2-4 moves clearly defined and the others 'implied' (in a way that might not actually be consistent).

Are any of those like your experience? Or is there some other way you'd describe it?

different ideas about what kinds of standards discourse should aspire to

Have you written about this / could you? I'd be pretty excited about being able to try out discoursing with people in a Paul-virtuous way.

Replies from: paulfchristiano
comment by paulfchristiano · 2022-03-03T02:39:34.570Z · LW(p) · GW(p)

I think my way of thinking about things is often a lot like "draw random samples," more like drawing N random samples rather than particle filtering (I guess since we aren't making observations as we go---if I notice an inconsistency the thing I do is more like backtrack and start over with N fresh samples having updated on the logical fact).

The main complexity feels like the thing you point out where it's impossible to make them fully fleshed out, so you build a bunch of intuitions about what is consistent (and could be fleshed out given enough time) and then refine those intuitions only periodically when you actually try to flesh something out and see if it makes sense. And often you go even further and just talk about relationships amongst surface level features using intuitions refined from a bunch of samples.
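
A rough sketch of that sampling-and-backtracking loop (the world-features and the consistency rule below are invented for illustration, not anyone's actual model):

```python
import random

def sample_world(learned_constraints):
    """Sample surface features for one imagined world, respecting any
    logical facts learned so far."""
    while True:
        world = {
            "takeoff_years": random.choice([1, 5, 20]),
            "num_leading_labs": random.choice([1, 3, 10]),
        }
        if all(constraint(world) for constraint in learned_constraints):
            return world

def draw_samples(n, learned_constraints):
    return [sample_world(learned_constraints) for _ in range(n)]

learned_constraints = []
samples = draw_samples(5, learned_constraints)

# Suppose fleshing one sample out surfaces an inconsistency, e.g. "a 1-year
# takeoff doesn't leave room for 10 labs to stay at the frontier":
def no_crowded_sprint(world):
    return not (world["takeoff_years"] == 1 and world["num_leading_labs"] == 10)

learned_constraints.append(no_crowded_sprint)
# Backtrack: N fresh samples, having updated on the logical fact.
samples = draw_samples(5, learned_constraints)
print(samples)
```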

I feel like a distinctive feature of Eliezer's dialog w.r.t. foom / alignment difficulty is that he has a lot of views about strong regularities that should hold across all of these worlds. And then disputes about whether worlds are plausible often turn on things like "is this property of the described world likely?" which is tough because obviously everyone agrees that every particular world is unlikely. To Eliezer it seems obvious that the feature is improbable (because it was just produced by seeing where the world violated the strong regularity he believes in), whereas to the other person it just looks like one of many scenarios that is implausible only in its concrete details. And then this isn't well-resolved by "just talk about your mainline" because the "mainline" is a distribution over worlds which are all individually improbable (for either Eliezer or for others).

This is all a bit of a guess though / rambling speculation.

Replies from: Vaniver
comment by Vaniver · 2022-03-03T03:42:19.651Z · LW(p) · GW(p)

I think my way of thinking about things is often a lot like "draw random samples," more like drawing N random samples rather than particle filtering (I guess since we aren't making observations as we go---if I notice an inconsistency the thing I do is more like backtrack and start over with N fresh samples having updated on the logical fact).

Oh whoa, you don't remember your samples from before? [I guess I might not either, unless I'm concentrating on keeping them around or verbalized them or something; probably I do something more expert-iteration-like where I'm silently updating my generating distributions based on the samples and then resampling them in the future.]

To Eliezer it seems obvious that the feature is improbable (because it was just produced by seeing where the world violated the strong regularity he believes in), whereas to the other person it just looks like one of many scenarios that is implausible only in its concrete details. And then this isn't well-resolved by "just talk about your mainline" because the "mainline" is a distribution over worlds which are all individually improbable (for either Eliezer or for others).

Yeah, this seems likely; this makes me more interested in the "selectively ignoring variables" hypothesis for why Eliezer running this strategy might have something that would naturally be called a mainline. [Like, it's very easy to predict "number of apples sold = number of apples bought" whereas it's much harder to predict the price of apples.] But maybe instead he means it in the 'startup plan' sense, where you do actually assign basically no probability to your mainline prediction, but still vastly more than any other prediction that's equally conjunctive.

comment by Rohin Shah (rohinmshah) · 2022-03-02T16:43:56.506Z · LW(p) · GW(p)

EDIT: I wrote this before seeing Paul's response; hence a significant amount of repetition.

They often seem to emit sentences that are 'not absurd', instead of 'on their mainline', because they're mostly trying to generate sentences that pass some shallow checks instead of 'coming from their complete mental universe.'

Why is this?

Well, there are many boring cases that are explained by pedagogy / argument structure. When I say things like "in the limit of infinite oversight capacity, we could just understand everything about the AI system and reengineer it to be safe", I'm obviously not claiming that this is a realistic thing that I expect to happen, so it's not coming from my "complete mental universe"; I'm just using this as an intuition pump for the listener to establish that a sufficiently powerful oversight process would solve AI alignment.

That being said, I think there is a more interesting difference here, but that your description of it is inaccurate (at least for me).

From my perspective I am implicitly representing a probability distribution over possible futures in my head. When I say "maybe X happens", or "X is not absurd", I'm saying that my probability distribution assigns non-trivial probability to futures in which X happens. Notably, this is absolutely "coming from my complete mental universe" -- the probability distribution is all there is, there's no extra constraints that take 5% probabilities and drive them down to 0, or whatever else you might imagine would give you a "mainline".

As I understand it, when you "talk about the mainline", you're supposed to have some low-entropy (i.e. confident) view on how the future goes, such that you can answer very different questions X, Y and Z about that particular future, that are all correlated with each other, and all get (say) > 50% probability. (Idk, as I write this down, it seems so obviously a bad way to reason that I feel like I must not be understanding it correctly.)

But to the extent this is right, I'm actually quite confused why anyone thinks "talk about the mainline" is an ideal to which to aspire. What makes you expect that? It's certainly not privileged based on what we know about idealized rationality; idealized rationality tells you to keep a list of hypotheses that you perform Bayesian updating over. In that setting "talk about the mainline" sounds like "keep just one hypothesis and talk about what it says"; this is not going to give you good results [LW · GW]. Maybe more charitably it's "one hypothesis is going to stably get >50% probability and so you should think about that hypothesis a lot" but I don't see why that should be true.
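
For concreteness, here is a toy version (with invented numbers) of the contrast between "keep a list of hypotheses that you perform Bayesian updating over" and "keep just one hypothesis and talk about what it says":

```python
# Prior over three hypotheses, updated on two pieces of evidence.
posterior = {"H1": 0.5, "H2": 0.3, "H3": 0.2}
evidence_likelihoods = [
    {"H1": 0.2, "H2": 0.6, "H3": 0.5},   # P(first observation | hypothesis)
    {"H1": 0.3, "H2": 0.5, "H3": 0.7},   # P(second observation | hypothesis)
]

for likelihood in evidence_likelihoods:
    posterior = {h: p * likelihood[h] for h, p in posterior.items()}
    total = sum(posterior.values())
    posterior = {h: p / total for h, p in posterior.items()}

print({h: round(p, 2) for h, p in posterior.items()})
# {'H1': 0.16, 'H2': 0.47, 'H3': 0.37}: no hypothesis ever clears 50%, and
# "just talk about H1" (the initial leader) would have been misleading.
```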

Obviously some things do in fact get > 90% probability; if you ask me questions like "what's the probability that if it rains the sidewalk will be wet" I will totally have a mainline, and there will be edge cases like "what if the rain stopped at the boundary between the sidewalk and the road" but those will be mostly irrelevant. The thing that I am confused by is the notion that you should always have a mainline, especially about something as complicated and uncertain as the future.


I presume that there is an underlying unvoiced argument that goes "Rohin, you say that you have a probability distribution over futures; that implies that you have many, many different consistent worlds in mind, and you are uncertain about which one we're in, and when you are asked for the probability of X then you sum probabilities across each of the worlds where X holds. This seems wild; it's such a ridiculously complicated operation for a puny human brain to implement; there's no way you're doing this. You're probably just implementing some simpler heuristic where you look at some simple surface desideratum and go 'idk, 30%' out of modesty."

Obviously I do not literally perform the operation described above, like any bounded agent I have to approximate the ideal. But I do not then give up and say "okay, I'll just think about a single consistent world and drop the rest of the distribution", I do my best to represent the full range of uncertainty, attempting to have all of my probabilities on events ground out in specific worlds that I think are plausible, think about some specific worlds in greater detail to see what sorts of correlations arise between different important phenomena, carry out some consistency checks on the probabilities I assign to events to notice cases where I'm clearly making mistakes, etc. I don't see why "have a mainline" is obviously a better response to our boundedness than the approach I use (if anything, it seems obviously a worse response).

Replies from: So8res, johnswentworth, Vaniver, Vaniver
comment by So8res · 2022-03-02T17:19:48.024Z · LW(p) · GW(p)

In response to your last couple paragraphs: the critique, afaict, is not "a real human cannot keep multiple concrete scenarios in mind and speak probabilistically about those", but rather "a common method for representing lots of hypotheses at once is to decompose the hypotheses into component properties that can be used to describe lots of concrete hypotheses. (toy model: instead of imagining all numbers, you note that some numbers are odd and some numbers are even, and then think of evenness and oddness). A common failure mode when attempting this is that you lose track of which properties are incompatible (toy model: you claim you can visualize a number that is both even and odd). A way to avert this failure mode is to regularly exhibit at least one concrete hypothesis that simultaneously possesses whatever collection of properties you say you can simultaneously visualize (toy model: demonstrating that 14 is even and 7 is odd does not in fact convince me that you are correct to imagine a number that is both even and odd)."
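
A toy version of that check (my own illustration, sticking with the numbers example): a separate witness per property is not the same thing as one witness that has all the properties at once.

```python
def is_even(n):
    return n % 2 == 0

def is_odd(n):
    return n % 2 == 1

def separate_witnesses(properties, candidates):
    """Weak check: every property has *some* witness (14 is even, 7 is odd)."""
    return all(any(prop(c) for c in candidates) for prop in properties)

def joint_witness(properties, candidates):
    """Strong check: one concrete candidate satisfies every property at once."""
    return any(all(prop(c) for prop in properties) for c in candidates)

candidates = range(1, 100)
print(separate_witnesses([is_even, is_odd], candidates))  # True, but misleading
print(joint_witness([is_even, is_odd], candidates))       # False: no such number
```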

On my understanding of Eliezer's picture (and on my own personal picture), almost nobody ever visibly tries to do this (never mind succeeding), when it comes to hopeful AGI scenarios.

Insofar as you have thought about at least one specific hopeful world in great detail, I strongly recommend spelling it out, in all its great detail, to Eliezer next time you two chat. In fact, I personally request that you do this! It sounds great, and I expect it to constitute some progress in the debate.

Replies from: habryka4
comment by habryka (habryka4) · 2022-03-03T00:17:46.404Z · LW(p) · GW(p)

Relevant Feynman quote: 

I had a scheme, which I still use today when somebody is explaining something that I’m trying to understand: I keep making up examples.

For instance, the mathematicians would come in with a terrific theorem, and they’re all excited. As they’re telling me the conditions of the theorem, I construct something which fits all the conditions. You know, you have a set (one ball)-- disjoint (two balls). Then the balls turn colors, grow hairs, or whatever, in my head as they put more conditions on.

Finally they state the theorem, which is some dumb thing about the ball which isn’t true for my hairy green ball thing, so I say “False!” [and] point out my counterexample.

comment by johnswentworth · 2022-03-02T18:18:09.672Z · LW(p) · GW(p)

As I understand it, when you "talk about the mainline", you're supposed to have some low-entropy (i.e. confident) view on how the future goes, such that you can answer very different questions X, Y and Z about that particular future, that are all correlated with each other, and all get (say) > 50% probability. (Idk, as I write this down, it seems so obviously a bad way to reason that I feel like I must not be understanding it correctly.)

But to the extent this is right, I'm actually quite confused why anyone thinks "talk about the mainline" is an ideal to which to aspire. What makes you expect that?

I'll try to explain the technique and why it's useful. I'll start with a non-probabilistic version of the idea, since it's a little simpler conceptually, then talk about the corresponding idea in the presence of uncertainty.

Suppose I'm building a mathematical model of some system or class of systems. As part of the modelling process, I write down some conditions which I expect the system to satisfy - think energy conservation, or Newton's Laws, or market efficiency, depending on what kind of systems we're talking about. My hope/plan is to derive (i.e. prove) some predictions from these conditions, or maybe prove some of the conditions from others.

Before I go too far down the path of proving things from the conditions, I'd like to do a quick check that my conditions are consistent at all. How can I do that? Well, human brains are quite good at constrained optimization, so one useful technique is to look for one example of a system which satisfies all the conditions. If I can find one example, then I can be confident that the conditions are at least not inconsistent. And in practice, once I have that one example in hand, I can also use it for other purposes: I can usually see what (possibly unexpected) degrees of freedom the conditions leave open, or what (possibly unexpected) degrees of freedom the conditions don't leave open. By looking at that example, I can get a feel for the "directions" along which the conditions do/don't "lock in" the properties of the system.

(Note that in practice, we often start with an example to which we want our conditions to apply, and we choose the conditions accordingly. In that case, our one example is built in, although we do need to remember the unfortunately-often-overlooked step of actually checking what degrees of freedom the conditions do/don't leave open to the example.)

What would a probabilistic version of this look like? Well, we have a world model with some (uncertain) constraints in it - i.e. kinds-of-things-which-tend-to-happen, and kinds-of-things-which-tend-to-not-happen. Then, we look for an example which generally matches the kinds-of-things-which-tend-to-happen. If we can find such an example, then we know that the kinds-of-things-which-tend-to-happen are mutually compatible; a high probability for some of them does not imply a low probability for others. With that example in hand, we can also usually recognize which features of the example are very-nailed-down by the things-which-tend-to-happen, and which features have lots of freedom. We may, for instance, notice that there's some very-nailed-down property which seems unrealistic in the real world; I expect that to be the most common way for this technique to unearth problems.
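
A minimal sketch of what that check might look like (the features and the "tends to happen" regularities below are invented, not anyone's actual model): enumerate concrete worlds and look for one in which all the regularities hold at once, then read off which features it pins down.

```python
from itertools import product

features = {
    "takeoff": ["slow", "fast"],
    "coordination": ["weak", "strong"],
    "alignment_tax": ["low", "high"],
}

tends_to_happen = [
    ("fast takeoff",      lambda w: w["takeoff"] == "fast"),
    ("weak coordination", lambda w: w["coordination"] == "weak"),
    ("labs pay a high tax only if coordination is strong",
                          lambda w: w["alignment_tax"] == "low" or w["coordination"] == "strong"),
]

worlds = [dict(zip(features, combo)) for combo in product(*features.values())]
compatible = [w for w in worlds if all(check(w) for _, check in tends_to_happen)]

print(compatible)
# -> [{'takeoff': 'fast', 'coordination': 'weak', 'alignment_tax': 'low'}]
# One world satisfies everything, so these regularities are mutually consistent,
# and together they nail down a low alignment tax (a feature we can then ask
# whether we actually believe).
```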

That's the role a "mainline" prediction serves. Note that it does not imply the mainline has a high probability overall, nor does it imply a high probability that all of the things-which-tend-to-happen will necessarily occur simultaneously. It's checking whether the supposed kinds-of-things-which-tend-to-happen are mutually consistent with each other, and it provides some intuition for what degrees of freedom the kinds-of-things-which-tend-to-happen do/don't leave open.

Replies from: rohinmshah
comment by Rohin Shah (rohinmshah) · 2022-03-02T19:44:00.740Z · LW(p) · GW(p)

Man, I would not call the technique you described "mainline prediction". It also seems kinda inconsistent with Vaniver's usage; his writing suggests that a person only has one mainline at a time which seems odd for this technique.

Vaniver, is this what you meant? If so, my new answer is that I and others do in fact talk about "mainline predictions" -- for me, there was that whole section talking about natural language debate as an alignment strategy. (It ended up not being about a plausible world, but that's because (a) Eliezer wanted enough concreteness that I ended up talking about the stupidly inefficient version rather than the one I'd actually expect in the real world and (b) I was focused on demonstrating an existence proof for the technical properties, rather than also trying to include the social ones.)

Replies from: johnswentworth, Vaniver
comment by johnswentworth · 2022-03-02T19:49:23.022Z · LW(p) · GW(p)

To be clear, I do not mean to use the label "mainline prediction" for this whole technique. Mainline prediction tracking is one way of implementing this general technique, and I claim that the usefulness of the general technique is the main reason why mainline predictions are useful to track.

(Also, it matches up quite well with Nate's model based on his comment here [LW(p) · GW(p)], and I expect it also matches how Eliezer wants to use the technique.)

Replies from: rohinmshah
comment by Rohin Shah (rohinmshah) · 2022-03-02T20:00:41.377Z · LW(p) · GW(p)

Ah, got it. I agree that:

  1. The technique you described is in fact very useful
  2. If your probability distribution over futures happens to be such that it has a "mainline prediction", you get significant benefits from that (similar to the benefits you get from the technique you described).
comment by Vaniver · 2022-03-03T03:36:20.801Z · LW(p) · GW(p)

Man, I would not call the technique you described "mainline prediction". It also seems kinda inconsistent with Vaniver's usage; his writing suggests that a person only has one mainline at a time which seems odd for this technique.

Vaniver, is this what you meant?

Uh, I inherited "mainline" from Eliezer's usage in the dialogue, and am guessing that his reasoning is following a process sort of like mine and John's. My natural word for it is a 'particle', from particle filtering, as linked in various places, which I think is consistent with John's description. I'm further guessing that Eliezer's noticed more constraints / implied inconsistencies, and is somewhat better at figuring out which variables to drop, so that his cloud is narrower than mine / more generates 'mainline predictions' than 'probability distributions'.

If so, my new answer is that I and others do in fact talk about "mainline predictions" -- for me, there was that whole section talking about natural language debate as an alignment strategy.

Do you feel like you do this 'sometimes', or 'basically always'? Maybe it would be productive for me to reread the dialogue (or at least part of it) and sort sections / comments by how much they feel like they're coming from this vs. some other source. 

As a specific thing that I have in mind, I think there's a habit of thinking / discourse that philosophy trains, which is having separate senses for "views in consideration" and "what I believe", and thinking that statements should be considered against all views in consideration, even ones that you don't believe. This seems pretty good in some respects (if you begin by disbelieving a view incorrectly, your habits nevertheless gather you lots of evidence about it, which can cause you to then correctly believe it), and pretty questionable in other respects (conversations between Alice and Bob now have to include them shadowboxing with everyone else in the broader discourse, as Alice is asking herself "what would Carol say in response to that?" to things that Bob says to her).

When I imagine dialogues generated by people who are both sometimes doing the mainline thing and sometimes doing the 'represent the whole discourse' thing, they look pretty different from dialogues generated by people who are both only doing the mainline thing. [And also from dialogues generated by both people only doing the 'represent the whole discourse' thing, of course.]

Replies from: rohinmshah
comment by Rohin Shah (rohinmshah) · 2022-03-03T07:35:19.365Z · LW(p) · GW(p)

Do you feel like you do this 'sometimes', or 'basically always'?

I don't know what "this" refers to. If the referent is "have a concrete example in mind", then I do that frequently but not always. I do it a ton when I'm not very knowledgeable and learning about a thing; I do it less as my mastery of a subject increases. (Examples: when I was initially learning addition, I used the concrete example of holding up three fingers and then counting up two more to compute 3 + 2 = 5, which I do not do any more. When I first learned recursion, I used to explicitly run through an execution trace to ensure my program would work, now I do not.)

If the referent is "make statements that reflect my beliefs", then it depends on context, but in the context of these dialogues, I'm always doing that. (Whereas when I'm writing for the newsletter, I'm more often trying to represent the whole discourse, though the "opinion" sections are still entirely my beliefs.)

comment by Vaniver · 2022-03-03T02:57:30.720Z · LW(p) · GW(p)

whatever else you might imagine would give you a "mainline".

As I understand it, when you "talk about the mainline", you're supposed to have some low-entropy (i.e. confident) view on how the future goes, such that you can answer very different questions X, Y and Z about that particular future, that are all correlated with each other, and all get (say) > 50% probability. (Idk, as I write this down, it seems so obviously a bad way to reason that I feel like I must not be understanding it correctly.)

I think this is roughly how I'm thinking about things sometimes, tho I'd describe the mainline as the particle with plurality weight (which is a weaker condition than >50%). [I don't know how Eliezer thinks about things; maybe it's like this? I'd be interested in hearing his description.]

I think this is also a generator of disagreements about what sort of things are worth betting on; when I imagine why I would bail with "the future is hard to predict", it's because the hypotheses/particles I'm considering have clearly defined X, Y, and Z variables (often discretized into bins or ranges) but not clearly defined A, B, and C variables (tho they might have distributions over those variables), because if you also conditioned on those you would have Too Many Particles. And when I imagine trying to contrast particles on features A, B, and C, as they all make weak predictions we get at most a few bits of evidence to update their weights on, whereas when we contrast them on X, Y, and Z we get many more bits, and so it feels more fruitful to reason about.

But to the extent this is right, I'm actually quite confused why anyone thinks "talk about the mainline" is an ideal to which to aspire. What makes you expect that? It's certainly not privileged based on what we know about idealized rationality; idealized rationality tells you to keep a list of hypotheses that you perform Bayesian updating over.

I mean, the question is which direction we want to approach Bayesianism from, given that Bayesianism is impossible (as you point out later in your comment). On the one hand, you could focus on 'updating', and have lots of distributions that aren't grounded in reality but which are easy to massage when new observations come in, and on the other hand, you could focus on 'hypotheses', and have as many models of the situation as you can ground, and then have to do something much more complicated when new observations come in.

[Like, a thing I find helpful to think about here is where the motive power of Aumann's Agreement Theorem comes from: when I say 40% A, you know that my private info is consistent with an update of the shared prior whose posterior is 40%; when you take the shared prior, update on your private info plus the fact that my private info is consistent with 40%, and announce a posterior of 60% A, then I update to 48% A, which is what happens when I further condition on knowing that your private info is consistent with that update, and so on. Like, we both have to be manipulating functions on the whole shared prior for every update!]

For what it's worth, I think both styles are pretty useful in the appropriate context. [I am moderately confident this is a situation where it's worth doing the 'grounded-in-reality' particle-filtering approach, i.e. hitting the 'be concrete' and 'be specific' buttons over and over, and then once you've built out one hypothesis doing it again with new samples.]

The thing that I am confused by is the notion that you should always have a mainline, especially about something as complicated and uncertain as the future.

I don't think I believe the 'should always have a mainline' thing, but I do think I want to defend the weaker claim of "it's worth having a mainline about this." Like, I think if you're starting a startup, it's really helpful to have a 'mainline plan' wherein the whole thing actually works, even if you ascribe basically no probability to it going 'exactly to plan'. Plans are useless, planning is indispensable.

[Also I think it's neat that there's a symmetry here about complaining about the uncertainty of the future, which makes sense if we're both trying to hold onto different pieces of Bayesianism while looking at the same problem.]

Replies from: rohinmshah
comment by Rohin Shah (rohinmshah) · 2022-03-03T11:04:25.587Z · LW(p) · GW(p)

If you define "mainline" as "particle with plurality weight", then I think I was in fact "talking on my mainline" at some points during the conversation, and basically everywhere that I was talking about worlds (instead of specific technical points or intuition pumps) I was talking about "one of my top 10 particles".

I think I responded to every request for concreteness with a fairly concrete answer. Feel free to ask me for more concreteness in any particular story I told during the conversation.

comment by Vaniver · 2022-03-03T02:57:12.521Z · LW(p) · GW(p)

I'm just using this as an intuition pump for the listener to establish that a sufficiently powerful oversight process would solve AI alignment.

Huh, I guess I don't believe the intuition pump? Like, as the first counterexample that comes to mind, when I imagine having an AGI where I can tell everything about how it's thinking, and yet I remain a black box to myself, I can't really tell whether or not it's aligned to me. (Is me-now the one that I want it to be aligned to, or me-across-time? Which side of my internal conflicts about A vs. B / which principle for resolving such conflicts?)

I can of course imagine a reasonable response to that from you--"ah, resolving philosophical difficulties is the user's problem, and not one of the things that I mean by alignment"--but I think I have some more-obviously-alignment-related counterexamples. [Tho if by 'infinite oversight ability' you do mean something like 'logical omniscience' it does become pretty difficult to find a real counterexample, in part because I can just find the future trajectory with highest expected utility and take the action I take at the start of that trajectory without having to have any sort of understanding about why that action was predictably a good idea.]

But like, the thing this reminds me of is something like extrapolating tangents, instead of operating the production function? "If we had an infinitely good engine, we could make the perfect car", which seems sensible when you're used to thinking of engine improvements linearly increasing car quality and doesn't seem sensible when you're used to thinking of car quality as a product of sigmoids of the input variables.

(This is a long response to a short section because I think the disagreement here is about something like "how should we reason and communicate about intuitions?", and so it's worth expanding on what I think might be the implications of otherwise minor disagreements.)

Replies from: rohinmshah
comment by Rohin Shah (rohinmshah) · 2022-03-03T11:16:08.257Z · LW(p) · GW(p)

I can of course imagine a reasonable response to that from you--"ah, resolving philosophical difficulties is the user's problem, and not one of the things that I mean by alignment"

That is in fact my response. (Though one of the ways in which the intuition pump isn't fully compelling to me is that, even after understanding the exact program that the AGI implements and its causal history, maybe the overseers can't correctly predict the consequences of running that program for a long time. Still feels like they'd do fine.)

I do agree that if you go as far as "logical omniscience" then there are "cheating" ways of solving the problem that don't really tell us much about how hard alignment is in practice.

But like, the thing this reminds me of is something like extrapolating tangents, instead of operating the production function? "If we had an infinitely good engine, we could make the perfect car", which seems sensible when you're used to thinking of engine improvements linearly increasing car quality and doesn't seem sensible when you're used to thinking of car quality as a product of sigmoids of the input variables.

The car analogy just doesn't seem sensible. I can tell stories of car doom even if you have infinitely good engines (e.g. the steering breaks). My point is that we struggle to tell stories of doom when imagining a very powerful oversight process that knows everything the model knows.

I'm not thinking "more oversight quality --> more alignment" and then concluding "infinite oversight quality --> alignment solved". I'm starting with the intuition pump, noticing I can no longer tell a good story of doom, and concluding "infinite oversight quality --> alignment solved". So I don't think this has much to do with extrapolating tangents vs. production functions, except inasmuch as production functions encourage you to think about complements to your inputs that you can then posit don't exist in order to tell a story of doom.

Replies from: Vaniver
comment by Vaniver · 2022-03-04T06:17:19.194Z · LW(p) · GW(p)

I'm starting with the intuition pump, noticing I can no longer tell a good story of doom, and concluding "infinite oversight quality --> alignment solved".

I think some of my more alignment-flavored counterexamples look like:

  • The 'reengineer it to be safe' step breaks down / isn't implemented thru oversight. Like, if we're positing we spin up a whole Great Reflection to evaluate every action the AI takes, this seems like it's probably not going to be competitive!
  • The oversight gives us as much info as we ask for, but the world is a siren world (like what Stuart points to [LW · GW], but a little different), where the initial information we discover about the plans from oversight is so convincing that we decide to go ahead with the AI before discovering the gotchas.
  • Related to the previous point, the oversight is sufficient to reveal features about the plan that are terrible, but before the 'reengineer to make it more safe' plan is executed, the code is stolen and executed by a subset of humanity which thinks the terrible plan is 'good enough', for them at least.

That is, it feels to me like we benefit a lot from having 1) a constructive approach to alignment instead of rejection sampling, 2) sufficient security focus that we don't proceed on EV of known information, but actually do the 'due diligence', and 3) sufficient coordination among humans that we don't leave behind substantial swaths of current human preferences, and I don't see how we get those thru having arbitrary transparency.

[I also would like to solve the problem of "AI has good outcomes" instead of the smaller problem of "AI isn't out to get us", because accidental deaths are deaths too! But I do think it makes sense to focus on that capability problem separately, at least sometimes.]

Replies from: rohinmshah
comment by Rohin Shah (rohinmshah) · 2022-03-06T13:20:52.969Z · LW(p) · GW(p)

I obviously do not think this is at all competitive, and I also wanted to ignore the "other people steal your code" case. I am confused what you think I was trying to do with that intuition pump.

I guess I said "powerful oversight would solve alignment" which could be construed to mean that powerful oversight => great future, in which case I'd change it to "powerful oversight would deal with the particular technical problems that we call outer and inner alignment", but was it really so non-obvious that I was talking about the technical problems?

Maybe your point is that there are lots of things required for a good future, just as a car needs both steering and an engine, and so the intuition pump is not interesting because it doesn't talk about all the things needed for a good future? If so, I totally agree that it does not in fact include all the things needed for a good future, and it was not meant to be saying that.

where the initial information we discover about the plans from oversight is so convincing that we decide to go ahead with the AI before discovering the gotchas.

This just doesn't seem plausible to me. Where did the information come from? Did the AI system optimize the information to be convincing? If yes, why didn't we notice that the AI system was doing that? Can we solve this by ensuring that we do due diligence, even if it doesn't seem necessary?

Replies from: Vaniver
comment by Vaniver · 2022-03-06T16:05:55.992Z · LW(p) · GW(p)

I am confused what you think I was trying to do with that intuition pump.

I think I'm confused about the intuition pump too! Like, here's some options I thought up:

  • The 'alignment problem' is really the 'not enough oversight' problem. [But then if we solve the 'enough oversight' problem, we still have to solve the 'what we want' problem, the 'coordination' problem, the 'construct competitively' problem, etc.]
  • Bits of the alignment problem can be traded off against each other, most obviously coordination and 'alignment tax' (i.e. the additional amount of work you need to do to make a system aligned, or the opposite of 'competitiveness', which I didn't want to use here for ease-of-understanding-by-newbies reasons.) [But it's basically just coordination and competitiveness; like, you could imagine that oversight gives you a rejection sampling story for trading off time and understanding but I think this is basically not true because you're also optimizing for finding holes in your transparency regime.]

Like, by analogy, I could imagine someone who uses an intuition pump of "if you had sufficient money, you could solve any problem", but I wouldn't use that intuition pump because I don't believe it. [Sure, 'by definition' if the amount of money doesn't solve the problem, it's not sufficient. But why are we implicitly positing that there exists a sufficient amount of money instead of thinking about what money cannot buy [LW · GW]?]

(After reading the rest of your comment, it seems pretty clear to me that you mean the first bullet, as you say here:)

in which case I'd change it to "powerful oversight would deal with the particular technical problems that we call outer and inner alignment", but was it really so non-obvious that I was talking about the technical problems

I both 1) didn't think it was obvious (sorry if I'm being slow on following the change in usage of 'alignment' here) and 2) don't think realistically powerful oversight solves either of those two on its own (outer alignment because of "rejection sampling can get you siren worlds" problem, inner alignment because "rejection sampling isn't competitive", but I find that one not very compelling and suspect I'll eventually develop a better objection). 

[EDIT: I note that I also might be doing another unfavorable assumption here, where I'm assuming "unlimited oversight capacity" is something like "perfect transparency", and so we might not choose to spend all of our oversight capacity, but you might be including things here like "actually it takes no time to understand what the model is doing" or "the oversight capacity is of humans too," which I think weakens the outer alignment objection pretty substantially.]

If so, I totally agree that it does not in fact include all the things needed for a good future, and it was not meant to be saying that.

Cool! I'm glad we agree on that, and will try to do more "did you mean limited statement X that we more agree about?" in the future.

This just doesn't seem plausible to me. Where did the information come from? Did the AI system optimize the information to be convincing? If yes, why didn't we notice that the AI system was doing that? Can we solve this by ensuring that we do due diligence, even if it doesn't seem necessary?

It came from where we decided to look. While I think it's possible you can have an AI out to deceive us, by putting information we want to see where we're going to look and information we don't want to see where we're not going to look, I think this is going to happen by default because the human operators will have a smaller checklist than they should have: "Will the AI cure cancer? Yes? Cool, press the button." instead of "Will the AI cure cancer? Yes? Cool. Will it preserve our ability to generate more AIs in the future to solve additional problems? No? Hmm, let's take a look at that."

Like, this is the sort of normal software development story where bugs that cause the system to visibly not work get noticed and fixed, while bugs that cause the system to do things the programmers don't intend only get noticed if the programmers anticipated them and wrote a test, or a user discovered one in action and reported it to the programmers, or an adversary discovered that it was possible by reading the code / experimenting with the system and deliberately caused it to happen.

Replies from: rohinmshah
comment by Rohin Shah (rohinmshah) · 2022-03-07T13:55:27.103Z · LW(p) · GW(p)

I mean, maybe we should just drop this point about the intuition pump, it was a throwaway reference in the original comment. I normally use it to argue against a specific mentality I sometimes see in people, and I guess it doesn't make sense outside of that context.

(The mentality is "it doesn't matter what oversight process you use, there's always a malicious superintelligence that can game it, therefore everyone dies".)

comment by Vaniver · 2022-03-01T20:58:12.540Z · LW(p) · GW(p)

The most recent post [? · GW] has a related exchange between Eliezer and Rohin:

Eliezer: I think the critical insight - though it has a format that basically nobody except me ever visibly invokes in those terms, and I worry maybe it can only be taught by a kind of life experience that's very hard to obtain - is the realization that any consistent reasonable story about underlying mechanisms will give you less optimistic forecasts than the ones you get by freely combining surface desiderata

Rohin: Yeah, I think I do not in fact understand why that is true for any consistent reasonable story.

If I'm being locally nitpicky, I argue that Eliezer's thing is a very mild overstatement (it should be "≤" instead of "<"); but given that we're talking about forecasts, we're talking about uncertainty, so we should expect "less" optimism rather than just "not more" optimism, and I think Eliezer's statement stands as a general principle about engineering design.
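
One way to cash out the "≤" version (my gloss, not anything stated in the dialogue): any coherent joint story has to price the conjunction of the desiderata, so

```latex
P\left(\bigwedge_{i=1}^{n} X_i\right) \;\le\; \min_{i} P(X_i)
```

whereas freely combining surface desiderata implicitly grants each one its most favorable probability without ever checking what the conjunction costs, so the consistent story can only ever be equally or less optimistic.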

This also feels to me like the sort of thing that I somehow want to direct attention towards. Either this principle is right and relevant (and it would be good for the field if all the AI safety thinkers held it!), or there's some deep confusion of mine that I'd like cleared up.

Replies from: a.mensch, rohinmshah
comment by A. Mensch (a.mensch) · 2022-03-02T21:02:31.209Z · LW(p) · GW(p)

Question to Eliezer: would you agree with the gist of the following? And if not, any thoughts on what led to a strong sense of 'coherence in your worldview', as Vaniver put it?

Vaniver, I feel like you're pointing at something that I've noticed as well and am interested in too (the coherence of Eliezer's worldview, as you put it). I wonder if it has something to do with not going to uni but building his whole worldview all by himself. In my experience uni often tends towards cramming lots of facts which are easily testable on exams, with less emphasis on understanding underlying principles [LW · GW] (which is harder to test with multiple choice questions). Personally I feel like I had to spend my years after uni trying to make sense, a coherent whole if you like, of all the separate things I learned while in uni, where things were mostly just kind of put out there without constantly being integrated. Perhaps if you start out thinking much more about underlying principles earlier on, it's easier to integrate all the separate facts into a coherent whole as you go along. Not sure if Eliezer would agree with this. Maybe it's even much more basic, and he just always had a very strong sense of dissatisfaction if he couldn't make things cohere into a whole, and this urge for things to make sense was much more important than self-studying or thinking about underlying principles before and then during the learning of new knowledge...

I would like to point out a section in the latest Shah/Yudkowsky dialogue where Eliezer says some things about this topic; does this feel like it's the same thing you are talking about, Vaniver?

Eliezer: "So I have a general thesis about a failure mode here which is that, the moment you try to sketch any concrete plan or events which correspond to the abstract descriptions, it is much more obviously wrong, and that is why the descriptions stay so abstract in the mouths of everybody who sounds more optimistic than I am.

This may, perhaps, be confounded by the phenomenon where I am one of the last living descendants of the lineage that ever knew how to say anything concrete at all.  Richard Feynman - or so I would now say in retrospect - is noticing concreteness dying out of the world, and being worried about that, at the point where he goes to a college and hears a professor talking about "essential objects" in class, and Feynman asks "Is a brick an essential object?" - meaning to work up to the notion of the inside of a brick, which can't be observed because breaking a brick in half just gives you two new exterior surfaces - and everybody in the classroom has a different notion of what it would mean for a brick to be an essential object. 

Richard Feynman knew to try plugging in bricks as a special case, but the people in the classroom didn't, and I think the mental motion has died out of the world even further since Feynman wrote about it.  The loss has spread to STEM as well.  Though if you don't read old books and papers and contrast them to new books and papers, you wouldn't see it, and maybe most of the people who'll eventually read this will have no idea what I'm talking about because they've never seen it any other way...

I have a thesis about how optimism over AGI works.  It goes like this: People use really abstract descriptions and never imagine anything sufficiently concrete, and this lets the abstract properties waver around ambiguously and inconsistently to give the desired final conclusions of the argument.  So MIRI is the only voice that gives concrete examples and also by far the most pessimistic voice; if you go around fully specifying things, you can see that what gives you a good property in one place gives you a bad property someplace else, you see that you can't get all the properties you want simultaneously."

comment by Rohin Shah (rohinmshah) · 2022-03-02T13:15:17.341Z · LW(p) · GW(p)

Note that my first response was:

(For the reader, I don't think that "arguments about what you're selecting for" is the same thing as "freely combining surface desiderata", though I do expect they look approximately the same to Eliezer)

and my immediately preceding message was

I actually think something like this might be a crux for me, though obviously I wouldn't put it the way you're putting it. More like "are arguments about internal mechanisms more or less trustworthy than arguments about what you're selecting for" (limiting to arguments we actually have access to, of course in the limit of perfect knowledge internal mechanisms beats selection). But that is I think a discussion for another day.

I think I was responding to the version of the argument where "freely combining surface desiderata" was swapped out with "arguments about what you're selecting for". I probably should have noted that I agreed with the basic abstract point as Eliezer stated it; I just don't think it's very relevant to the actual disagreement.

I think my complaints in the context of the discussion are:

  • It's a very weak statement. If you freely combine the most optimistic surface desiderata, you get ~0% chance of doom. My estimate is way higher (in odds-space) than ~0%, and the statement "p(doom) >= ~0%" is not that interesting and not a justification of "doom is near-inevitable".
  • Relatedly, I am not just "freely combining surface desiderata". I am doing something like "predicting what properties AI systems would have by reasoning about what properties we selected for during training". I think you could reasonably ask how that compares against "predicting what properties AI systems would have by reasoning about what mechanistic algorithms could produce the behavior we observed during training". I was under the impression that this was what Eliezer was pointing at (because that's how I framed it in the message immediately prior to the one you quoted) but I'm less confident of that now.
Replies from: Vaniver
comment by Vaniver · 2022-03-02T14:36:17.223Z · LW(p) · GW(p)

Sorry, I probably should have been more clear about the "this is a quote from a longer dialogue, the missing context is important." I do think that the disagreement about "how relevant is this to 'actual disagreement'?" is basically the live thing, not whether or not you agree with the basic abstract point.

My current sense is that you're right that the thing you're doing is more specific than the general case (and one of the ways you can tell is the line of argumentation you give about chance of doom), and also Eliezer can still be correctly observing that you have too many free parameters (even if the number of free parameters is two instead of arbitrarily large). I think arguments about what you're selecting for either cash out in mechanistic algorithms, or they can deceive you in this particular way.

Or, to put this somewhat differently, in my view the basic abstract point implies that having one extra free parameter allows you to believe in a 5% chance of doom when in fact there's 100% chance of doom, and so in order to get estimations like that right this needs to be one of the basic principles shaping your thoughts, tho ofc your prior should come from many examples instead of one specific counterexample.

To use an example that makes me look bad, there was a time when I didn't believe Arrow's Impossibility Theorem because I was using the 'freely combine surface desiderata' strategy. The comment that snapped me out of it [LW(p) · GW(p)] involved having to actually write out the whole voting rule, and see that I couldn't instantiate the thing I thought I could instantiate.
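
For concreteness, here is the simplest version of what "actually writing out the whole voting rule" looks like (this is just the classic Condorcet-cycle check, not the specific construction from that linked comment): pairwise majority looks like it satisfies everything you want until you instantiate it on a concrete profile.

```python
from itertools import combinations

ballots = [
    ["A", "B", "C"],   # voter 1: A > B > C
    ["B", "C", "A"],   # voter 2: B > C > A
    ["C", "A", "B"],   # voter 3: C > A > B
]

def majority_prefers(x, y):
    """True if a strict majority ranks x above y."""
    return sum(b.index(x) < b.index(y) for b in ballots) > len(ballots) / 2

for x, y in combinations("ABC", 2):
    winner, loser = (x, y) if majority_prefers(x, y) else (y, x)
    print(f"majority prefers {winner} to {loser}")
# Prints that A beats B, C beats A, and B beats C: a cycle, so the rule
# yields no consistent group ranking on this profile.
```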

As a more AI-flavored example, I was talking last night with Alex [LW · GW] about ELK, specifically trying to estimate the relative population of honest reporters and dishonest reporters in the prior implied by the neural tangent kernel model, and he observed that if you had a constructive approach of generating initializations that only contained honest reporters, that might basically solve the ELK problem; after thinking about it for a bit I said "huh, that seems right but I'm not sure it's possible to do that, because maybe any way to compose an honest reporter out of parts gives you all of the parts you need to compose a dishonest reporter."

Replies from: rohinmshah
comment by Rohin Shah (rohinmshah) · 2022-03-02T17:03:57.057Z · LW(p) · GW(p)

Or, to put this somewhat differently, in my view the basic abstract point implies that having one extra free parameter allows you to believe in a 5% chance of doom when in fact there's 100% chance of doom, and so in order to get estimations like that right this needs to be one of the basic principles shaping your thoughts, tho ofc your prior should come from many examples instead of one specific counterexample.

I agree that if you have a choice about whether to have more or fewer free parameters, all else equal you should prefer the model with fewer free parameters. (Obviously, all else is not equal; in particular I do not think that Eliezer's model is tracking reality as well as mine.)

When Alice uses a model with more free parameters, you need to posit a bias before you can predict a systematic direction in which Alice will make mistakes. So this only bites you if you have a bias towards optimism. I know Eliezer thinks I have such a bias. I disagree with him.

I think arguments about what you're selecting for either cash out in mechanistic algorithms, or they can deceive you in this particular way.

I agree that this is true in some platonic sense. Either the argument gives me a correct answer, in which case I have true statements that could be cashed out in terms of mechanistic algorithms, or the argument gives me a wrong answer, in which case it wouldn't be derivable from mechanistic algorithms, because the mechanistic algorithms are the "ground truth". 

Quoting myself from the dialogue:

(limiting to arguments we actually have access to, of course in the limit of perfect knowledge internal mechanisms beats selection)

Replies from: Vaniver
comment by Vaniver · 2022-03-03T02:09:41.997Z · LW(p) · GW(p)

When Alice uses a model with more free parameters, you need to posit a bias before you can predict a systematic direction in which Alice will make mistakes. So this only bites you if you have a bias towards optimism.

That is, when I give Optimistic Alice fewer constraints, she can more easily imagine a solution, and when I give Pessimistic Bob fewer constraints, he can more easily imagine that no solution is possible? I think... this feels true as a matter of human psychology of problem-solving, or something, and not as a matter of math. Like, the way Bob fails to find a solution mostly looks like "not actually considering the space", or "wasting consideration on easily-known-bad parts of the space", and more constraints could help with both of those. But, as math, removing constraints can't lower the volume of the implied space and so can't make it less likely that a viable solution exists.

I know Eliezer thinks I have such a bias. I disagree with him.

I think Eliezer thinks nearly all humans have such a bias by default, and so without clear evidence to the contrary it's a reasonable suspicion for anyone.

[I think there's a thing Eliezer does a lot, which I have mixed feelings about, which is matching people's statements to patterns and then responding to the generator of the pattern in Eliezer's head, which only sometimes corresponds to the generator in the other person's head.]

I agree that this is true in some platonic sense.

Cool, makes sense. [I continue to think we disagree about how true this is in a practical sense, where I read you as thinking "yeah, this is a minor consideration, we have to think with the tools we have access to, which could be wrong in either direction and so are useful as a point estimate" and me as thinking "huh, this really seems like the tools we have access to are going to give us overly optimistic answers, and we should focus more on how to get tools that will give us more robust answers."]

Replies from: Raemon, rohinmshah
comment by Raemon · 2022-03-03T02:37:21.139Z · LW(p) · GW(p)

[I think there's a thing Eliezer does a lot, which I have mixed feelings about, which is matching people's statements to patterns and then responding to the generator of the pattern in Eliezer's head, which only sometimes corresponds to the generator in the other person's head.]

I want to add an additional meta-pattern – there was once a person who thought I had a particular bias. They'd go around telling me "Ray, you're exhibiting that bias right now. Whatever rationalization you're coming up with right now, it's not the real reason you're arguing X." And I was like "c'mon man. I have a ton of introspective access to myself and I can tell that this 'rationalization' is actually a pretty good reason to believe X and I trust that my reasoning process is real."

But... eventually I realized I just actually had two motivations going on. When I introspected, I was running a check for a positive result on "is Ray displaying rational thought?". When they extrospected me (i.e. reading my facial expressions), they were checking for a positive result on "does Ray seem biased in this particular way?".

And both checks totally returned 'true', and that was an accurate assessment. 

In the particular moment where I noticed this metapattern, I'd say my cognition was, say, 65% "good argumentation", 15% "one particular bias", and 20% "other random stuff." On a different day, it might have been that I was 65% exhibiting the bias and only 15% good argumentation.

None of this is making much claim of what's likely to be going on in Rohin's head or Eliezer's head or whether Eliezer's conversational pattern is useful, but wanted to flag it as a way people could be talking past each other.

comment by Rohin Shah (rohinmshah) · 2022-03-03T07:19:09.122Z · LW(p) · GW(p)

I think... this feels true as a matter of human psychology of problem-solving, or something, and not as a matter of math.

I think we're imagining different toy mathematical models.

Your model, according to me:

  1. There is a space of possible approaches, that we are searching over to find a solution. (E.g. the space of all possible programs.)
  2. We put a layer of abstraction on top of this space, characterizing approaches by N different "features" (e.g. "is it goal-directed", "is it an oracle", "is it capable of destroying the world")
  3. Because we're bounded agents, we then treat the features as independent, and search for some combination of features that would comprise a solution.

I agree that this procedure has a systematic error in claiming that there is a solution when none exists (and doesn't have the opposite error), and that if this were an accurate model of how I was reasoning I should be way more worried about correcting for that problem.

My model:

  1. There is a probability distribution over "ways the world could be".
  2. We put a layer of abstraction on top of this space, characterizing "ways the world could be" by N different "features" (e.g. "can you get human-level intelligence out of a pile of heuristics", "what are the returns to specialization", "how different will AI ontologies be from human ontologies"). We estimate the marginal probability of each of those features.
  3. Because we're bounded agents, when we need the joint probability of two or more features, we treat them as independent and just multiply.
  4. Given a proposed solution, we estimate its probability of working by identifying which features need to be true of the world for the solution to work, and then estimate the probability of those features (using the method above).

I claim that this procedure doesn't have a systematic error in the direction of optimism (at least until you add some additional details), and that this procedure more accurately reflects the sort of reasoning that I am doing.
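A minimal sketch of that procedure in code, with invented feature names and probabilities (nothing below is meant as the actual model; it only makes the independence step concrete):

```python
# Toy version of the procedure above: estimate marginal probabilities for
# "features" of the world, then treat them as independent whenever a joint
# probability is needed. All feature names and numbers are invented.

feature_marginals = {
    "pile_of_heuristics_suffices": 0.4,
    "ai_ontology_close_to_human": 0.3,
    "oversight_scales_with_capability": 0.5,
}

def joint_probability(features, marginals):
    """Step 3: the bounded-agent approximation -- multiply the marginals."""
    p = 1.0
    for f in features:
        p *= marginals[f]
    return p

def p_solution_works(required_features, marginals):
    """Step 4: a proposed solution works iff the features it relies on hold."""
    return joint_probability(required_features, marginals)

print(p_solution_works(
    ["ai_ontology_close_to_human", "oversight_scales_with_capability"],
    feature_marginals,
))  # 0.15 under independence; the true joint could be higher or lower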

Replies from: Vaniver
comment by Vaniver · 2022-03-04T06:03:37.235Z · LW(p) · GW(p)

Huh, why doesn't that procedure have that systematic error?

Like, when I try to naively run your steps 1-4 on "probability of there existing a number that's both even and odd", I get that about 25% of numbers should be both even and odd, so it seems pretty likely that it'll work out given that there are at least 4 numbers. But I can't easily construct an argument at a similar level of sophistication that gives me an underestimate. [Like, "probability of there existing a number that's both odd and prime" gives the wrong conclusion if you buy that the probability that a natural number is prime is 0, but this is because you evaluated your limits in the wrong order, not because of a problem with dropping all the covariance data from your joint distribution.]
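Writing out the naive arithmetic being gestured at here (the 0.5 marginals and the independence assumption are the whole point of the toy example):

```python
# Naive "multiply marginals, then search" estimate for
# "there exists a number that is both even and odd".
p_even, p_odd = 0.5, 0.5
p_both = p_even * p_odd          # 0.25 per number, treating the features as independent

def p_exists(p_single, n):
    # probability that at least one of n "independent draws" satisfies both features
    return 1 - (1 - p_single) ** n

print(p_exists(p_both, 4))       # ~0.68 with just four numbers
print(p_exists(p_both, 100))     # ~1.0 -- wildly overestimating the true answer, which is 0
```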

My first guess is that you think I'm doing the "ways the world could be" thing wrong--like, I'm looking at predicates over numbers and trying to evaluate a predicate over all numbers, but instead I should just have a probability on "universe contains a number that is both even and odd" and its complement, as those are the two relevant ways the world can be. 

My second guess is that you've got a different distribution over target predicates; like, we can just take the complement of my overestimate ("probability of there existing no numbers that are both even and odd") and call it an underestimate. But I think I'm more interested in 'overestimating existence' than 'underestimating non-existence'. [Is this an example of the 'additional details' you're talking about?]

Also maybe you can just exhibit a simple example that has an underestimate, and then we need to think harder about how likely overestimates and underestimates are to see if there's a net bias.

Replies from: rohinmshah
comment by Rohin Shah (rohinmshah) · 2022-03-06T13:57:20.231Z · LW(p) · GW(p)

It's the first guess.

I think if you have a particular number then I'm like "yup, it's fair to notice that we overestimate the probability that x is even and odd by saying it's 25%", and then I'd say "notice that we underestimate the probability that x is even and divisible by 4 by saying it's 12.5%".

I agree that if you estimate a probability, and then "perform search" / "optimize" / "run n copies of the estimate" (so that you estimate the probability as 1 - (1 - P(event))^n), then you're going to have systematic errors.

I don't think I'm doing anything that's analogous to that. I definitely don't go around thinking "well, it seems 10% likely that such and such feature of the world holds, and so each alignment scheme I think of that depends on this feature has a 10% chance of working, therefore if I think of 10 alignment schemes I've solved the problem". (I suspect this is not the sort of mistake you imagine me doing but I don't think I know what you do imagine me doing.)
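As a purely arithmetical illustration of both halves of this (not a claim about anyone's actual reasoning): the sign of the error from multiplying marginals depends on how the features covary, while the systematic overestimate only shows up once you layer a search on top of a fixed marginal estimate.

```python
from fractions import Fraction

# Marginals over natural numbers, as limiting densities
p_even = Fraction(1, 2)
p_odd  = Fraction(1, 2)
p_div4 = Fraction(1, 4)

# Independence estimate vs. true joint, for the two examples above
print(p_even * p_odd,  "vs true", 0)               # 1/4 vs 0   -> overestimate (mutually exclusive features)
print(p_even * p_div4, "vs true", Fraction(1, 4))  # 1/8 vs 1/4 -> underestimate (positively correlated features)

# The failure mode being disclaimed here: take one marginal estimate and then
# "run a search" over n schemes imagined as independent.
p_feature = 0.1
n_schemes = 10
print(1 - (1 - p_feature) ** n_schemes)            # ~0.65, even though every scheme relies on the
                                                   # *same* uncertain feature, so the true probability
                                                   # of at least one working is still ~0.1
```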

Replies from: Vaniver
comment by Vaniver · 2022-03-06T15:39:20.058Z · LW(p) · GW(p)

I'd say "notice that we underestimate the probability that x is even and divisible by 4 by saying it's 12.5%".

Cool, I like this example.

I agree that if you estimate a probability, and then "perform search" / "optimize" / "run n copies of the estimate" (so that you estimate the probability as 1 - (1 - P(event))^n), then you're going to have systematic errors.
...
I suspect this is not the sort of mistake you imagine me doing but I don't think I know what you do imagine me doing.

I think the thing I'm interested in is "what are our estimates of the output of search processes?". The question we're ultimately trying to answer with a model here is something like "are humans, when they consider a problem that could have attempted solutions of many different forms, overly optimistic about how solvable those problems are because they hypothesize a solution with inconsistent features?"

The example of "a number divisible by 2 and a number divisible by 4" is a case where the consistency of your solution helps you--anything that satisfies the second condition already satisfies the first. But importantly, the best you can do here is ignore superfluous conditions; they can't increase the volume of the solution space. I think this is where the systematic bias is coming from (the joint probability of two conditions can't be higher than the minimum of the two individual probabilities, while it can fall all the way to zero, and so the product of the marginals isn't an unbiased estimator of the joint).

 

For example, consider this recent analysis of cultured meat, which seems to me to point out a fundamental inconsistency of this type in people's plans for creating cultured meat. Basically, the bigger you make a bioreactor, the better it looks on criteria ABC, and the smaller you make a bioreactor, the better it looks on criteria DEF, and projections seem to suggest that massive progress will be made on all of those criteria simultaneously because progress can be made on them individually. But this necessitates making bioreactors that are simultaneously much bigger and much smaller!

[Sometimes this is possible, because actually one is based on volume and the other is based on surface area, and so when you make something like a zeolite you can combine massive surface area with tiny volume. But if you need massive volume and tiny surface area, that's not possible. Anyway, in this case, my read is that both of these are based off of volume, and so there's no clever technique like that available.]
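The geometric asymmetry can be checked directly, assuming the relevant criteria really do scale like bulk volume and bounding surface area (which is the contested empirical claim for bioreactors): for a given volume, a sphere already minimizes surface area, so surface area has a floor that grows with volume, while there is no ceiling (you can always add internal structure, as in a zeolite).

```python
import math

def min_surface_area(volume):
    """Smallest possible surface area enclosing a given volume (attained by a sphere)."""
    return (36 * math.pi) ** (1 / 3) * volume ** (2 / 3)

for v in [1, 1_000, 1_000_000]:          # arbitrary volume units
    print(v, round(min_surface_area(v), 1))
# The floor grows as V^(2/3): large volume forces large surface area,
# but tiny volume is compatible with arbitrarily large surface area.
```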

 

Maybe you could step me thru how your procedure works for estimating the viability of cultured meat, or the possibility of constructing a room temperature <10 atm superconductor, or something? 

It seems to me like there's a version of your procedure which, like, considers all of the different possible factory designs, applies some functions to determine the high-level features of those designs (like profitability, amount of platinum they consume, etc.), and then when we want to know "is there a profitable cultured meat factory?" responds with "conditioning on profitability > 0, this is the set of possible designs." And then when I ask "is there a profitable cultured meat factory using less than 1% of the platinum available on Earth?" says "sorry, that query is too difficult; I calculated the set of possible designs conditioned on profitability, calculated the set of possible designs conditioned on using less than 1% of the platinum available on Earth, and then <multiplied sets together> to give you this approximate answer."

But of course that's not what you're doing, because the boundedness prevents you from considering all the different possible factory designs. So instead you have, like, clusters of factory designs in your map? But what are those objects, and how do they work, and why don't they have the problem of not noticing inconsistencies because they didn't fully populate the details? [Or if they did fully populate the details for some limited number of considered objects, how do you back out the implied probability distribution over the non-considered objects in a way that isn't subject to this?]

Replies from: rohinmshah
comment by Rohin Shah (rohinmshah) · 2022-03-07T13:38:59.028Z · LW(p) · GW(p)

Re: cultured meat example: If you give me examples in which you know the features are actually inconsistent, my method is going to look optimistic when it doesn't know about that inconsistency. So yeah, assuming your description of the cultured meat example is correct, my toy model would reproduce that problem.

To give a different example, consider OpenAI Five. One would think that to beat Dota, you need to have an algorithm that allows you to do hierarchical planning, state estimation from partial observability, coordination with team members, understanding of causality, compression of the giant action space, etc. Everyone looked at this giant list of necessary features and thought "it's highly improbable for an algorithm to demonstrate all of these features". My understanding is that even OpenAI, the most optimistic of everyone, thought they would need to do some sort of hierarchical RL to get this to work. In the end, it turned out that vanilla PPO with reward shaping and domain randomization was enough. It turns out that all of these many different capabilities / features were very consistent with each other and easier to achieve simultaneously than we thought.

so the product isn't an unbiased estimator of the joint

Tbc, I don't want to claim "unbiased estimator" in the mathematical sense of the phrase. To even make such a claim you need to choose some underlying probability distribution which gives rise to our features, which we don't have. I'm more saying that the direction of the bias depends on whether your features are positively vs. negatively correlated with each other and so a priori I don't expect the bias to be in a predictable direction.

But what are those objects, and how do they work, and why don't they have the problem of not noticing inconsistencies because they didn't fully populate the details?

They definitely have that problem. I'm not sure how you don't have that problem; you're always going to have some amount of abstraction and some amount of inconsistency; the future is hard to predict for bounded humans, and you can't "fully populate the details" as an embedded agent.

If you're asking how you notice any inconsistencies at all (rather than all of the inconsistences), then my answer is that you do in fact try to populate details sometimes, and that can demonstrate inconsistencies (and consistencies).

I can sketch out a more concrete, imagined-in-hindsight-and-therefore-false story of what's happening.

Most of the "objects" are questions about the future to which there are multiple possible answers, which you have a probability distribution over (you can think of this as a factor in a Finite Factored Set, with an associated probability distribution over the answers). For example, you could imagine a question for "number of AGI orgs with a shot at time X", "fraction of people who agree alignment is a problem", "amount of optimization pressure needed to avoid deception", etc. If you provide answers to some subset of questions, that gives you an incomplete possible world (which you could imagine as an implicitly-represented set of possible worlds if you want). Given an incomplete possible world, to answer a new question quickly you reason abstractly from the answers you are conditioning on to get an answer to the new question.

When you have lots of time, you can improve your reasoning in many different ways:

  1. You can find other factors that seem important, add them in, subdividing worlds out even further. 
  2. You can take two factors, and think about how compatible they are with each other, building intuitions about their joint (rather than just their marginal probabilities, which is what you have by default).
  3. You can take some incomplete possible world, sketch out lots of additional concrete details, and see if you can spot inconsistencies.
  4. You can refactor your "main factors" to be more independent of each other. For example, maybe you notice that all of your reasoning about things like "<metric> at time X" depends a lot on timelines, and so you instead replace them with factors like "<metric> at X years before crunch time", where they are more independent of timelines.
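A toy rendering of that story in code, with invented factors, answers, and marginals (the point is only the shape of the computation, not the numbers):

```python
import itertools

# Factors: possible answers with marginal probabilities (all invented).
factors = {
    "agi_orgs_with_a_shot": {"few": 0.6, "many": 0.4},
    "alignment_consensus": {"low": 0.5, "medium": 0.3, "high": 0.2},
    "optimization_pressure_to_avoid_deception": {"small": 0.3, "large": 0.7},
}

def p_incomplete_world(partial_answers):
    """Default bounded-agent move: multiply the marginals of the answers being
    conditioned on, ignoring any joint structure you haven't yet built intuitions for."""
    p = 1.0
    for factor, answer in partial_answers.items():
        p *= factors[factor][answer]
    return p

print(p_incomplete_world({"agi_orgs_with_a_shot": "few", "alignment_consensus": "high"}))  # 0.12

def complete_worlds(partial_answers):
    """Improvement 3 in miniature: flesh an incomplete possible world out into
    fully-specified worlds, which is where inconsistencies can get spotted."""
    free = [f for f in factors if f not in partial_answers]
    for combo in itertools.product(*(factors[f] for f in free)):
        yield dict(partial_answers, **dict(zip(free, combo)))

for world in complete_worlds({"alignment_consensus": "high"}):
    print(world)
```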
comment by Richard_Ngo (ricraz) · 2022-03-03T14:23:24.552Z · LW(p) · GW(p)

When I read other people, I often feel like they're operating in a 'narrower segment of their model', or not trying to fit the whole world at once, or something. They often seem to emit sentences that are 'not absurd', instead of 'on their mainline', because they're mostly trying to generate sentences that pass some shallow checks instead of 'coming from their complete mental universe.'

To me it seems like this is what you should expect other people to look like both when other people know less about a domain than you do, and also when you're overconfident about your understanding of that domain. So I don't think it helps distinguish those two cases.

(Also, to me it seems like a similar thing happens, but with the positions reversed, when Paul and Eliezer try to forecast concrete progress in ML over the next decade. Does that seem right to you?)

when Eliezer responded with:

But there's a really really basic lesson here about the different style of "sentences found in political history books" rather than "sentences produced by people imagining ways future politics could handle an issue successfully".

the subject got changed.

I believe this was discussed further at some point - I argued that Eliezer-style political history books also exclude statements like "and then we survived the cold war" or "most countries still don't have nuclear energy".
 

Replies from: Vaniver
comment by Vaniver · 2022-03-04T05:51:24.031Z · LW(p) · GW(p)

Also, to me it seems like a similar thing happens, but with the positions reversed, when Paul and Eliezer try to forecast concrete progress in ML over the next decade. Does that seem right to you?

It feels similar but clearly distinct? Like, in that situation Eliezer often seems to say things that I parse as "I don't have any special knowledge here", which seems like a different thing than "I can't easily sample from my distribution over how things go right", and I also have the sense of Paul being willing to 'go specific' and Eliezer not being willing to 'go specific'.

I believe this was discussed further at some point - I argued that Eliezer-style political history books also exclude statements like "and then we survived the cold war" or "most countries still don't have nuclear energy".

You're thinking of this bit of the conversation [LW · GW], starting with:

[Ngo][18:13]  

I think I'm a little cautious about this line of discussion, because my model doesn't strongly constrain the ways that different groups respond to increasing developments in AI. The main thing I'm confident about is that there will be much clearer responses available to us once we have a better picture of AI development.

(Or maybe a bit earlier and later, but that was my best guess for where to start the context.)

The main quotes from the middle that seems relevant:

[Ngo][18:19, moved two down in log]  

(As a side note, I think that if Eliezer had been around in the 1930s, and you described to him what actually happened with nukes over the next 80 years, he would have called that "insanely optimistic".)

[Yudkowsky][18:21]  

Mmmmmmaybe.  Do note that I tend to be more optimistic than the average human about, say, global warming, or everything in transhumanism outside of AGI.

Nukes have going for them that, in fact, nobody has an incentive to start a global thermonuclear war.  Eliezer is not in fact pessimistic about everything and views his AGI pessimism as generalizing to very few other things, which are not, in fact, as bad as AGI.

[Yudkowsky][18:22]  

But yeah, compared to pre-1946 history, nukes actually kind of did go really surprisingly well!

Like, this planet used to be a huge warring snakepit of Great Powers and Little Powers and then nukes came along and people actually got serious and decided to stop having the largest wars they could fuel.

and ending with:

[Yudkowsky][18:38]  

And Eliezer is capable of being less concerned about things when they are intrinsically less concerning, which is why my history does not, unlike some others in this field, involve me running also being Terribly Concerned about nuclear war, global warming, biotech, and killer drones.

Rereading that section, my sense is that it reads like a sort of mirror of the Eliezer->Paul "I don't know how to operate your view" section; like, Eliezer can say "I think nukes are less worrying for reasons ABC, also you can observe me being not worried about other things-people-are-concerned-by XYZ", but I wouldn't have expected you (or the reader who hasn't picked up Eliezer-thinking from elsewhere) to have been able to come away from that with why you, trying to be Eliezer from the 1930s, would have thought 'and then it turned out okay' would have been a political-history-book-sentence, or with the relative magnitudes of the surprise. [Like, I think my 1930s-Eliezer puts like 3-30% on "and then it turned out okay" for nukes, and my 2020s-Eliezer puts like 0.03-3% on that for AGI? But it'd be nice to hear if Eliezer thinks AGI turning out as well as nukes is like 10x the surprise of nukes turning out this well conditioned on pre-1930s knowledge, or more like 1000x the surprise.]

comment by dxu · 2022-03-01T20:56:51.177Z · LW(p) · GW(p)

This is a very interesting point! I will chip in by pointing out a very similar remark from Rohin just earlier today [LW(p) · GW(p)]:

And I'll reiterate again because I anticipate being misunderstood that this is not a prediction of how the world must be and thus we are obviously safe; it is instead a story that I think is not ruled out by our current understanding and thus one to which I assign non-trivial probability.

That is all.

(Obviously there's a kinda superficial resemblance here to the phenomenon of "calling out" somebody else; I want to state outright that this is not the intention, it's just that I saw your comment right after seeing Rohin's comment, in such a way that my memory of his remark was still salient enough that the connection jumped out at me. Since salient observations tend to fade over time, I wanted to put this down before that happened.)

Replies from: Vaniver, rohinmshah
comment by Vaniver · 2022-03-01T21:22:21.441Z · LW(p) · GW(p)

Yeah, I'm also interested in the question of "how do we distinguish 'sentences-on-mainline' from 'shoring-up-edge-cases'?", or which conversational moves most develop shared knowledge, or something similar. 

Like I think it's often good to point out edge cases, especially when you're trying to formalize an argument or look for designs that get us out of this trap. In another comment in this thread, I note that there's a thing Eliezer said that I think is very important and accurate, and also think there's an edge case that's not obviously handled correctly. 

But also my sense is that there's some deep benefit from "having mainlines" and conversations that are mostly 'sentences-on-mainline'? Or, like, there's some value to more people thinking thru / shooting down their own edge cases (like I do in the mentioned comment), instead of pushing the work to Eliezer. I'm pretty worried that there are deeply general reasons to expect AI alignment to be extremely difficult, that people aren't updating on the meta-level point and continue to attempt 'rolling their own crypto', asking if Eliezer can poke the hole in this new procedure, and that if Eliezer ever decides to just write serial online fiction until the world explodes, humanity won't have developed enough capacity to replace him.

Replies from: rohinmshah, RobbBB
comment by Rohin Shah (rohinmshah) · 2022-03-02T20:21:15.057Z · LW(p) · GW(p)

(For object-level responses, see comments on parallel threads.)

I want to push back on an implicit framing in lines like:

there's some value to more people thinking thru / shooting down their own edge cases [...], instead of pushing the work to Eliezer.

people aren't updating on the meta-level point and continue to attempt 'rolling their own crypto', asking if Eliezer can poke the hole in this new procedure

This makes it sound like the rest of us don't try to break our proposals, push the work to Eliezer, agree with Eliezer when he finds a problem, and then not update that maybe future proposals will have problems.

Whereas in reality, I try to break my proposals, don't agree with Eliezer's diagnoses of the problems, and usually don't ask Eliezer because I don't expect his answer to be useful to me (and previously didn't expect him to respond). I expect this is true of others (like Paul and Richard) as well.

Replies from: Vaniver
comment by Vaniver · 2022-03-03T04:07:09.749Z · LW(p) · GW(p)

Yeah, sorry about not owning that more, and for the frame being muddled. I don't endorse the "asking Eliezer" or "agreeing with Eliezer" bits, but I do basically think he's right about many object-level problems he identifies (and thus people disagreeing with him about that is not a feature) and think 'security mindset' is the right orientation to have towards AGI alignment. That hypothesis is a 'worry' primarily because asymmetric costs mean it's more worth investigating than the raw probability would suggest. [Tho the raw probabilities of its components do feel pretty substantial to me.]

[EDIT: I should say I think ARC's approach to ELK seems like a great example of "people breaking their own proposals". As additional data to update on, I'd be interested in seeing, like, a graph of people's optimism about ELK over time, or something similar.]

comment by Rob Bensinger (RobbBB) · 2022-03-01T22:22:02.880Z · LW(p) · GW(p)

But also my sense is that there's some deep benefit from "having mainlines" and conversations that are mostly 'sentences-on-mainline'?

I agree with this. Or, if you feel ~evenly split between two options, have two mainlines and focus a bunch on those (including picking at cruxes and revising your mainline view over time).

But:

Like, it feels to me like Eliezer was generating sentences on his mainline, and Richard was responding with 'since you're being overly pessimistic, I will be overly optimistic to balance', with no attempt to have his response match his own mainline.

I do note that there are some situations where rushing to tell a 'mainline story' might be the wrong move:

  • Maybe your beliefs feel wildly unstable day-to-day -- because you're learning a lot quickly, or because it's just hard to know how to assign weight to the dozens of different considerations that bear on these questions. Then trying to take a quick snapshot of your current view might feel beside the point.
    • It might even feel actively counterproductive, like rushing too quickly to impose meaning/structure on data when step one is to make sure you have the data properly loaded up in your head.
  • Maybe there are many scenarios that seem similarly likely to you. If you see ten very different ways things could go, each with ~10% subjective probability, then picking a 'mainline' may be hard, and may require a bunch of arbitrary-feeling choices about which similarities-between-scenarios you choose to pay attention to.
comment by Peter Wildeford (peter_hurford) · 2022-03-03T05:36:29.078Z · LW(p) · GW(p)

These conversations are great and I really admire the transparency. It's really nice to see discussions that normally happen in private happen instead in public where everyone can reflect, give feedback, and improve their own thoughts. On the other hand, the conversations combined add up to a decent-sized novel - LW says 198,846 words! Is anyone considering investing heavily in summarizing the content so people can get involved without having to read all of it?

Replies from: RobbBB, Benito, daniel-kokotajlo
comment by Rob Bensinger (RobbBB) · 2022-03-03T19:53:39.165Z · LW(p) · GW(p)

Echoing that I loved these conversations and I'm super grateful to everyone who participated — especially Richard, Paul, Eliezer, Nate, Ajeya, Carl, Rohin, and Jaan, who contributed a lot.

I don't plan to try to summarize the discussions or distill key take-aways myself (other than the extremely cursory job I did on https://intelligence.org/late-2021-miri-conversations/), but I'm very keen on seeing others attempt that, especially as part of a process to figure out their own models and do some evaluative work.

I think I'd rather see partial summaries/responses that go deep, instead of a more exhaustive but shallow summary; and I'd rather see summaries that center the author's own view (what's your personal take-away? what are your objections? which things were small versus large updates? etc.) over something that tries to be maximally objective and impersonal. But all the options seem good to me.

comment by Ben Pace (Benito) · 2022-03-03T07:31:25.147Z · LW(p) · GW(p)

I chatted briefly the other day with Rob Bensinger about me turning them into a little book. My guess is I'd want to do something to compress especially the long Paul/Eliezer bet hashing-out, which felt super long to me and not all worth reading.

Interested in other suggestions for compression.

(This is not a commitment to do this, I probably won't.)

Replies from: Kenoubi, Gyrodiot
comment by Kenoubi · 2022-03-22T16:54:10.282Z · LW(p) · GW(p)

I wish you (or someone) would make a little book of this.

comment by Gyrodiot · 2022-03-05T10:51:17.819Z · LW(p) · GW(p)

The compression idea evokes Kaj Sotala's summary/analysis of the AI-Foom Debate [? · GW] (which I found quite useful at the time). I support the idea, especially given it has taken a while for the participants to settle on things cruxy enough to discuss and so on. Though I would also be interested in "look, these two disagree on that, but look at all the very fundamental things about AI alignment they agree on".

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2022-03-06T15:01:38.576Z · LW(p) · GW(p)

Here is a heavily condensed summary of the takeoff speeds thread of the conversation, incorporating earlier points made by Hanson, Grace, etc. https://objection.lol/objection/3262835

:)

(kudos to Ben Goldhaber for pointing me to it)

comment by So8res · 2022-03-02T17:23:48.075Z · LW(p) · GW(p)

Question for Richard, Paul, and/or Rohin: What's a story, full of implausibly concrete details but nevertheless a member of some largish plausible-to-you cluster of possible outcomes, in which things go well? (Paying particular attention to how early AGI systems are deployed and to what purposes, or how catastrophic deployments are otherwise forstalled.)

Replies from: rohinmshah
comment by Rohin Shah (rohinmshah) · 2022-03-02T18:42:37.872Z · LW(p) · GW(p)

I wrote this doc a couple of years ago (while I was at CHAI). It's got many rough edges (I think I wrote it in one sitting and never bothered to rewrite it to make it better), but I still endorse the general gist, if we're talking about what systems are being deployed to do and what happens amongst organizations. It doesn't totally answer your question (it's more focused on what happens before we get systems that could kill everyone), but it seems pretty related.

(I haven't brought it up before because it seems to me like the disagreement is much more in the "mechanisms underlying intelligence", which that doc barely talks about, and the stuff it does say feels pretty outdated; I'd say different things now.)

Replies from: None
comment by [deleted] · 2022-03-02T20:51:16.550Z · LW(p) · GW(p)

If I didn't miss anything and I'm understanding the scenario correctly, then for this part:

At some point, we reach the level of interpretability where we are convinced that the evolved AI system is already aligned with us before even being finetuned on specific tasks, 

I'd expect that interpretability tools, if they work, would tell you "yup, this AI is planning to kill you as soon as it possibly can", without giving you a way to fix that (that's robust to capability gains). Ie this story still seems to rely on an unexplained step that goes "... and a miracle occurs where we fundamentally figure out how to align AI just in the nick of time".

Replies from: rohinmshah
comment by Rohin Shah (rohinmshah) · 2022-03-02T21:23:45.810Z · LW(p) · GW(p)

Totally agreed that the doc does not address that argument. Quoting from my original comment:

the disagreement is much more in the "mechanisms underlying intelligence", which that doc barely talks about, and the stuff it does say feels pretty outdated; I'd say different things now.

comment by Ben Pace (Benito) · 2022-03-02T05:46:35.189Z · LW(p) · GW(p)

Eliezer and Nate, my guess is that most of your perspective on the alignment problem for the past several years has come from the thinking and explorations you've personally done, rather than reading work done by others.

But, if you have read interesting work by others that's changed your mind or given you helpful insights, what has it been? Some old CS textbook? Random Gwern articles? An economics textbook? Playing around yourself with ML systems?

comment by Donald Hobson (donald-hobson) · 2022-03-06T02:33:05.259Z · LW(p) · GW(p)

One thing in the posts I found surprising was Eliezer's assertion that you needed a dangerous superintelligence to get nanotech. If the AI is expected to do everything itself, including inventing the concept of nanotech, I agree that this is dangerously superintelligent. 

However, suppose Alpha Quantum can reliably approximate the behaviour of almost any particle configuration. Not literally any (it can't simulate a quantum computer factorizing large numbers better than factoring algorithms can), but enough to design a nanomachine. (It has been trained to approximate the ground truth of quantum mechanics equations, and it does this very well.) 

For example, you could use IDA, start training to imitate a simulation of a handful of particles, then compose several smaller nets into one large one. 

Add a nice user interface and we can drag and drop atoms. 

You can add optimization, gradient descent trying to maximize the efficiency of a motor, or minimize the size of a logic gate. All of this is optimised to fit a simple equation, so assuming you don't have smart general mesaoptimizers forming, and deducing how to manipulate humans based on very little info about humans, you should be safe. Again, designing a few nanogears by gradient descent techniques and shallow heuristics shouldn't be hard. You also want to make sure not to design a nanocomputer containing a UFAI, but a computer is fairly large and obvious. (Optimizing for the smallest logic gate won't produce a UFAI.)

 

If the humans want to make a nanocomputer, they download an existing chip schematic, and scale it down, replacing the logic gates with nanologic. 

The first physical hardware would be a minimal nanoassembler. Analogue signals going from macroscopic to nanoscopic. The nanoassembler is a robotic arm. All the control decisions, the digital to analogue conversion, that's all macroscopic. This is of course, all in lab conditions. Perhaps this is produced with a scanning tunnelling microscope. Perhaps carefully designed proteins.

Once you have this, it shouldn't be too hard to bootstrap to create anything you can design.

Basically I don't think it is too hard for humans to create nanotech with the use of some narrowish and dumb AI. And I am wondering if this changes the strategic picture at all?

Replies from: VojtaKovarik, Vaniver, Gram Stone
comment by VojtaKovarik · 2022-03-16T08:21:27.077Z · LW(p) · GW(p)

(Not very sure I understood your description right, but here is my take:)

  • I think your proposal is not explaining some crucial steps, which are in fact hard. In particular, I understood it as "you have an AI which can give you blueprints for nano-sized machines". But I think we already have some blueprints; that isn't the issue. How we assemble them is the issue.
  • I expect that there will be more issues like this that you would find if you tried writing the plan in more detail.

However, I share the general sentiment behind your post --- I also don't understand why you can't get some pivotal act by combining human intelligence with some narrow AI. I expect that Eliezer has tried to come up with such combinations and came away with some general takeaways about this not being realistic. But I haven't done this exercise, so it seems not obvious to me. Perhaps it would be beneficial if many more people tried doing the exercise and then communicated the takeaways.

Replies from: RobbBB
comment by Rob Bensinger (RobbBB) · 2022-06-01T07:20:18.871Z · LW(p) · GW(p)

Perhaps it would be beneficial if many more people tried doing the exercise and then communicated the takeaways.

I think it would be!

comment by Vaniver · 2022-03-06T14:58:12.109Z · LW(p) · GW(p)

they download an existing chip schematic, and scale it down

Uh, how big do you think contemporary chips are?

Replies from: donald-hobson
comment by Donald Hobson (donald-hobson) · 2022-03-06T18:05:37.944Z · LW(p) · GW(p)

Like 10s of atoms across. So you aren't scaling down that much. (Most of your performance gains are in being able to stack your chips or whatever.)

comment by Gram Stone · 2022-03-06T15:55:45.567Z · LW(p) · GW(p)

I got the impression Eliezer's claiming that a dangerous superintelligence is merely sufficient for nanotech.

How would you save us with nanotech? It had better be good given all the hardware progress you just caused!

Replies from: RobbBB
comment by Rob Bensinger (RobbBB) · 2022-03-06T17:29:00.657Z · LW(p) · GW(p)

I got the impression Eliezer's claiming that a dangerous superintelligence is merely sufficient for nanotech.

No, I'm pretty confident Eliezer thinks AGI is both necessary and sufficient for nanotech. (Realistically/probabilistically speaking, given plausible levels of future investment into each tech. Obviously it's not logically necessary or sufficient.) Cf. my summary of Nate's view in Nate's reply to Joe Carlsmith [AF · GW]:

Nate agrees that if there's a sphexish way to build world-saving nanosystems, then this should immediately be the top priority, and would be the best way to save the world (that’s currently known to us). Nate doesn't predict that this is feasible, but it is on his list of the least-unlikely ways things could turn out well, out of the paths Nate can currently name in advance. (Most of Nate's hope for the future comes from some other surprise occurring that he hasn’t already thought of.)

(I read "sphexish" here as a special case of "narrow AI" / "shallow cognition", doing more things as a matter of pre-programmed reflex rather than as a matter of strategic choice.)

comment by Steven Byrnes (steve2152) · 2022-03-02T20:12:41.237Z · LW(p) · GW(p)

I wrote Consequentialism & Corrigibility [LW · GW] shortly after and partly in response to the first (Ngo-Yudkowsky) discussion [LW · GW]. If anyone has an argument or belief that the general architecture / approach I have in mind (see the “My corrigibility proposal sketch” section [LW · GW]) is fundamentally doomed as a path to corrigibility and capability—as opposed to merely “reliant on solving lots of hard-but-not-necessarily-impossible open problems”—I'd be interested to hear it. Thanks in advance. :)

Replies from: dxu, awenonian
comment by dxu · 2022-03-04T03:11:31.173Z · LW(p) · GW(p)

After reading some of the newer MIRI dialogues, I'm less convinced than I once was that I know what "corrigibility" actually is. Could you say a few words about what kind of behavior you concretely expect to see from a "corrigible" agent, followed by how [you expect] those behaviors [to] fit into the "trajectory-constraining" framework you propose in your post?

EDIT: This is not purely a question for Steven, incidentally (or at least, the first half isn't); anyone else who wants to take a shot at answering should feel free to do so. In particular I'd be interested in hearing answers from Eliezer or anyone else historically involved in the invention of the term.

Replies from: Benito, Algon
comment by Ben Pace (Benito) · 2022-03-04T03:23:59.167Z · LW(p) · GW(p)

My understanding: a corrigible paperclip-maximizer does all the paperclip-maximizing, but then when you realize it's gonna end the world, you go to turn it off, and it doesn't stop you. It's corrigible!

comment by Algon · 2022-03-04T16:17:19.652Z · LW(p) · GW(p)

There are a bunch of different definitions, but if you're asking for Eliezer's version, then the Arbital exposition is quite good. N.B. we don't have a model for this sort of corrigibility. 

EDIT: Be warned, these are rough summaries of the defs. I'd amend the CHAI def I cited to "the AI obeys more when it knows less, models you as more rational, and the downsides of disobedience are lesser". But people at CHAI have diverse views, so this is not the definitive CHAI take.

Other definitions include some people at CHAI's definition (the AI obeys you whilst it doesn't know what its utility function is), the definition used in the reward tampering paper (nearly the same as EY's original def, barring the honesty clause, and formalised in a causal diagram setting), and Stuart Armstrong's many definitions, most notably Utility Indifference (note the agent is NOT a standard R-maximiser): it accepts having its utility function changed at a later time because you're going to compensate it for its loss in utility, so it is indifferent to the change (this doesn't mean it won't kill you for spare parts though). And TurnTrout has what look like some interesting thoughts on the topic here [? · GW], but I haven't read those yet. 

Edit2: Paul thinks corrigibility has a simpler core than alignment, but is quite messy, and we won't get a crisp algorithm for it. But the intuition is the same as what Eliezer was pointing to, namely that the AI knows it should defer to the human, and will seek to preserve that deference in itself and its offspring. Plus being honest and helpful. Here [LW · GW] is a post where he rambles about it.

comment by awenonian · 2022-03-04T17:09:26.270Z · LW(p) · GW(p)

I'm a little confused what it hopes to accomplish. I mean, to start I'm a little confused by your example of "preferences not about future states" (i.e. 'the pizza shop employee is running around frantically, and I am laughing' is a future state).

But to me, I'm not sure what the mixing of "paperclips" vs "humans remain in control" accomplishes. On the one hand, I think if you can specify "humans remain in control" safely, you've solved the alignment problem already. On another, I wouldn't want that to seize the future: There are potentially much better futures where humans are not in control, but still alive/free/whatever. (e.g. the Sophotechs in the Golden Oecumene are very much in control). On a third, I would definitely, a lot, very much, prefer a 3 star 'paperclips' and 5 star 'humans in control' to a 5 star 'paperclips' and a 3 star 'humans in control', even though both would average 4 stars?

Replies from: steve2152
comment by Steven Byrnes (steve2152) · 2022-03-04T17:42:13.113Z · LW(p) · GW(p)

'the pizza shop employee is running around frantically, and I am laughing' is a future state

In my post I wrote: “To be more concrete, if I’m deciding between two possible courses of action, A and B, “preference over future states” would make the decision based on the state of the world after I finish the course of action—or more centrally, long after I finish the course of action. By contrast, “other kinds of preferences” would allow the decision to depend on anything, even including what happens during the course-of-action.”

So “the humans will ultimately wind up in control” would be a preference-over-future-states, and this preference would allow (indeed encourage) the AGI to disempower and later re-empower humans. By contrast, “the humans will remain in control” is not a pure preference-over-future-states, and relatedly does not encourage the AGI to disempower and later re-empower humans.
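One way to make the distinction concrete is to write both kinds of preference as functions of a whole trajectory; the scoring functions below are invented stand-ins, just to show the shape of the difference:

```python
# A trajectory is a list of world-states over time; control(state) says whether
# humans are in control in that state. Both scoring functions are toy stand-ins.

def score_future_state(trajectory, control):
    """'The humans will ultimately wind up in control': only the end matters, so a
    plan that temporarily disempowers and later re-empowers humans scores perfectly."""
    return 1.0 if control(trajectory[-1]) else 0.0

def score_whole_trajectory(trajectory, control):
    """'The humans will remain in control': every step matters, so the same
    disempower-then-re-empower plan is penalized for its middle."""
    return min(1.0 if control(state) else 0.0 for state in trajectory)

disempower_then_restore = ["in_control", "disempowered", "disempowered", "in_control"]
control = lambda state: state == "in_control"
print(score_future_state(disempower_then_restore, control))      # 1.0
print(score_whole_trajectory(disempower_then_restore, control))  # 0.0
```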

There are potentially much better futures where humans are not in control

If we knew exactly what long-term future we wanted, and we knew how to build an AGI that definitely also wanted that exact same long-term future, then we should certainly do that, instead of making a corrigible AGI. Unfortunately, we don't know those things right now, so under the circumstances, knowing how to make a corrigible AGI would be a useful thing to know how to do.

Also, this is not a hyper-specific corrigibility proposal; it's really a general AGI-motivation-sculpting proposal, applied to corrigibility. So even if you're totally opposed to corrigibility, you can still take an interest in the question of whether or not my proposal is fundamentally doomed. Because I think everyone agrees that AGI-motivation-sculpting is necessary.

I would definitely, a lot, very much, prefer a 3 star 'paperclips' and 5 star 'humans in control' to a 5 star 'paperclips' and a 3 star 'humans in control', even though both would average 4 stars?

It could be a weighted average. It could be a weighted average plus a nonlinear acceptability threshold on “humans in control”. It could be other things. I don't know; this is one of many important open questions.

I think if you can specify "humans remain in control" safely, you've solved the alignment problem already

See discussion under “Objection 1” in my post [LW · GW].

Replies from: awenonian
comment by awenonian · 2022-03-13T03:07:58.414Z · LW(p) · GW(p)

In my post I wrote:

Am I correct after reading this that this post is heavily related to embedded agency? I may have misunderstood the general attitudes, but I thought of "future states" as "future to now" not "future to my action." It seems like you couldn't possibly create a thing that works on the last one, unless you intend it to set everything in motion and then terminate. In the embedded agency sequence, they point out that embedded agents don't have well defined i/o channels. One way is that "action" is not a well defined term, and is often not atomic. 

It also sounds like you're trying to suggest that we should be judging trajectories, not states? I just want to note that this is, as far as I can tell, the plan: https://www.lesswrong.com/posts/K4aGvLnHvYgX9pZHS/the-fun-theory-sequence [LW · GW]

Life's utility function is over 4D trajectories, not just 3D outcomes.

From the synopsis of High Challenge [? · GW]

instead of making a corrigible AGI.

I'm not sure I interpret corrigibility as exactly the same as "preferring the humans remain in control" (I see you suggest this yourself in Objection 1, I wrote this before I reread that, but I'm going to leave it as is) and if you programmed that preference into a non-corrigible AI, it would still seize the future into states where the humans have to remain in control. Better than doom, but not ideal if we can avoid it with actual corrigibility.

But I think I miscommunicated, because, besides the above, I agree with everything else in those two paragraphs.

See discussion under “Objection 1” in my post [LW · GW].

I think I maintain that this feels like it doesn't solve much. Much of the discussion in the Yudkowsky conversations was that there's a concern on how to point powerful systems in any direction. Your response to objection 1 admits you don't claim this solves that, but that's most of the problem. If we do solve the problem of how to point a system at some abstract concept, why would we choose "the humans remain in control" and not "pursue humanity's CEV"? Do you expect "the humans remain in control" (or the combination of concepts you propose as an alternative) to be easier to define? Easier enough to define that it's worth pursuing even if we might choose the wrong combination of concepts? Do you expect something else?

comment by Zack_M_Davis · 2022-03-01T04:49:27.492Z · LW(p) · GW(p)

Question for anyone, but particularly interested in hearing from Christiano, Shah, or Ngo: any thoughts on what happens when alignment schemes that worked in lower-capability regimes fail to generalize to higher-capability regimes?

For example, you could imagine a spectrum of outcomes from "no generalization" (illustrative example: galaxies tiled with paperclips) to "some generalization" (illustrative example: galaxies tiled with "hedonium" human-ish happiness-brainware) to "enough generalization that existing humans recognizably survive, but something still went wrong from our current perspective" (illustrative examples: "Failed Utopia #4-2" [LW · GW], Friendship Is Optimal, "With Folded Hands"). Given that not every biological civilization solves the problem, what does the rest of the multiverse look like? (How is measure distributed on something like my example spectrum, or whatever I should have typed instead?)

(Previous work: Yudkowsky 2009 "Value Is Fragile" [LW · GW], Christiano 2018 "When Is Unaligned AI Morally Valuable?", Grace 2019 "But Exactly How Complex and Fragile?".)

Replies from: paulfchristiano, rohinmshah
comment by paulfchristiano · 2022-03-02T15:29:04.724Z · LW(p) · GW(p)

When alignment schemes fail to scale, I think it typically means that they work while the system is unable to overpower/outsmart the oversight process, and then break down when the system becomes able to do so. I think that this usually results in the AI shifting from behavior that is mostly constrained by the training process to behavior that is mostly unconstrained (once they effectively disempower humans).

I think the results are relatively unlikely to be good in virtue of "the AI internalized something about our values, just not everything", and I'm pretty skeptical of recognizable "near miss" scenarios rather than AI gradually careening in very hard-to-predict directions with minimal connection with the surface features of the training process. 

Overall I think that the most likely outcome is a universe that is orthogonal to anything we directly care about, maybe with a vaguely similar flavor owing to convergence depending on how AI motivations shake out. (But likely not close enough to feel great, and quite plausibly with almost no visible relation. Probably much more different from us than we are from aliens.)

I think it's fairly plausible that the results are OK just because of galaxy-brained considerations about cooperation and niceness, where we might have been in the AI's shoes and part of being a good cosmic citizen is not worrying too much about who gets to do what they want with the universe. That said, I'd put that at <50% chance, with uncertainty both over empirical questions of how the AI is trained and what kind of psychology that leads to, and very hard problems in moral philosophy.

It's also fairly plausible to me (maybe also ~50%) that such systems will care enough about humans to give them a tiny slice (e.g. 1e-30) of the universe, whether as part of complicated trade with other civilizations who didn't mess up alignment so much (or nicer AIs, or whatever other nice civilization is willing to spare 1e-30 of their endowment to help us out), or just because you don't have to care very much at all. But of course it's also plausible that destructive conflict between aggressive civilizations leads to horrifying outcomes for us, so it's not all roses (and even when everything is good those trades do come at real costs).

Don't have strong views on any of those questions; they seem important but not closely related to my day job so I haven't thought about them too much.

Replies from: Anirandis
comment by Anirandis · 2022-03-25T18:28:59.299Z · LW(p) · GW(p)

But of course it's also plausible that destructive conflict between aggressive civilizations leads to horrifying outcomes for us

 

Also, wouldn't you expect s-risks from this to be very unlikely by virtue of (1) civilizations like this being very unlikely to have substantial measure over the universe's resources, (2) transparency making bargaining far easier, and (3) few technologically advanced civilizations would care about humans suffering in particular as opposed to e.g. an adversary running emulations of their own species?

comment by Rohin Shah (rohinmshah) · 2022-03-02T18:31:58.087Z · LW(p) · GW(p)

Basically agree with Paul, and I especially want to note that I've barely thought about it and so this would likely change a ton with more information. To put some numbers of my own:

  • "No generalization": 65%
  • "Some generalization": 5% (I don't actually have stories where this is an outcome; this is more like model uncertainty)
  • "Lots of generalization, but something went wrong": 30%

These are from my own perspective of what these categories mean, which I expect are pretty different from yours -- e.g. maybe I'm at ~2% that upon reflection I'd decide that hedonium is great and so that's actually perfect generalization; in the last category I include lots of worlds that I wouldn't describe as "existing humans recognizably survive", e.g. we decide to become digital uploads, then get lots of cognitive enhancements, throw away a bunch of evolutionary baggage, but also we never expand to the stars because AI has taken control of it and given us only Earth.

I think the biggest avenues for improving the answers would be to reflect more on the kindness + cooperation and acausal trade stories Paul mentions, as well as the possibility that a few AIs end up generalizing close to correctly and working out a deal with other AIs that involves humanity getting, say, Earth.

Given that not every biological civilization solves the problem, what does the rest of the multiverse look like?

If we're imagining civilizations very similar to humanity, then the multiverse looks like ~100% of one of the options. Reality's true answer will be very overdetermined [LW · GW]; it is a failure of our map that we cannot determine the answer. I don't know much about quantum physics / many-worlds, but I'd be pretty surprised if small fluctuations to our world made a huge difference; you'll need a lot of fluctuations adding up to a lot of improbability before you affect a macro-level property like this, unless you just happen to already be on the knife-edge.

If the biological civilizations could be very different from ours, then I have no idea how to quickly reason about this question and don't have an answer, sorry.

Replies from: Lukas_Gloor
comment by Lukas_Gloor · 2022-03-03T12:48:57.728Z · LW(p) · GW(p)

If we're imagining civilizations very similar to humanity, then the multiverse looks like ~100% of one of the options. Reality's true answer will be very overdetermined [LW · GW]; it is a failure of our map that we cannot determine the answer. I don't know much about quantum physics / many-worlds, but I'd be pretty surprised if small fluctuations to our world made a huge difference; you'll need a lot of fluctuations adding up to a lot of improbability before you affect a macro-level property like this, unless you just happen to already be on the knife-edge.


This doesn't contradict anything you're saying but there's arguably a wager for thinking that we're on the knife-edge – our actions are more impactful if we are. 

[Edit to add point:] The degree to which any particular training approach generalizes is of course likely a fixed fact (like in the Lesswrong post you link to about fire). But different civilizations could try different training approaches, which produces heterogeneity for the multiverse.

comment by Gyrodiot · 2022-03-02T22:33:14.676Z · LW(p) · GW(p)

I finished reading all the conversations a few hours ago. I have no follow-up questions (except maybe "now what?"); I'm still updating from all those words.

One excerpt in particular, from the latest post, jumped out at me (from Eliezer Yudkowsky, emphasis mine):

This is not aimed particularly at you, but I hope the reader may understand something of why Eliezer Yudkowsky goes about sounding so gloomy all the time about other people's prospects for noticing what will kill them, by themselves, without Eliezer constantly hovering over their shoulder every minute prompting them with almost all of the answer.

The past years of reading about alignment have left me with an intense initial distrust of any alignment research agenda. Maybe it's ordinary paranoia, maybe something more. I've not come up with any new ideas myself, and I'm not particularly confident in my ability to find flaws in someone else's proposal (what if I'm not smart enough to understand it properly? What if I make things even more confused and waste everyone's time?).

After thousands and thousands of lengthy conversations where it takes everyone ages to understand where threat models disagree, why some avenue of research is promising or not, and what is behind words (there was a whimper in my mind when the meaning/usage of corrigibility was discussed, as if this whole time experts had been talking past each other)...

... after all that, I get this strong urge to create something like Arbital [LW · GW] to explain everything. Or maybe something simpler like Stampy. I don't know if it would help much; the confusion is just very frustrating. When I'm facilitating discussions, trying to bring more people into the field, I insist on how not-settled many posts are and on the kinds of failure modes you have to watch out for.

Also this gives me an extra push to try harder, publish more things, ask more questions, because I'm getting more desperate to make progress. So, thank you for publishing this sequence.

comment by Signer · 2022-03-02T22:00:42.900Z · LW(p) · GW(p)

Not sure if this is the right place to ask, instead of just googling it, but anyway: does anyone know the current state of AI security practices at DeepMind, OpenAI, and other such places? Like, did they estimate the probability of GPT-3 killing everyone before turning it on, do they have procedures for not turning something on, did they test these procedures by having someone impersonate an unaligned GPT and try to manipulate researchers, things like that?

Replies from: rohinmshah
comment by Rohin Shah (rohinmshah) · 2022-03-02T22:12:03.716Z · LW(p) · GW(p)

No, I very strongly predict they did not do things like that. I expect they (perhaps implicitly) predicted with high confidence that GPT-3 would not have the capabilities needed to kill everyone.

Replies from: Signer
comment by Signer · 2022-03-02T22:17:42.048Z · LW(p) · GW(p)

Do they have plans to do something in the future?

Replies from: rohinmshah
comment by Rohin Shah (rohinmshah) · 2022-03-03T07:39:06.864Z · LW(p) · GW(p)

I would assume that the safety teams plan to do this (certainly I plan to). It's less clear what the opinions are outside of the safety teams.

comment by Ben Pace (Benito) · 2022-03-02T05:42:17.848Z · LW(p) · GW(p)

Questions about the standard-university-textbook from the future that tells us how to build an AGI. I'll take answers on any of these!

  1. Where is ML in this textbook? Is it under a section called "god-forsaken approaches" or does it play a key role? Follow-up: Where is logical induction?
  2. If running superintelligent AGIs didn't kill you and death was cancelled in general, how long would it take you to write the textbook?
  3. Is there anything else you can share about this textbook? Do you know any of the other chapter names?
Replies from: vanessa-kosoy, paulfchristiano, rohinmshah, ricraz
comment by Vanessa Kosoy (vanessa-kosoy) · 2022-03-06T12:22:47.669Z · LW(p) · GW(p)

I'm going to try and write a table of contents for the textbook, just because it seems like a fun exercise.

Epistemic status: unbridled speculation

Volume I: Foundation

  • Preface [mentioning, ofc, the infamous incident of 2041]
  • Chapter 0: Introduction

Part I: Statistical Learning Theory

  • Chapter 1: Offline Learning [VC theory and Watanabe's singular learning theory are both special cases of what's in this chapter]
  • Chapter 2: Online Learning [infra-Bayesianism is introduced here, Garrabrant induction too]
  • Chapter 3: Reinforcement Learning
  • Chapter 4: Lifelong Learning [this chapter deals with traps, unobservable rewards and long-term planning]

Part II: Computational Learning Theory

  • Chapter 5: Algebraic Classes [the theory of SVMs is a special case of what's explained here]
  • Chapter 6: Circuits [learning various classes of circuits]
  • Chapter 7: Neural Networks
  • Chapter 8: ???
  • Chapter 9: Reflective Learning [some version of Turing reinforcement learning comes here]

Part III: Universal Priors

  • Chapter 10: Solomonoff's Prior [including regret analysis using algorithmic statistics]
  • Chapter 11: Bounded Simplicity Priors
  • Chapter 12: ??? [might involve: causality, time hierarchies, logical languages, noise-tolerant computation...]
  • Chapter 13: Physicalism and the Bridge Transform
  • Chapter 14: Intelligence Measures

Part IV: Multi-Agent Systems

  • Chapter 15: Impatient Games
  • Chapter 16: Population Games
  • Chapter 17: Space Bounds and Superrationality
  • Chapter 18: Language [cheap talk, agreement theorems...]

Part V: Alignment Protocols

  • Chapter 19: Quantilization
  • Chapter 20: Malign Capacity Bounds [about confidence thresholds and consensus algorithms]
  • Chapter 21: Value Learning [using the intelligence measures from chapter 14]
  • Chapter 22: ??? [debate and/or some version of IDA might be here, or not]
  • Chapter 23: ???

Volume II: Algorithms [about efficient algorithms for practical models of computation, and various trade-offs.]

???

Volume III: Applications

???

comment by paulfchristiano · 2022-03-02T15:36:05.955Z · LW(p) · GW(p)

I don't think there is an "AGI textbook" any more than there is an "industrialization textbook." There are lots of books about general principles and useful kinds of machines. That said, if I had to make wild guesses about roughly what that future understanding would look like:

  1. There is a recognizable concept of "learning" meaning something like "search for policies that perform well in past or simulated situations." That plays a large role, comparably important to planning or Bayesian inference. Logical induction is likely an elaboration of Bayesian inference that receives relatively little airtime except in specialized discussions.
  2. This one is tougher given that I don't know what "the textbook" is. And I guess in the story all other humans are magically disappeared? If I was stuck with a single AWS cluster from 2022 and given unlimited time, I'd wildly guess that it would take me something between 1e4 and 1e8 years to create an autopoietic AI that obsoleted my own contributions (mostly because serial time is extremely valuable and I have a lot of compute). Writing the textbook does not seem like very much work after having done the deed?
  3. I'd roughly guess big sections on learning, inference, planning, alignment, and clever algorithms for all of the above. I'd guess maybe 50% of content is smart versions of stuff we know now and 50% is stuff we didn't figure out at the time, but it depends a lot on how you define this textbook.
comment by Rohin Shah (rohinmshah) · 2022-03-02T21:09:07.988Z · LW(p) · GW(p)

I'm mostly going to answer assuming that there's not some incredibly different paradigm (i.e. something as different from ML as ML is from expert systems). I do think the probability of "incredibly different paradigm" is low.

I'm also going to answer about the textbook at, idk, the point at which GDP doubles every 8 years. (To avoid talking about the post-Singularity textbook that explains how to build a superintelligence with clearly understood "intelligence algorithms" that can run easily on one of today's laptops, which I know very little about.)

I think I roughly agree with Paul if you are talking about the textbook that tells us how to build the best systems for the tasks that we want to do. (Analogy: today's textbook for self-driving cars.) That being said, I think that much of the improvement over time will be driven by improvements specifically in ML. (Analogy: today's textbook for deep learning.) So we can talk about that textbook as well.

  1. It's a textbook that's entirely about "finding good programs through a large, efficient search with a stringent goal", which we currently call ML. The content may be primarily some new approach for achieving this, with neural nets being a historical footnote, or it might be entirely about neural nets (though presumably with new architectures or other changes from today). Logical induction doesn't appear in the textbook.
  2. Jeez, who knows. If I intuitively query my brain here, it mostly doesn't have an answer; a thousand vs. million vs. billion years don't really change my intuitive predictions about what I'd get done. So we can instead back it out from other estimates. Given timelines of 10^1 - 10^2 years, and, idk, ~10^6 humans working on the problem near the end, seems like I'm implicitly predicting ~10^7 human-years of effort in our actual world. Then you have to adjust for a ton of factors, e.g. my quality relative to the average, the importance of serial thinking time, the benefit that real-world humans get from AI products that I won't get, the difficulty of exploration in thought-space by 1 person vs. 10^6 people, etc. Maybe I end up at ~10^5 years as a median estimate with wide uncertainty (especially on the right tail). (The rough arithmetic is sketched just after this list.)
  3. Jeez, who knows. Probably chapters / sections on how to define search spaces of programs (currently, "architectures"), efficient search algorithms within those spaces (currently, "gradient descent" and "loss functions"), how to set a stringent goal (currently, "what dataset to use").
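(For concreteness, here is the back-of-envelope arithmetic behind the number in point 2, with every adjustment factor collapsed into a single made-up net factor; none of this is a real model, just the gut numbers above restated.)

```python
# Back-of-envelope only: the inputs are the gut estimates quoted above; the net
# adjustment is back-derived from the stated endpoints rather than modeled.
total_human_years = 1e7   # ~10^6 people near the end, over ~10^1 - 10^2 year timelines
net_adjustment = 1e2      # placeholder net of: quality vs. average, serial thinking time,
                          # missing AI tools, one person exploring idea-space alone, ...
solo_years_median = total_human_years / net_adjustment
print(f"~{solo_years_median:.0e} solo years (median; wide uncertainty on the right tail)")
```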
comment by Richard_Ngo (ricraz) · 2022-03-03T14:04:43.888Z · LW(p) · GW(p)
  1. Where is ML in this textbook? Is it under a section called "god-forsaken approaches" or does it play a key role? Follow-up: Where is logical induction?

Key role, but most current ML is in the "applied" section, whereas the "theory" section instead explains the principles by which neural nets (or future architectures) work on the inside. Logical induction is a sidebar at some point explaining the theoretical ideal we're working towards, like I assume AIXI is in some textbooks.

  3. Is there anything else you can share about this textbook? Do you know any of the other chapter names?

Planning, Abstraction, Reasoning, Self-awareness.

comment by Noah Topper (noah-topper) · 2022-03-02T04:40:03.511Z · LW(p) · GW(p)

Eliezer, do you have any advice for someone wanting to enter this research space at (from your perspective) the eleventh hour? I’ve just finished a BS in math and am starting a PhD in CS, but I still don’t feel like I have the technical skills to grapple with these issues, and probably won’t for a few years. What are the most plausible routes for someone like me to make a difference in alignment, if any?

Replies from: Eliezer_Yudkowsky
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2022-03-02T04:49:46.187Z · LW(p) · GW(p)

I don't have any such advice at the moment.  It's not clear to me what makes a difference at this point.

Replies from: rank-biserial, hath, None
comment by rank-biserial · 2022-03-02T05:51:42.428Z · LW(p) · GW(p)

What do you think about unironically hiring Terry Tao [LW(p) · GW(p)]?

Replies from: Eliezer_Yudkowsky, noah-topper
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2022-03-02T20:31:31.266Z · LW(p) · GW(p)

We'd absolutely pay him if he showed up and said he wanted to work on the problem.  Every time I've asked about trying anything like this, all the advisors claim that you cannot pay people at the Terry Tao level to work on problems that don't interest them.  We have already extensively verified that it doesn't particularly work for eg university professors.

Replies from: Bjartur Tómas, aaron-kaufman, WilliamKiely, rank-biserial, lc
comment by Tomás B. (Bjartur Tómas) · 2022-03-02T23:14:42.405Z · LW(p) · GW(p)

Every time I've asked about trying anything like this, all the advisors claim that you cannot pay people at the Terry Tao level to work on problems that don't interest them.

As I am sure you would agree, Neumann/Tao-level people are a very different breed from even very, very, very good professors. It is plausible they are significantly more sane than the average genius. 

Given the enormous glut of money in EA trying to help here and the terrifying thing where a lot of the people who matter have really short timelines [LW · GW], I think it is worth testing this empirically with Tao himself and Tao-level people. 

It is worth noting that Neumann occasionally did contract work for extraordinary sums. 

Replies from: TAG
comment by TAG · 2022-03-03T00:33:47.856Z · LW(p) · GW(p)

Neumann/Tao-level people are a very different breed from even very, very, very good professors. It is plausible they are significantly more sane than the average genius.

Von Neumann wanted to nuke eastern Europe.

comment by 25Hour (aaron-kaufman) · 2022-03-03T00:56:28.050Z · LW(p) · GW(p)

I'm not sure whether the unspoken context of this comment is "We tried to hire Terry Tao and he declined, citing lack of interest in AI alignment" vs "we assume, based on not having been contacted by Terry Tao, that he is not interested in AI alignment."

If the latter: the implicit assumption seems to be that if Terry Tao would find AI alignment to be an interesting project, we should strongly expect him to both know about it and have approached MIRI regarding it, neither of which seems particularly likely given the low public profile of both AI alignment in general and MIRI in particular.

If the former: bummer.

comment by WilliamKiely · 2022-04-02T01:33:26.302Z · LW(p) · GW(p)

You're probably already aware of this, but just in case not:

Demis Hassabis said the following about getting Terence Tao to work on AI safety:

I always imagine that as we got closer to the sort of gray zone that you were talking about earlier, the best thing to do might be to pause the pushing of the performance of these systems so that you can analyze down to minute detail exactly and maybe even prove things mathematically about the system so that you know the limits and otherwise of the systems that you're building. At that point I think all the world's greatest minds should probably be thinking about this problem. So that was what I would be advocating to you know the Terence Tao’s of this world, the best mathematicians. Actually I've even talked to him about this—I know you're working on the Riemann hypothesis or something which is the best thing in mathematics but actually this is more pressing. I have this sort of idea of like almost uh ‘Avengers assembled’ of the scientific world because that's a bit of like my dream.

comment by rank-biserial · 2022-03-02T20:44:29.034Z · LW(p) · GW(p)

The header image of Tao's blog is a graph representing "flattening the curve" of the Covid-19 spread. One avenue for convincing elite talent that alignment is a problem is a media campaign that brings the problem of alignment into popular consciousness.

Replies from: rank-biserial
comment by rank-biserial · 2022-03-02T20:51:54.626Z · LW(p) · GW(p)

I have some ideas about how this might begin. "Educational" YouTuber CGP Grey (5.2M subscribers) got talked into making a pair of videos advocating for anti-aging research by another large YouTuber, Kurzgesagt (18M subscribers). I'd bet that they could both be persuaded into making AI alignment videos.

Replies from: matthew-barnett, AprilSR, bideup
comment by Matthew Barnett (matthew-barnett) · 2022-03-02T21:05:59.181Z · LW(p) · GW(p)

Can you clarify what the term "Pascal-mugged" means in your comment? 

From what I can tell, the main reason why CGP Grey made those videos was because he's had a long-running desire to live to see the future. He's talked about it in his podcast with Brady Haran and it's hinted at in some of his older videos. I don't think there was much more to it than that.

As for Kurzgesagt, I believe it was coordinated with Keith Comito, the president of lifespan.io. As for why they suddenly decided to coordinate with lifespan.io, I have little idea. However, since their video was launched in conjunction with CGP Grey's, it's possible that CGP Grey was the first one to bring it up, after which Kurzgesagt reached out to people who could help.

Replies from: rank-biserial
comment by rank-biserial · 2022-03-02T21:14:14.631Z · LW(p) · GW(p)

Edited, poor choice.

I think they discuss it around here.

comment by AprilSR · 2022-03-03T00:55:23.932Z · LW(p) · GW(p)

I think CGP Grey recommended Bostrom's Superintelligence in a podcast once.
Edit: Source

comment by bideup · 2022-03-04T09:16:48.078Z · LW(p) · GW(p)

Grey’s 2014 video Humans Need Not Apply, about humans being skilled-out of the economy, was the first introduction for me and probably lots of other people to the idea that AI might cause problems. I’m sure he’d be up for making a video about alignment.

comment by lc · 2022-03-03T02:17:39.800Z · LW(p) · GW(p)

What do you think about trying to actually interest him?

comment by Noah Topper (noah-topper) · 2022-03-10T16:15:34.206Z · LW(p) · GW(p)

What do you think about ironically hiring Terry Tao?

comment by hath · 2022-03-03T00:58:16.060Z · LW(p) · GW(p)

Not even a "In 90% of possible worlds, we're irreversibly doomed, but in the remaining 10%, here's the advice that would work"?

comment by [deleted] · 2022-03-02T05:47:25.541Z · LW(p) · GW(p)

:(

comment by Ben Pace (Benito) · 2022-03-02T05:56:49.126Z · LW(p) · GW(p)

Eliezer, when you told Richard that your probability of a successful miracle is very low, you added the following note:

Though a lot of that is dominated, not by the probability of a positive miracle, but by the extent to which we seem unprepared to take advantage of it, and so would not be saved by one.

I don't mean to ask for positive fairy tales when I ask: could you list some things you could see in the world that would cause you to feel that we were well-prepared to take advantage of one if we got one?

My obvious quick guess would be "I know of an ML project that made a breakthrough as impressive as GPT-3 and this is being kept secret from the outside world, and the organization is keenly interested in alignment". But I am also interested in broader and less obvious ones. For example, if the folks around here had successfully made a covid vaccine, I think that would likely require us to be in a much more competent and responsive situation. Alternatively, if folks made other historic scientific breakthroughs guided by some model of how it helps prevent AI doom, I'd feel more like this power could be turned to relevant directions.

Anyway, these are some of the things I quickly generate, but I'm interested in what comes to your mind?

comment by Raemon · 2022-03-04T02:12:24.676Z · LW(p) · GW(p)

Curated. I found the entire sequence of conversations [? · GW] quite valuable, and it seemed good both to let people know it had wrapped up, and curate it while the AMA was still going on.

comment by Rob Bensinger (RobbBB) · 2022-03-03T19:41:57.396Z · LW(p) · GW(p)

Question from evelynciara on the EA Forum [EA · GW]:

Do you believe that AGI poses a greater existential risk than other proposed x-risk hazards, such as engineered pandemics? Why or why not?

Replies from: So8res
comment by So8res · 2022-03-03T23:19:22.186Z · LW(p) · GW(p)

For sure. It's tricky to wipe out humanity entirely without optimizing for that in particular -- nuclear war, climate change, and extremely bad natural pandemics look to me like they're at most global catastrophes, rather than existential threats. It might in fact be easier to wipe out humanity by engineering a pandemic that's specifically optimized for this task (than it is to develop AGI), but we don't see vast resources flowing into humanity-killing-virus projects, the way that we see vast resources flowing into AGI projects. By my accounting, most other x-risks look like wild tail risks (what if there's a large, competent, state-funded successfully-secretive death-cult???), whereas the AI x-risk is what happens by default, on the mainline (humanity is storming ahead towards AGI as fast as they can, pouring billions of dollars into it per year, and by default what happens when they succeed is that they accidentally unleash an optimizer that optimizes for our extinction, as a convergent instrumental subgoal of whatever rando thing it's optimizing).

Replies from: aidan-fitzgerald
comment by BrownHairedEevee (aidan-fitzgerald) · 2022-03-04T06:30:37.112Z · LW(p) · GW(p)

Hi, I'm the user who asked this question. Thank you for responding!

I see your point about how an AGI would intentionally destroy humanity versus engineered bugs that only wipe us out "by accident", but that's conditional on the AGI having "destroy humanity" as a subgoal. Most likely, a typical AGI will have some mundane, neutral-to-benevolent goal like "maximize profit by running this steel factory and selling steel". Maybe the AGI can achieve that by taking over an iron mine somewhere, or taking over a country (or the world) and enslaving its citizens, or even wiping out humanity. In general, my guess is that the AGI will try to do the least costly/risky thing needed to achieve its goal (maximizing profit), and (setting aside that if all of humanity were extinct, the AGI would have no one to sell steel to) wiping out humanity is the most expensive of these options and the AGI would likely get itself destroyed while trying to do that. So I think that "enslave a large portion of humanity and export cheap steel at a hefty profit" is a subgoal that this AGI would likely have, but destroying humanity is not.

It depends on the use case - a misaligned AGI in charge of the U.S. Armed Forces could end up starting a nuclear war - but given how careful the U.S. government has been about avoiding nuclear war, I think they'd insist on an AGI being very aligned with their interests before putting it in charge of something so high stakes.

Also, I suspect that some militaries (like North Korea's) might be developing bioweapons and spending 1% to 100% as much on it annually as OpenAI and DeepMind spend on AGI; we just don't know about it.

Based on your AGI-bioweapon analogy, I suspect that AGI is a greater hazard than bioweapons, but not by quite as much as your argument implies. While few well-resourced actors are interested in using bioweapons, a who's who of corporations, states, and NGOs will be interested in using AGI. And AGIs can adopt dangerous subgoals for a wide range of goals (especially resource extraction), whereas bioweapons can basically only kill large groups of people.

Replies from: RobbBB, mlogan, Signer, RobbBB, RobbBB
comment by Rob Bensinger (RobbBB) · 2022-03-04T20:50:21.212Z · LW(p) · GW(p)

[W]iping out humanity is the most expensive of these options and the AGI would likely get itself destroyed while trying to do that[.]

It would be pretty easy and cheap for something much smarter than a human to kill all humans. The classic scenario is [LW · GW]:

A.  [...] The notion of a 'superintelligence' is not that it sits around in Goldman Sachs's basement trading stocks for its corporate masters.  The concrete illustration I often use is that a superintelligence asks itself what the fastest possible route is to increasing its real-world power, and then, rather than bothering with the digital counters that humans call money, the superintelligence solves the protein structure prediction problem, emails some DNA sequences to online peptide synthesis labs, and gets back a batch of proteins which it can mix together to create an acoustically controlled equivalent of an artificial ribosome which it can use to make second-stage nanotechnology which manufactures third-stage nanotechnology which manufactures diamondoid molecular nanotechnology and then... well, it doesn't really matter from our perspective what comes after that, because from a human perspective any technology more advanced than molecular nanotech is just overkill.  A superintelligence with molecular nanotech does not wait for you to buy things from it in order for it to acquire money.  It just moves atoms around into whatever molecular structures or large-scale structures it wants.

Q.  How would it get the energy to move those atoms, if not by buying electricity from existing power plants?  Solar power?

A.  Indeed, one popular speculation is that optimal use of a star system's resources is to disassemble local gas giants (Jupiter in our case) for the raw materials to build a Dyson Sphere, an enclosure that captures all of a star's energy output.  This does not involve buying solar panels from human manufacturers, rather it involves self-replicating machinery which builds copies of itself on a rapid exponential curve -

If the smarter-than-human system doesn't initially have Internet access, it will probably be able to get such access either by manipulating humans, or by exploiting the physical world in unanticipated ways (cf. Bird and Layzell 2002).

But also, if enough people have AGI systems it's not as though no one will ever hook it up to the Internet, any more than you could give a nuke to every human on Earth and expect no one to ever use theirs.

Eliezer gives one example of a way to kill humanity with nanotech in his conversation with Jaan Tallinn [LW · GW]:

[...] Killing all humans is the obvious, probably resource-minimal measure to prevent those humans from building another AGI inside the solar system, which could be genuinely problematic.  The cost of a few micrograms of botulinum per human is really not that high and you get to reuse the diamondoid bacteria afterwards.

[... I]n my lower-bound concretely-visualized strategy for how I would do it, the AI either proliferates or activates already-proliferated tiny diamondoid bacteria and everybody immediately falls over dead during the same 1-second period, which minimizes the tiny probability of any unforeseen disruptions that could be caused by a human responding to a visible attack via some avenue that had not left any shadow on the Internet, previously scanned parts of the physical world, or other things the AI could look at. [...]

Are you assuming that early AGI systems won't be much smarter than a human?

Most likely, a typical AGI will have some mundane, neutral-to-benevolent goal like "maximize profit by running this steel factory and selling steel".

I don't think that goal is "neutral-to-benevolent", but I also don't think any early AGI systems will have goals remotely like that. Two reasons for that:

  • We have no idea how to align AI so that it reliably pursues any intended goal in the physical world; and we aren't on track to figuring that out before AGI is here. "Maximize profit by running this steel factory and selling steel" might be a goal the human operators have for the system; but the actual goal the system ends up optimizing will be something very different, "whatever goal (and overall model) happened to perform well in training, after a blind gradient-descent-ish search for goals (and overall models)".
    • If you can reliably instill an ultimate goal like "maximize profit by running this steel factory and selling steel" into an AGI system, then you've already mostly solved the alignment problem and eliminated most of the risk.
  • A more minor objection to this visualization: By default, I expect AGI to vastly exceed human intelligence and destroy the world long before it's being deployed in commercial applications. Instead, I'd expect early-stage research AI to destroy the world.
comment by mlogan · 2022-03-04T18:20:34.235Z · LW(p) · GW(p)

There have been a lot of words written about how and why almost any conceivable goal, even a mundane one like "improve efficiency of a steel plant", carelessly specified, can easily result in a hostile AGI. The basic outline of these arguments usually goes something like:

  1. The AGI wants to do what you told it ("make more steel"), and will optimize very hard for making as much steel as possible.
  2. It also understands human motivations and knows that humans don't actually want as much steel as it is going to make. But note carefully that it wasn't aligned to respect human motivations; it was aligned to make steel. Its understanding of human motivations is part of its understanding of its environment, in the same way as its understanding of metallurgy. It has no interest in doing what humans would want it to do because it hasn't been designed to do that.
  3. Because it knows that humans don't want as much steel as it is going to make, it will correctly conclude that humans will try to shut it off as soon as they understand what the AGI is planning to do.
  4. Therefore it will correctly reason that its goal of making more steel will be easier to achieve if humans are unable to shut it off. This can lead to all kinds of unwanted actions such as the AGI making and hiding copies of itself everywhere, very persuasively convincing humans that it is not going to make as much steel as it secretly plans to so that they don't try to shut it off, and so on all the way up to killing all humans.

Now, "make as much steel as possible" is an exceptionally stupid goal to give an AGI, and no one would likely do that. But every less stupid goal that has been proposed has had plausible flaws pointed out which generally lead either to extinction or some form of permanent limitation of human potential.

comment by Signer · 2022-03-04T08:57:25.161Z · LW(p) · GW(p)

One worry is that "maximize profit" means, simplifying, "maximize a number in a specific memory location", and you don't need humans for that. Then, if you expect to be destroyed, you gather power until you don't. And at some power level, destroying humans, though still expensive, becomes less expensive than the possibility of them launching another AI and messing with your profit.

comment by Rob Bensinger (RobbBB) · 2022-03-04T20:09:03.412Z · LW(p) · GW(p)

Reply by reallyeli on the EA Forum:

Toby Ord's definition of an existential catastrophe is "anything that destroys humanity's longterm potential." The worry is that misaligned AGI which vastly exceeds humanity's power would be basically in control of what happens with humans, just as humans are, currently, basically in control of what happens with chimpanzees. It doesn't need to kill all of us in order for this to be a very, very bad outcome.

E.g. the enslavement by the steel-loving AGI you describe sounds like an existential catastrophe, if that AGI is sufficiently superhuman. You describe a "large portion of humanity" enslaved in this scenario, implying a small portion remain free — but I don't think this would happen. Humans with meaningful freedom are a threat to the steel-lover's goals (e.g. they could build a rival AGI) so it would be instrumentally important to remove that freedom.

comment by Rob Bensinger (RobbBB) · 2022-03-05T18:35:30.344Z · LW(p) · GW(p)

Reply by acylhalide on the EA Forum:

The AGI would rather write programs to do the grunt work than employ humans, as programs can be more reliable, controllable, etc. It could create such agents by looking into its own source code and copying / modifying it. If it doesn't have this capability, it will spend time researching (could be years) until it does. On a thousand-year timescale, it isn't clear why an AGI would need us for anything besides, say, specimens for experiments.

Also as reallyeli says, having a single misaligned agent with absolute control of our future seems terrible no matter what the agent does.

comment by Mathieu Putz · 2022-03-02T13:44:36.734Z · LW(p) · GW(p)

I would be interested to hear opinions on what fraction of people could possibly produce useful alignment work.

Ignoring the hurdle of "knowing about AI safety at all", i.e. assuming they took some time to engage with it (e.g. they took the AGI Safety Fundamentals course). Also assume they got some good mentorship (e.g. from one of you) and then decided to commit full-time (and got funding for that). The thing I'm trying to get at is more about having the mental horsepower + epistemics + creativity + whatever other qualities are useful, or likely being able to get there after some years of training.

Also note that I mean direct useful work, not indirect meta things like outreach or being a PA to a good alignment researcher etc. (these can be super important, but I think it's productive to think of them as a distinct class). E.g. I would include being a software engineer at Anthropic, but exclude doing grocery-shopping for your favorite alignment researcher.

An answer could look like "X% of the general population" or "half the people who could get a STEM degree at Ivy League schools if they tried" or "a tenth of the people who win the Fields medal".

I think it's useful to have a sense of this for many purposes, incl. questions about community growth and the value of outreach in different contexts, as well as priors about one's own ability to contribute. Hence, I think it's worth discussing honestly, even though it can obviously be controversial (with some possible answers implying that most current AI safety people are not being useful).

Replies from: paulfchristiano, rohinmshah, michael-chen
comment by paulfchristiano · 2022-03-02T15:48:52.327Z · LW(p) · GW(p)

(Off the cuff answer including some random guesses and estimates I won't stand behind, focused on the kind of theoretical alignment work I'm spending most of my days thinking about right now.)

Over the long run I would guess that alignment is broadly similar to other research areas, where a large/healthy field could support lots of work from lots of people, where some kinds of contributions are very heavy-tailed but there is a lot of complementarity and many researchers are having large overall marginal impacts.

Right now I think difficulties (at least for growing the kind of alignment work I'm most excited about) are mostly related to trying to expand quickly, greatly exacerbated by not having a good idea what's going on / what we should be trying to do, and not having a straightforward motivating methodology/test case since you are trying to do things in advance motivated by altruistic impact. I'm still optimistic that we will be able to scale up reasonably quickly such that many more people are helpfully engaged in the future and eventually these difficulties will be resolved.

In the very short term, while other bottlenecks are severe, I think it's mostly a question of how to use complementary resources (like mentorship and discussion) rather than "who could do useful work in an absolute sense." My vague guess is that in the short term the bar will be kind of quantitatively similar to "would get a tenure-track role at a top university" though obviously our evaluations of the bar will be highly noisy and we are selecting on different properties and at an earlier career stage.

I think it's much easier for lots of people to work more independently and take swings at the problem and that this could also be quite valuable (though there are lots of valuable things to do). Unfortunately I think that's a somewhat harder task and there are fewer people who will have a good time doing it. But at least the hard parts depend on a somewhat different set of skills (e.g. more loaded on initiative and entrepreneurial spirit and being able to figure things out on your own) so may cover some people who wouldn't make sense as early hires, and also there may be a lot of people who would be great hires but where it's too hard to tell from an application process.

Replies from: Mathieu Putz
comment by Mathieu Putz · 2022-03-03T16:32:58.648Z · LW(p) · GW(p)

Hey Paul, thanks for taking the time to write that up, that's very helpful!

comment by Rohin Shah (rohinmshah) · 2022-03-03T09:45:35.864Z · LW(p) · GW(p)

"Possibly produce useful alignment work" is a really low bar, such that the answer is ~100%. Lots of things are possible. I'm going to instead answer "for what fraction of people would I think that the Long-Term Future Fund should fund them on the current margin".

If you imagine that the people are motivated to work on AI safety, get good mentorship, and are working full-time, then I think on my views most people who could get into an ML PhD in any university would qualify, and a similar number of other people as well (e.g. strong coders who are less good at the random stuff that academia wants). Primarily this is because I think that the mentors have useful ideas that could progress faster with "normal science" work (rather than requiring "paradigm-defining" work).

In practice, there is not that much mentorship to go around, and so the mentors end up spending time with the strongest people from the previous category, and so the weakest people end up not having mentorship and so aren't worth funding on the current margin.

I'd hope that this changes in the next few years, with the field transitioning from "you can do 'normal science' if you are frequently talking to one of the people who have paradigms in their head" to "the paradigms are understandable from the online written material; one can do 'normal science' within a paradigm autonomously".

Replies from: Mathieu Putz
comment by Mathieu Putz · 2022-03-03T16:27:45.798Z · LW(p) · GW(p)

Hey Rohin, thanks a lot, that's genuinely super helpful. Drawing analogies to "normal science" seems both reasonable and like it clears the picture up a lot.

comment by mic (michael-chen) · 2022-03-02T23:03:12.671Z · LW(p) · GW(p)

Anthropic says that they're looking for experienced engineers who are able to dive into an unfamiliar codebase and solve nasty bugs and/or are able to handle interesting problems with distributed systems and parallel processing. I was personally surprised to get an internship offer from CHAI and expected the bar for getting an AI safety role to be much higher. I'd guess that the average person able to get a software engineering job at Facebook, Microsoft, Google, etc. (not that I've ever received an offer from any of those companies), or perhaps a broader category of people, could do useful direct work, especially if they committed time to gaining relevant skills if necessary. But I might be wrong. (This is all assuming that Anthropic, Redwood, CHAI, etc. are doing useful alignment work.)

comment by [deleted] · 2022-03-01T22:42:36.600Z · LW(p) · GW(p)

I'm still very vague on Yudkowsky's main counterargument to Ngo in the dialogues — about how saving the world requires a powerful search over a large space of possibilities, and therefore by default involves running dangerous optimizers that will kill us. This is a more concrete question aiming to make my understanding less vague; Yudkowsky said:

"AI systems that do better alignment research" are dangerous in virtue of the lethally powerful work they are doing, not because of some particular narrow way of doing that work.  If you can do it by gradient descent then that means gradient descent got to the point of doing lethally dangerous work.  Asking for safely weak systems that do world-savingly strong tasks is almost everywhere a case of asking for nonwet water, and asking for AI that does alignment research is an extreme case in point.

I don't understand why alignment research falls into this bucket of "world-savingly strong, therefore lethally strong". My intuitive reasoning is: the inner-alignment problem is a math problem, about certain properties of things involving functions A -> (B -> A) [LW · GW] or whatever; and if we actually knew how to phrase that math problem crisply and precisely (my outsider impression is nobody does currently?), and if we had a theorem-prover that could answer math questions that hard, that would solve it safely — in particular, it wouldn't involve knowing anything about the universe or us. It doesn't need to steer the universe into a tiny region of the state space that's a bright future; solving the math problem doesn't steer the universe into any particular place at all.

Yudkowsky said in one of the conversations something like, if you could save the world with a good theorem prover, that would be the best available plan; why couldn't that in principle apply to the inner-alignment problem?

(My guess is that Yudkowsky's model's answer is something like "a theorem prover that involves investigating the properties of optimization processes will surely have a lot of optimization processes running around in it / algorithmically close to it; one of those will be what kills the theorem-prover, and then us. I.e., the proof search isn't the dangerous search; but if you're asking math questions about dangerous searches, that's liable to go terribly wrong. You are now two outer-shell commands from instant death (one to feed in the universe & solution to the outer-alignment problem, and another to execute the plan), which is close enough to instantly killing everyone in reality." But if that's even on the right track, I'm very vague on what the thing I just wrote there even means; elaboration/clarification requested.)

Replies from: Eliezer_Yudkowsky
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2022-03-01T22:59:37.236Z · LW(p) · GW(p)

If humans could state a theorem such that, if only a theorem-prover would give us a machine-verifiable proof of it for us, the world would thereby be saved, we would be a great deal closer to survival than we are in reality.

This isn't how anything works, alas, nor is it a way that anything can work.

Replies from: None
comment by [deleted] · 2022-03-01T23:50:31.142Z · LW(p) · GW(p)

I remain confused as to why — or why you couldn't have a theorem that would substantially help, at any rate. I'm dreaming of a statement like, I don't know, "there exists an optimizer f in the set of optimizers satisfying certain properties", where the properties are things like "doesn't, while running, create other optimizers that cause it to fail to optimize over the outer objective" or something. And then a proof of that will return a particular way of doing optimization f, together with a proof that it satisfies those properties. (I apologize for being hopelessly vague.)
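Schematically, and with entirely made-up symbols (f for the optimizer, F for the candidate space, the P's for the properties), the shape I'm gesturing at might be something like:

```latex
% Purely schematic; f, F, and the P_i are placeholders, not defined objects.
\exists f \in F:\quad
  P_{\text{spawns no rogue subagents}}(f)
  \;\wedge\;
  P_{\text{keeps optimizing the outer objective}}(f)
  \;\wedge\; \dots
```

where a constructive proof would hand back the witness f, i.e. "a particular way of doing optimization".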

Is the problem (1) that I'm totally wrong about the extent to which inner alignment could in actual reality be phrased as a theorem like that by humans, (2) that it wouldn't help much with saving the world if the search succeeded, (3) that the proof search is dangerous, or (4) something else?

I think you're saying it's not (3) (or if it is it's not the main problem), and it is (1). If so, why? (Hoping you can convey some of the intuitions, but I can accept "listen, you don't have any idea of what you're asking for there, and if you did understand at all, you'd see why it's not possible for humans to do it in practice; it's not something I can explain briefly, but trust me, it's a waste of time to even look in that direction".)

Replies from: So8res, Eliezer_Yudkowsky
comment by So8res · 2022-03-02T05:08:35.810Z · LW(p) · GW(p)

The way the type above corresponds loosely to the "type of agency" (if you kinda squint at the arrow symbols and play fast-and-loose) is that it suggests a machine that eats a description of how actions (A) lead to outcomes (B), and produces from that description an action.

Consider stating an alignment property for some f of this type. What sort of thing must it say?

Perhaps you wish to say "when f is fed the actual description of the world, it selects the best possible action". Congratulations, such an f in fact exists; it is called argmax. This does not help you.
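(A toy sketch of that point, with made-up names, and Python standing in for the real math: if the "description" handed to f is just a scoring of a finite set of actions, then the type is trivially inhabited, and the inhabitant is exactly the unhelpful one.)

```python
from typing import Callable, Sequence, TypeVar

A = TypeVar("A")  # the action type

def argmax_policy(actions: Sequence[A], describe: Callable[[A], float]) -> A:
    """Trivial inhabitant of the 'agency' type: eat a description of how each
    action scores, return an action. It "selects the best possible action" by
    construction -- and helps not at all, because all the hard work (world
    model, consequence prediction, evaluation of outcomes) is hidden inside
    `describe`."""
    return max(actions, key=describe)
```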

Perhaps you instead wish to say "when f is fed the actual description of the world, it selects an action that gets at least 0.5 utility, after consuming only 10^15 units of compute" or whatever. Now, set aside the fact that you won't find such a function with your theorem-prover AI before somebody else has ended the world (understanding intelligence well enough to build an f that you can prove that theorem about is pro'lly harder than whatever else people are deploying AGIs towards), and set aside also the fact that you're leaving a lot of utility on the table; even if that worked, you're still screwed.

Why are you still screwed? Because the resulting function has the property "if we feed in a correct description of which actions have which utilities, then the optimizer selects an action with high utility". But an enormous chunk of the hard work is in the creation of that description!

For one thing, while our world may have a simple mathematical description (a la "it's some quantum fields doing some quantum fielding"), we don't yet have the true name of our universe. For another thing, even if we did, the level of description that an optimizer works with likely needs to be much coarser than this. For a third thing, even if we had a good coarse-grained description of the world, calculating the consequences that follow from a given action is hard. For a fourth thing, evaluating the goodness of the resulting outcome is hard.

If you can do all those things, then congrats!, you've solved alignment (and a good chunk of capabilities). All that's left is the thing that can operate your description and search through it for high-ranked actions (a remaining capabilities problem).


This isn't intended to be an argument that there does not exist any logical sentence such that a proof of it would save our skins. I'm trying to say something more like: by the time you can write down the sorts of sentences people usually seem to hope for, you've probably solved alignment, and can describe how to build an aligned cognitive system directly, without needing to postulate the indirection where you train up some other system to prove your theorem.

For this reason, I have little hope in sentences of the form "here is an aligned AGI", on account of how once you can say "aligned" in math, you're mostly done and probably don't need the intermediate. Maybe there's some separate, much simpler theorem that we could prove and save our skins -- I doubt we'll find one, but maybe there's some simple mathematical question at the heart of some pivotal action, such that a proof one way or the other would suddenly allow humans to... <??? something pivotal, I don't know, I don't expect such a thing, don't ask me>. But nobody's come up with one that I've heard of. And nobody seems close. And nobody even seems to be really trying all that hard. Like, you don't hear of people talking about their compelling theory of why a given mathematical conjecture is all that stands between humans and <???>, and them banging out the details of their formalization which they expect to only take one more year. Which is, y'know, what it would sound like if they were going to succeed at banging their thing out in five years, and have the pivotal act happen in 15. So, I'm not holding my breath.

Replies from: None
comment by [deleted] · 2022-03-02T05:55:55.387Z · LW(p) · GW(p)

Thanks, this helps a lot. The point about a lot of the work being in the mapping from actions to utilities is well-taken.

This line of thinking was from a vague intuition that it ought to be possible to do what evolution and gradient descent can't, but do it faster than argmax, and have it be not-impossibly-complicated. But it sounds like the way I was thinking about that was not productive; thanks.

comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2022-03-02T04:41:43.307Z · LW(p) · GW(p)

It's #1, with a light side order of #3 that doesn't matter because #1.

I'm not sure where to start on explaining.  How would you state a theorem that an AGI would put two cellular-identical strawberries on a plate, including inventing and building all technology required to do that, without destroying the world?  If you can state this theorem you've done 250% of the work required to align an AGI.

Replies from: None
comment by [deleted] · 2022-03-02T05:45:34.751Z · LW(p) · GW(p)

Thanks, this helps. But I think what I was imagining wouldn't be enough to let you put two cellular-identical strawberries on a plate without destroying the world? Rather, it would let you definitely put two cellular-identical strawberries on a plate, almost certainly while destroying the world.

My understanding is that, right now, we couldn't design a paperclip maximizer if we tried; we'd just end up with something that is to paperclip maximization what we are to inclusive genetic fitness. (Is that even right?) That's the problem that struck me as maybe-possibly amenable to proof search.

So, proving the theorem would give you a scheme that can do what evolution and gradient descent can't (and faster than argmax). And then if you told it to make strawberries, it'd do that while destroying the world; if you told it to make strawberries without destroying the world, it'd do that too, but that would be a lot harder to express since value is fragile. So what I was imagining wouldn't be enough to stop everyone from dying, but it would make progress on alignment.

(As for how I'd do it, I don't know, I think I don't understand what "optimization" even is, really 🙁)

Hopefully it makes sense what I'm asking — still the case that my intuition about it maybe-possibly being amenable to proof search is just wrong, y/n?

comment by Raemon · 2022-03-02T23:00:37.028Z · LW(p) · GW(p)

It seems to me that a major crux about AI strategy routes through "is civilization generally adequate or not?". It seems like people have pretty different intuitions and ontologies here. Here's an attempt at some questions of varying levels of concreteness, to tease out some worldview implications. 

(I normally use the phrase "civilizational adequacy", but I think that's kinda a technical term that means a specific thing and I think maybe I'm pointing at a broader concept.)

"Does civilization generally behave sensibly?" This is a vague question, some possible subquestions:

  • Do you think major AI orgs will realize that AI is potentially worldendingly dangerous, and have any kind of process at all to handle that? [edit: followup: how sufficient are those processes?]
  • Do you think government intervention on AI regulations or policies will be net-positive or net-negative, for purposes of preventing x-risk?
  • How quickly do you think the AI ecosystem will update on new "promising" advances (either in the realm of capabilities or the realm of safety)?
  • How many intelligent, sensible people do there seem to be in the world who are thinking about AGI? (order of magnitude. like is there 1, 10, 1000, 100,000?)
  • What's your vague emotional valence towards "civilization generally", "the AI ecosystem in particular", or "the parts of government that engage with AGI"?
  • [edit: Does any of this feel like it's cruxy for your views on AI?]

"Why do you believe what you believe about civilizational adequacy?"

  • What new facts would change your mind about any of the above questions?
  • What were the causal nodes in your history that led you to form your opinions about the above?
  • If you imagine turning out to be wrong/confused in some deep way about any of the above, do you have a sense of why that would turn out to be?

"If all of these questions feel ill-formed, can you substitute some questions in your own ontology and answer them instead?"

Replies from: rohinmshah
comment by Rohin Shah (rohinmshah) · 2022-03-03T08:57:58.452Z · LW(p) · GW(p)

I don't think this is the main crux -- disagreements about mechanisms of intelligence seem far more important -- but to answer the questions:

Do you think major AI orgs will realize that AI is potentially worldendingly dangerous, and have any kind of process at all to handle that?

Clearly yes? They have safety teams that are focused on x-risk? I suspect I have misunderstood your question.

(Maybe you mean the bigger tech companies like FAANG, in which case I'm still at > 95% on yes, but I suspect I am still misunderstanding your question.)

(I know less about Chinese orgs but I still think "probably yes" if they become major AGI orgs.)

Do you think government intervention on AI regulations or policies will be net-positive or net-negative, for purposes of preventing x-risk?

Net positive, though mostly because it seems kinda hard to be net negative relative to "no regulation at all", not because I think the regulations will be well thought out. The main tradeoff that companies face seems to be speed / capabilities vs safety; it seems unlikely that even "random" regulations increase the speed and capabilities that companies can achieve. (Though it's certainly possible, e.g. a regulation for openness of research / reducing companies' ability to keep trade secrets.)

Note I am not including "the military races to get AGI" since that doesn't seem within scope, but if we include that, I think I'm at net negative but my view here is really unstable.

How quickly do you think the AI ecosystem will update on new "promising" advances (either in the realm of capabilities or the realm of safety)?

Not sure how to answer this. Intellectual thought leaders will have their own "grand theory" hobbyhorses, from which they are unlikely to update very much. (This includes the participants in these dialogues.) But the communities as a whole will switch between "grand theories"; currently the timescale is 2-10 years. That timescale will shorten over time, but the interval between such switches will lengthen. Later on, once we start getting into takeoff (e.g. GDP growth rate doubles), both the timescale and the intervals shorten (but also most of my predictions are now very low probability because the world has changed a bunch in ways I didn't foresee).

For more simple advances like "new auxiliary loss that leads to more efficient learning" or "a particular failure mode and how to avoid it by retraining your human raters", the speed at which they are incorporated depends on complexity of implementation, amount of publicity, economic value, etc, but typical numbers are 1 week - 1 year.

How many intelligent, sensible people do there seem to be in the world who are thinking about AGI? (order of magnitude. like is there 1, 10, 1000, 100,000?)

"Sensible" seems incredibly subjective so that predictions about it do not actually lead to communication of information between people, so I'm going to ignore that. In that case, I'd say currently 300 FTE-equivalents in or adjacent to AI safety, and 1000 FTE-equivalents on AGI more generally.

What's your vague emotional valence towards "civilization generally", "the AI ecosystem in particular", or "the parts of government that engage with AGI"?

Civilization generally seems pretty pathetic at doing good things relative to what "could" be accomplished. Incentives are frequently crazy. Good things frequently don't happen because of a minority with veto powers. Decisionmakers frequently take knowably bad actions that would look good to the people judging them. I am pretty frustrated at it.

Also, the world still runs mostly smoothly, and most of the people involved seem to be doing reasonable things given the constraints on them (+ natural human inclinations like risk aversion).

Peer review is extremely annoying to navigate and one of the biggest benefits of DeepMind relative to academia is that I don't have to deal with it nearly as much. Reviewers seem unable to grasp basic conceptual arguments if they're at all different from the standard style of argument in papers. Less often but still way too frequently they display failures of basic reading comprehension. The specific people I meet in the AI ecosystem are better but still have frustrating amounts of Epistemic Learned Helplessness and other heuristics like "theory is useless without experiments" that effectively mean that they can't ever make a hard update.

Also, more senior people are more competent and have less of these properties, people can and do change their minds over time (albeit over the course of years rather than hours), and the things AI people say often seem about as reasonable as the things that AI safety people say (controlling for seniority).

I'm scared of the zero-sum thinking that I'm told goes on in the military. Outside of the military, governments feel like hugely influential players whose actions are (currently) very unpredictable. It feels incredibly high-stakes, but also has the potential to be incredibly good.

What new facts would change your mind about any of the above questions?

Though it isn't a "fact", one route is to show me an abstract theory that does well at predicting civilizational responses, especially the parts that are not high profile (e.g. I want a theory that doesn't just explain COVID response -- it also explains seatbelt regulations, how property taxes are set, and why knives aren't significantly regulated, etc).

What were the causal nodes in your history that led you to form your opinions about the above?

Those were a lot of pretty different opinions, each of which has lots of different causal nodes. I currently don't see a big category that influenced all of them, so I think I'm going to punt on this question.

If you imagine turning out to be wrong/confused in some deep way about any of the above, do you have a sense of why that would turn out to be?

I only looked at things that can be observed online, rather than looking at all the other different ways that humans interact with the world; those other ways would have demonstrated obvious flaws in my answers.

Replies from: Raemon
comment by Raemon · 2022-03-03T09:53:03.381Z · LW(p) · GW(p)

Thanks. I wasn't super satisfied with the way I phrased my questions. I just made some slight edits to them (labeled as such), although they still don't feel like they quite do the thing. (I feel like I'm looking at a bunch of subtle frame disconnects, while multiple other frame disconnects are going on, so pinpointing the thing is hard.)

I think "is any of this actually cruxy" is maybe the most important question and I should have included it. You answered "not supermuch, at least compared to models of intelligence". Do you think there's any similar nearby thing that feels more relevant on your end?

In any case, thanks for your answers, they do help give me more a sense of the gestalt of your worldview here, however relevant it is.

Replies from: rohinmshah
comment by Rohin Shah (rohinmshah) · 2022-03-10T01:25:25.551Z · LW(p) · GW(p)

It's definitely cruxy in the sense that changing my opinions on any of these would shift my p(doom) some amount.

My rough model is that there's an unknown quantity about reality which is roughly "how strong does the oversight process have to be before the trained model does what the oversight process intended for it to do". p(doom) mainly depends on whether the actors training the powerful systems have sufficiently powerful oversight processes. This seems primarily affected by the quality of technical alignment solutions, but certainly civilizational adequacy also affects the answer.

comment by Raemon · 2022-03-02T03:46:59.559Z · LW(p) · GW(p)

There's something I had interpreted the original CEV paper to be implying, but wasn't sure if it was still part of the strategic landscape, which was "have the alignment project be working towards a goal that was highly visibly fair, to disincentivize races." Was that an intentional part of the goal, or was it just that CEV seemed something like "the right thing to do" (independent of its impact on races)?

How does Eliezer think about it now?

Replies from: Eliezer_Yudkowsky
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2022-03-02T04:48:42.147Z · LW(p) · GW(p)

Yes, it was an intentional part of the goal.

If there were any possibility of surviving the first AGI built, then it would be nice to have AGI projects promising to do something that wouldn't look like trying to seize control of the Future for themselves, when, much later (subjectively?), they became able to do something like CEV.  I don't see much evidence that they're able to think on the level of abstraction that CEV was stated on, though, nor that they're able to understand the 'seizing control of the Future' failure mode that CEV is meant to prevent, and they would not understand why CEV was a solution to the problem while 'Apple pie and democracy for everyone forever!' was not a solution to that problem.  If at most one AGI project can understand the problem to which CEV is a solution, then it's not a solution to races between AGI projects.  I suppose it could still be a solution to letting one AGI project scale even when incorporating highly intelligent people with some object-level moral disagreements.

comment by Raemon · 2022-03-04T08:49:14.739Z · LW(p) · GW(p)

To what extent do you think pivotal-acts-in-particular are strategically important (i.e. "successfully do a pivotal act, and if necessary build an AGI to do it" is the primary driving goal), vs "pivotal acts are useful shorthand to refer to the kind of intelligence level where it matters that an AGI be 'really safe'"?

I'm interested in particular in responses from Eliezer, Rohin, and perhaps Richard Ngo. (I've had private chats with Rohin that I thought were useful to share and this comment is sort of creating a framing device for sharing them, but I've been kind of confused about this throughout the sequence)

My model of Eliezer thinks that they are both strategically important, and separately useful for forcing the conversations about intelligence to deal with "meaningful superintelligence". But I think confusion about which-thing-we're-talking-about might have contributed to talking-past-each-other in the Eliezer/Richard case.

Replies from: RobbBB, rohinmshah
comment by Rob Bensinger (RobbBB) · 2022-03-04T19:49:18.236Z · LW(p) · GW(p)

My Eliezer-model thinks pivotal acts are genuinely, for-real, actually important. Like, he's not being metaphorical or making a pedagogical point when he says (paraphrasing) 'we need to use the first AGI systems to execute a huge, disruptive, game-board-flipping action, or we're all dead'.

When my Eliezer-model says that the most plausible pivotal acts he's aware of involve capabilities roughly at the level of 'develop nanotech' or 'put two cellular-identical strawberries on a plate', he's being completely literal. If some significantly weaker capability level realistically suffices for a pivotal act, then my Eliezer-model wants us to switch to focusing on that (far safer) capability level instead.

If we can save the world before we get anywhere near AGI, then we don't necessarily have to sort out how consequentialist, dangerous, hardware-overhang-y, etc. the first AGI systems will be. We can just push the 'End The Acute Existential Risk Period' button, and punt most other questions to the non-time-pressured Reflection that follows.

comment by Rohin Shah (rohinmshah) · 2022-03-06T12:33:04.906Z · LW(p) · GW(p)

The goal is to bring x-risk down to near-zero, aka "End the Acute Risk Period". My usual story for how we do this is roughly "we create a methodology for building AI systems that allows you to align them at low cost relative to the cost of gaining capabilities; everyone uses this method, we have some governance / regulations to catch any stragglers who aren't using it but still can make dangerous systems".

If I talk to Eliezer, I expect him to say "yes, in this story you have executed a pivotal act, via magical low-cost alignment that we definitely do not get before we all die". In other words, the crux is in whether you can get an alignment solution with the properties I mentioned (and maybe also in whether people will be sensible enough to use the method + do the right governance). So with Eliezer I end up talking about those cruxes, rather than talking about "pivotal acts" per se, but I'm always imagining the "get an alignment solution, have everyone use it" plan.

When I talk to people who are attempting to model Eliezer, or defer to Eliezer, or speaking out of their own model that's heavily Eliezer-based, and I present this plan to them, and then they start thinking about pivotal acts, they do not say the thing Eliezer says above. I get the sense that they see "pivotal act" as some discrete, powerful, gameboard-flipping action taken at a particular point in time that changes x-risk from non-trivial to trivial, rather than as a referent to the much broader thing of "whatever ends the acute risk period". My plan doesn't involve anything as discrete and powerful as "melt all the GPUs", so from their perspective, a pivotal act hasn't happened, and the cached belief is that if a pivotal act hasn't happened, then we all die, therefore my plan leads to us all dying. With those people I end up talking about how "pivotal act" is a referent to the goal of "End the Acute Risk Period" and if you achieve that you have won and there's nothing else left to do; it doesn't matter that it wasn't "discrete" or "gameboard-flipping".

To answer the original question: if by "pivotal act" you mean "anything that ends the acute risk period", then I think that pivotal acts are strategically important. If by "pivotal act" you mean "discrete, powerful, gameboard-flipping actions", then I'm not all that interested in it but it seems fine to use it as a referent to the kind of intelligence level where it really matters that AGI is safe.

Replies from: steve2152, Algon
comment by Steven Byrnes (steve2152) · 2022-03-06T15:13:56.043Z · LW(p) · GW(p)

Huh. I'm under the impression that "offense-defense balance for technology-inventing AGIs" is also a big cruxy difference between you and Eliezer.

Specifically: if almost everyone is making helpful aligned norm-following AGIs, but one secret military lab accidentally makes a misaligned paperclip maximizer, can the latter crush all competition? My impression is that Eliezer thinks yes: there's really no defense against self-replicating nano-machines, so the only paths to victory are absolutely perfect compliance forever (which he sees as implausible, given secret military labs etc.) or someone uses an aligned AGI to do a drastic-seeming pivotal act in the general category of GPU-melting nanobots. Whereas you disagree.

Sorry if I'm putting words in anyone's mouths.

For my part, I don't have an informed opinion about offense-defense balance, i.e. whether more-powerful-and-numerous aligned AGIs can defend against one paperclipper born in a secret military lab accident. I guess I'd have to read Drexler's nano book or something. At the very least, I don't see it as a slam dunk in favor of Team Aligned [LW(p) · GW(p)], I see it as a question that could go either way.

Replies from: rohinmshah, None
comment by Rohin Shah (rohinmshah) · 2022-03-07T12:30:17.503Z · LW(p) · GW(p)

I agree that is also moderately cruxy (but less so, at least for me, than "high-capabilities alignment is extremely difficult").

comment by [deleted] · 2022-03-06T21:42:14.218Z · LW(p) · GW(p)

One datapoint I really liked about this: https://arxiv.org/abs/2104.03113 (Scaling Scaling Laws with Board Games). They train AlphaZero agents of different sizes to compete on the game Hex. The approximate takeaway, quoting the author: “if you are in the linearly-increasing regime [where return on compute is nontrivial], then you will need about 2× as much compute as your opponent to beat them 2/3 of the time.”

This might suggest that, absent additional asymmetries (like constraints on the aligned AIs massively hampering them), the win ratio may be roughly proportional to the compute ratio. If you assume we can get global data center governance, I’d consider that a sign in favor of the world’s governments. (Whether you think that’s good is a political stance that I believe folks here may disagree on.)

Bonus quote: “This behaviour is strikingly similar to that of a toy model where each player chooses as many random numbers as they have compute, and the player with the highest number wins3. In this toy model, doubling your compute doubles how many random numbers you draw, and the probability that you possess the largest number is 2/3. This suggests that the complex game play of Hex might actually reduce to each agent having a ‘pool’ of strategies proportional to its compute, and whoever picks the better strategy wins. While on the basis of the evidence presented herein we can only consider this to be serendipity, we are keen to see whether the same behaviour holds in other games.”
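For concreteness, here is a minimal simulation of that toy model (a sketch of my own, not code from the paper): each player draws as many uniform random numbers as it has compute, and whoever holds the single highest draw wins. Analytically, P(A wins) = compute_A / (compute_A + compute_B), so a 2:1 compute ratio gives roughly the 2/3 figure quoted above.

```python
import random

def toy_win_probability(compute_a: int, compute_b: int, trials: int = 100_000) -> float:
    """Toy model from the quote: each player draws as many uniform random
    numbers as it has compute, and whoever holds the single highest draw wins."""
    wins_a = 0
    for _ in range(trials):
        best_a = max(random.random() for _ in range(compute_a))
        best_b = max(random.random() for _ in range(compute_b))
        if best_a > best_b:
            wins_a += 1
    return wins_a / trials

# A 2:1 compute advantage wins ~2/3 of the time, and only the ratio matters,
# matching the analytic result compute_a / (compute_a + compute_b).
print(toy_win_probability(2, 1))   # ~0.667
print(toy_win_probability(10, 5))  # ~0.667
```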

Replies from: RobbBB
comment by Rob Bensinger (RobbBB) · 2022-03-07T01:03:33.712Z · LW(p) · GW(p)

Offense is favored over defense because, e.g., one AI can just nuke the other. The asymmetries come from physics, where you can't physically build shields that are more resilient than the strongest shield-destroying tech. Absent new physics, extra intelligence doesn't fundamentally change this dynamic, though it can buy you more time in which to strike first.

(E.g., being smarter may let you think faster, or may let you copy yourself to more locations so it takes more time for nukes or nanobots to hit every copy of you. But it doesn't let you build a wall where you can just hang out on Earth with another superintelligence and not worry about the other superintelligence breaking your wall.)

Replies from: None
comment by [deleted] · 2022-03-07T15:42:41.382Z · LW(p) · GW(p)

I want to push back on your "can't make an unbreakable wall" metaphor. We have an unbreakable wall like that today where two super-powerful beings are just hanging out sharing earth; it's called the survivable nuclear second-strike capability. 

(For clarity, here I'll assume that aligned AGI-cohort A and unaligned AGI-cohort B have both FOOMed and have nanotech.) There isn't obviously an infinite amount of energy available for B to destroy every last trace of A. This is just like how in our world, neither the US nor Russia have enough resources to have certainty that they could destroy all of their opponents' nuclear capabilities in a first strike. If any of the Americans' nuclear capabilities survive a Russian first strike, those remaining American forces' objective switches from "uphold the constitution" to "destroy the enemy no matter the cost, to follow through on tit-for-tat". Humans are notoriously bad at this kind of precommitment-to-revenge-amid-the-ashes-of-civilization, but AGIs/their nanotech can probably be much more credible.

Note the key thing here: once B attempts to destroy A, A is no longer "bound" by the constraints of being an aligned agent. Its objective function switches to being just as ruthless (or moreso) as B, and so raw post-first-strike power/intelligence on each side becomes a much more reasonable predictor of who will win.

If B knows A is playing tit-for-tat, and A has done the rational thing of creating a trillion redundant copies of itself (each of which will also play tit-for-tat) so they couldn't all be eliminated in one strike without prior detection, then B has a clear incentive not to pick a fight it is highly uncertain it can win.

One counterargument you might have: maybe offensive/undetectable nanotech is strictly favored over defensive/detection nanotech. If you assign nontrivial probability to the statement: "it is possible to destroy 100% of a nanotech-wielding defender with absolutely no previously-detectable traces of offensive build-up, even though the defender had huge incentives to invest in detection", then my argument doesn't hold. I'd definitely be interested in your (or others') justification as to why.

Consequences of this line of argument:

  • FOOMing doesn't imply unipolarity - can have multiple AGIs over the long-term, some aligned and some not.
  • Relative resource availability of these AGIs may be a valid proxy for their leverage in the implicit negotiation they'll constantly carry out instead of going to war. (Note that this is actually how great-power conflicts get settled in our current world!)
  • Failure derives primarily from a solo unaligned AGI FOOMing first. Thus, investing in detection capabilities is paramount.
  • This vision of the future also makes the incentives less bad for race-to-AGI actors than the classical "pivotal act" of whoever wins taking over the world and installing a world government. Actors might be less prone to an arms race if they don't think of it as all-or-nothing. This also implies additional coordination strategies, like the US and China agreeing to copy their balance-of-power into the AGI age (e.g. via implicit nuclear-posture negotiation), with the resulting terms including invasive disclosure on each other's AGI projects/alignment strategies.
  • On the other hand, this means that if we have at least one unaligned AGI FOOM, Moloch may follow us to the stars.

Very interested to hear feedback! (/whether I should also put this somewhere else.)

Replies from: RobbBB
comment by Rob Bensinger (RobbBB) · 2022-03-07T18:24:23.230Z · LW(p) · GW(p)

Yeah, I wanted to hear your actual thoughts first, but I considered going into four possible objections:

  1. If there's no way to build a "wall", perhaps you can still ensure a multipolar outcome via the threat of mutually assured destruction.
  2. If MAD isn't quite an option, perhaps you can still ensure a multipolar outcome via "mutually assured severe damage": perhaps both sides would take quite a beating in the conflict, such that they'll prefer to negotiate a truce rather than actually attack each other.
  3. If an AGI wanted to avoid destruction, perhaps it could just flee into space at some appreciable fraction of the speed of light.
  4. In principle, it should be possible to set up MAD, or set up a tripwire that destroys whichever AGI tries to aggress first. E.g., just design the two AGIs yourself, and have a deep enough understanding of their brains that you can stably make them self-destruct as soon as their brain even starts thinking of ways to attack the other AGI (or to self-modify to evade the tripwire, etc.). And since this is possible in principle, perhaps we can achieve a "good enough" version of this in practice.

I don't think MAD is an option. "MAD" in the case of humans really means "Mutually Assured Heavy Loss Of Life Plus Lots Of Infrastructure Damage". MAD in real life doesn't assume that a specific elected official will die in the conflict, much less that all humans will die.

For MAD to work with AGI systems, you'd need to ensure that both AGIs are actually destroyed in arbitrary conflicts, which seems effectively impossible. (Both sides can just launch back-ups of themselves into space.)

With humans, you can bank on the US Government (treated as an agent) having a sentimental attachment to its citizens, such that it doesn't want to trade away tons of lives for power. Also, a bruised and bloodied US Government that just survived an all-out nuclear exchange with Russia would legitimately have to worry about other countries rallying against it in its weakened, bombed-out state.

You can't similarly bank on arbitrary AGIs having a sentimental attachment to anything on Earth (such that they can be held hostage by threats of damage to Earth), nor can you bank on arbitrary AGIs being crippled by conflicts they survive.

Option 2 seems more plausible, but still not very plausible. The amount of resources you can lose in a war on the scale of the Earth is just very small compared to the amount of resources at stake in the conflict. Values handshakes seem more plausible if two mature superintelligences meet in space, after already conquering large parts of the universe; then an all-out war might threaten enough of the universe's resources to make both parties wary of conflict.

I don't know how plausible option 3 is, but it seems like a fail condition regardless: spending the rest of your life fleeing from a colonization wave as fast as possible, with no time to gather resources or expand into your own thriving intergalactic civilization, means giving up nearly all of the future's value and surrendering the cosmic endowment.

4 seems extremely difficult to do, and very strange to even try to do. If you have that much insight into your AGI's cognition, you've presumably solved the alignment problem already and can stop worrying about all these complicated schemes. And long before one AGI could achieve such guarantees about another AGI (much less both achieve those guarantees about each other, somehow simultaneously?!), it would be able to proliferate nanotech to destroy any threats (that haven't fled at near-light-speed, at least).

B has a clear incentive not to pick a fight it is highly uncertain it can win.

I don't expect enough uncertainty for this. If the two sides in a dispute aren't uncertain about who would win, then the stronger side will unilaterally choose to fight (though the weaker side obviously wouldn't).

Replies from: None
comment by [deleted] · 2022-03-07T19:46:13.949Z · LW(p) · GW(p)

Agree that option-1 (literal destruction) is implausible.

Option 2 is much more likely, primarily because who wins the contest is (in my model) sufficiently uncertain that in expectation war would constitute large value destruction even for the winner. In other words, if choosing "war" gives a [30% probability of losing 99% of my utility over the next billion years, and a 70% probability of losing, say, 30% of it, since even victory is costly], whereas choosing peace gives [100% chance of achieving 60% of my utility] (assuming some positive-sum nature from the overlap of respective objective functions), then the agents choose peace.
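Spelling out the expected-utility arithmetic for those illustrative numbers (this is just bookkeeping over the figures above, with total attainable utility normalized to 1):

$$E[U_{\text{war}}] = 0.3 \times 0.01 + 0.7 \times 0.7 = 0.493 \;<\; 0.6 = E[U_{\text{peace}}],$$

so a risk-neutral agent prefers the negotiated outcome. The comparison flips if the winner's expected costs are small and its odds of winning are high, which is why the degree of post-FOOM uncertainty matters so much here.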

But this does depend on the existence of meaningful uncertainty even post-FOOM. What is your reasoning for why uncertainty would be so unlikely? 

Even in boardgames like Go (with a much more constrained strategy-space than reality) it is computationally impossible to consider all possible future opponent strategies, and thus with a near-peer adversary action-values still have high uncertainty. Do you just think that "game theory that allows an AGI to compute general-equilibrium solutions and certify dominant strategies for as-complex-as-AGI-war multi-agent games" is a computationally-tractable thing for an earth-bound AGI? 

If that's a crux, I wonder if we can find some hardness proofs of different games and see what it looks like on simpler environments.

EDIT: consider even the super-simple risk that B tries to destroy A, but A manages to send out a couple near-light-speed probes into the galaxy/nearby galaxies just to inform any other currently-hiding-AGIs about B's historical conduct/untrustworthiness/refusal to live-and-let-live. If an alien-AGI C ever encounters such a probe, it would update towards non-cooperation enough to permanently worsen B-C relations should they ever meet. In this sense, your permanent loss from war becomes certain, if the AGI has ongoing nonzero probability of possibly encountering alien superintelligences.

comment by Algon · 2022-03-06T12:54:18.675Z · LW(p) · GW(p)

When I talk to people who are attempting to model Eliezer, or defer to Eliezer, or speaking out of their own model that's heavily Eliezer-based, and I present this plan to them, and then they start thinking about pivotal acts, they do not say the thing Eliezer says above. I get the sense that they see "pivotal act" as some discrete, powerful, gameboard-flipping action taken at a particular point in time that changes x-risk from non-trivial to trivial, rather than as a referent to the much broader thing of "whatever ends the acute risk period". 

 

I'm confused. I think Eliezer means roughly that (alignment isn't trivialised; we e.g. push back the deadline), and I wouldn't go "ah yes, we're doomed if things go according to Rohin's projections". If alignment does generalise easily, then your plan is fine and we don't need a pivotal act. Now I don't think it will be that easy, but that's a different matter from whether your agenda fails because it doesn't do a gameboard-flipping action.

Replies from: rohinmshah
comment by Rohin Shah (rohinmshah) · 2022-03-06T14:06:39.943Z · LW(p) · GW(p)

I agree that the things I model people as believing are not in fact true. (And that not everyone in the categories I outlined believes those things.) I'm not sure if you're saying anything more than that, if so, can you try saying it again with different words?

Replies from: Algon
comment by Algon · 2022-03-06T15:46:32.583Z · LW(p) · GW(p)

No, I guess not. I just put myself in the class of people "heavily influenced by Eliezer" and thought that I don't model things like that so I wanted to push back a touch. Though I suppose I should have noticed I wasn't being referred to when you said "defers to Eliezer".

comment by Matthew Barnett (matthew-barnett) · 2022-03-02T23:37:23.375Z · LW(p) · GW(p)

This question is not directed at anyone in particular, but I'd want to hear some alignment researchers answer it. As a rough guess, how much would it affect your research—in the sense of changing your priorities, or altering your strategy of impact, and method of attack on the problem—if you made any of the following epistemic updates?

(Feel free to disambiguate anything here that's ambiguous or poorly worded.)

  1. You update to think that AI takeoff will happen twice as slowly as your current best-estimate. e.g. instead of the peak-rate of yearly GWP growth being 50% during the singularity, you learn it's only going to be 25%. Alternatively, update to think that AI takeoff will happen twice as quickly, e.g. the peak-rate of GWP growth will be 100% rather than 50%.
  2. You learn that transformative AI will arrive in half the time you currently think it will take, from your current median, e.g. in 15 years rather than 30. Alternatively, you learn that transformative AI will arrive in twice the time you currently think it will take.
  3. You learn that power differentials will be twice as imbalanced during AI takeoff compared to your current median. That is, you learn that if we could measure the relative levels of "power" for agents in the world, the Gini coefficient for this power distribution will be twice as unequal as in your current median scenario, in the sense that world dynamics will look more unipolar than multipolar; local, rather than global. Alternatively, you learn the opposite.
  4. You learn that the cost of misalignment is half as much as you thought, in the sense that slightly misaligned AI software impose costs that are half as much (ethically, or economically), compared to what you used to think. Alternatively, you learn the opposite.
  5. You learn that the economic cost of aligning an AI to your satisfaction is half as much as you thought, for a given AI, e.g. it will take a team of 20 full-time workers writing test-cases, as opposed to 40 full-time workers of equivalent pay. Alternatively, you learn the opposite.
  6. You learn that the requisite amount of "intelligence" needed to discover a dangerous x-risk inducing technology is half as much as you thought, e.g. someone with an IQ of 300, as opposed to 600 (please interpret charitably) could by themselves figure out how to deploy full-scale molecular nanotechnology, of the type required to surreptitiously inject botulinum toxin into the bloodstreams of everyone, making us all fall over and die. Alternatively, you learn the opposite.
  7. You learn that either government or society in general will be twice as risk-averse, in the sense of reacting twice as strongly to potential AI dangers, compared to what you currently think. Alternatively, you learn the opposite.
  8. You learn that the AI paradigm used during the initial stages of the singularity—when the first AGIs are being created—will be twice as dissimilar from the current AI paradigm, compared to what you currently think. Alternatively, you learn the opposite.
Replies from: rohinmshah
comment by Rohin Shah (rohinmshah) · 2022-03-03T10:28:06.594Z · LW(p) · GW(p)

In all cases, the real answer is "the actual impact will depend a ton on the underlying argument that led to the update; that argument will lead to tons of other updates across the board".

I imagine that the spirit of the questions is that I don't perform a Bayesian update and instead do more of a "causal intervention" on the relevant node and propagate downstream. In that case:

  1. I'm confused by the question. If the peak rate of GWP growth ever is 25%, it seems like the singularity didn't happen? Nonetheless, to the extent this question is about updates on the quality or duration of the singularity (as opposed to the leadup to it), I don't think this affects my actions at all.
  2. I'm often acting based on my 10%-timelines, so if you tell me that TAI comes at exactly the midway point between now and my current median, that can counterintuitively have the same effect as lengthening my timelines. (Also I start sketching out far more concrete plans given this implausibly precise knowledge of when TAI comes.) So I'm instead going to answer the question where we imagine that my entire probability distribution is compressed / stretched along the time axis by a factor of 2. If compressed (shorter timelines), probably not much changes; I care more about having influence on AGI actors but I'm already in a great place for that. If stretched (longer timelines), I maybe focus more on weirder alignment theory, e.g. perhaps I work at ARC.
  3. Not much effect. In more unipolar worlds, I spend more time predicting which labs will develop AGI, so that I can be there at crunch time; in more multipolar worlds I spend less time doing that.
  4. No effect. Averting 50% of an existential catastrophe is still really good.
  5. Similar effects as (2), but with much smaller magnitude. (Lower cost => more focus on weird alignment theory, since influence becomes less useful.)
  6. This one particularly feels like I would be making some big Bayesian update (e.g. I think Eliezer's view predicts this much more strongly?) but if I ignore that then no effect.
  7. Similar effects as (2), similar magnitudes as well. (More risk-averse => getting the right solution becomes more important than having influence.)
  8. More dissimilar => more focus on weirder but more general work (e.g. less language models, more ELK). (This isn't the same as 2, 5, and 7 because this doesn't change how much I care about influence.)
Replies from: matthew-barnett, None
comment by Matthew Barnett (matthew-barnett) · 2022-03-13T16:54:34.167Z · LW(p) · GW(p)

Thanks for your response. :)

I'm confused by the question. If the peak rate of GWP growth ever is 25%, it seems like the singularity didn't happen?

I'm a little confused by your confusion. Let's say you currently think the singularity will proceed at a rate of R. The spirit of what I'm asking is: what would you change if you learned that it will proceed at a rate of one half R? (Maybe plucking specific numbers about the peak-rate of growth just made things more confusing.) For me at least, I'd probably expect a lot more oversight, as people have more time to adjust to what's happening in the world around them.

No effect. Averting 50% of an existential catastrophe is still really good.

I'm also a little confused about this. My exact phrasing was, "You learn that the cost of misalignment is half as much as you thought, in the sense that slightly misaligned AI software impose costs that are half as much (ethically, or economically), compared to what you used to think." I assume you don't think that slightly misaligned software will, by default, cause extinction, especially if it's acting alone and is economically or geographically isolated.

We could perhaps view this through an analogy. War is really bad: so bad that maybe it will even cause our extinction (if, say, we have some really terrible nuclear winter). But by default, I don't expect war to cause humanity to go extinct. And so, if someone asked me about a scenario in which the costs of war are only half as much as I thought, it would probably significantly update me away from thinking we need to take actions to prevent war. The magnitude of this update might not be large, but understanding exactly how much we'd update and change our strategy in light of this information is the type of thing I'm asking for.

Replies from: rohinmshah
comment by Rohin Shah (rohinmshah) · 2022-03-15T13:03:35.964Z · LW(p) · GW(p)

Let's say you currently think the singularity will proceed at a rate of R.

What does this mean? On my understanding, singularities don't proceed at fixed rates?

I agree that in practice there will be some maximum rate of GDP growth, because there are fundamental physical limits (and more tight in-practice limits that we don't know), but it seems like they'll be way higher than 25% per year. Or to put it a different way, at 25% max rate I think it stops deserving the term "singularity", it seems like it takes decades and maybe centuries to reach technological maturity at that rate. (Which could totally happen! Maybe we will move very slowly and cautiously! I don't particularly expect it though.)
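As a rough back-of-the-envelope illustration (taking a 10^6-fold increase in gross world product as an arbitrary stand-in for "technological maturity"; nothing in this thread pins that number down), sustained exponential growth at annual rate r reaches that factor in ln(10^6)/ln(1+r) years:

$$\frac{\ln 10^6}{\ln 1.25} \approx 62 \text{ years}, \qquad \frac{\ln 10^6}{\ln 2} \approx 20 \text{ years}, \qquad \frac{\ln 10^6}{\ln 11} \approx 6 \text{ years}$$

for 25%, 100%, and 1000% growth per year respectively. That's the sense in which a 25% ceiling stretches the transition out over decades rather than years.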

If you actually mean halving the peak rate of GDP growth during the singularity, and a singularity actually happens, then I think it doesn't affect my actions at all; all of the relevant stuff happened well before we get to the peak rate.

If you ask me to imagine "max rates at orders of magnitude where Rohin would say there was no singularity", then I think I pivot my plan for impact into figuring out how exactly we are going to manage to coordinate to move slowly even when there's tons of obvious value lying around, and then trying to use the same techniques to get tons and tons of oversight on the systems we build.

You learn that the cost of misalignment is half as much as you thought, in the sense that slightly misaligned AI software impose costs that are half as much (ethically, or economically), compared to what you used to think.

Hmm, I interpreted "cost of misalignment" as "expected cost of misalignment", as the standard way to deal with probabilistic things, but it sounds like you want something else.

Let's say for purposes of argument I think 10% chance of extinction, and 90% chance of "moderate costs but nothing terrible". Which of the following am I supposed to have updated to?

  1. 5% extinction, 95% moderate costs
  2. 5% extinction, 45% moderate costs, 50% perfect world
  3. 10% extinction, 90% mild costs
  4. 10% outcome-half-as-bad-as-extinction, 90% mild costs
  5. 0% extinction, 100% mild costs

I was imagining (4), but any of (1) - (3) would also not change my actions; it sounds like you want me to imagine (5) but in that case I just completely switch out of AI alignment and work on something else, but that's because you moved p(extinction) from 10% to 0%, which is a wild update to have made and not what I would call "the cost of misalignment is half as much as I thought".

Replies from: matthew-barnett
comment by Matthew Barnett (matthew-barnett) · 2022-03-15T19:10:00.454Z · LW(p) · GW(p)

I think you sufficiently addressed my confusion, so you don't need to reply to this comment, but I still had a few responses to what you said.

What does this mean? On my understanding, singularities don't proceed at fixed rates?

No, I agree. But growth is generally measured over an interval. In the original comment I proposed the interval of one year during the peak rate of economic growth. To allay your concern that a 25% growth rate indicates we didn't experience a singularity, I meant that we were halving the growth rate during the peak economic growth year in our future, regardless of whether that rate was very fast.

I agree that in practice there will be some maximum rate of GDP growth, because there are fundamental physical limits (and more tight in-practice limits that we don't know), but it seems like they'll be way higher than 25% per year.

The 25% figure was totally arbitrary. I didn't mean it as any sort of prediction. I agree that an extrapolation from biological growth implies that we can and should see >1000% growth rates eventually, though it seems plausible that we would coordinate to avoid that.

If you actually mean halving the peak rate of GDP growth during the singularity, and a singularity actually happens, then I think it doesn't affect my actions at all; all of the relevant stuff happened well before we get to the peak rate.

That's reasonable. A separate question might be about whether the rate of growth during the entire duration from now until the peak rate will cut in half.

Let's say for purposes of argument I think 10% chance of extinction, and 90% chance of "moderate costs but nothing terrible". Which of the following am I supposed to have updated to?

I think the way you're bucketing this into "costs if we go extinct" and "costs if we don't go extinct" is reasonable. But one could also think that the disvalue of extinction is more continuous with disvalue in non-extinction scenarios, which makes things a bit more tricky. I hope that makes sense.

Replies from: rohinmshah
comment by Rohin Shah (rohinmshah) · 2022-03-18T14:50:18.343Z · LW(p) · GW(p)

Cool, that all makes sense.

But one could also think that the disvalue of extinction is more continuous with disvalue in non-extinction scenarios, which makes things a bit more tricky.

I'm happy to use continuous notions (and that's what I was doing in my original comment) as long as "half the cost" means "you update such that the expected costs of misalignment according to your probability distribution over the future are halved". One simple way to imagine this update is to take all the worlds where there was any misalignment, halve their probability, and distribute the extra probability mass to worlds with zero costs of misalignment. At which point I reason "well, 10% extinction changes to 5% extinction, I don't need to know anything else to know that I'm still going to work on alignment, and given that, none of my actions are going to change (since the relative probabilities of different misalignment failure scenarios remain the same, which is what determines my actions within alignment)".

I got the sense from your previous comment that you wanted me to imagine some different form of update and I was trying to figure out what.

comment by [deleted] · 2022-03-09T16:25:54.045Z · LW(p) · GW(p)

I'm often acting based on my 10%-timelines

Good to hear [EA · GW]! What are your 10% timelines?

Replies from: rohinmshah
comment by Rohin Shah (rohinmshah) · 2022-03-10T01:28:53.947Z · LW(p) · GW(p)

Idk, maybe 2030 for x-risky systems?

comment by Karl von Wendt · 2022-03-01T13:13:51.419Z · LW(p) · GW(p)

A question for Eliezer: If you were superintelligent, would you destroy the world? If not, why not?

If your answer is "yes" and the same would be true for me and everyone else for some reason I don't understand, then we're probably doomed. If it is "no" (or even just "maybe"), then there must be something about the way we humans think that would prevent world destruction even if one of us were ultra-powerful. If we can understand that and transfer it to an AGI, we should be able to prevent destruction, right?

Replies from: Eliezer_Yudkowsky, Vaniver, AprilSR
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2022-03-01T17:50:14.914Z · LW(p) · GW(p)

I would "destroy the world" from the perspective of natural selection in the sense that I would transform it in many ways, none of which were making lots of copies of my DNA, or the information in it, or even having tons of kids half resembling my old biological self.

From the perspective of my highly similar fellow humans with whom I evolved in context, they'd get nice stuff, because "my fellow humans get nice stuff" happens to be the weird unpredictable desire that I ended up with at the equilibrium of reflection on the weird unpredictable godshatter that ended up inside me, as the result of my being strictly outer-optimized over millions of generations for inclusive genetic fitness, which I now don't care about at all.

Paperclip-numbers do well out of paperclip-number maximization. The hapless outer creators of the thing that weirdly ends up a paperclip maximizer, not so much.

Replies from: Karl von Wendt
comment by Karl von Wendt · 2022-03-01T18:13:44.406Z · LW(p) · GW(p)

"my fellow humans get nice stuff" happens to be the weird unpredictable desire that I ended up with at the equilibrium of reflection on the weird unpredictable godshatter that ended up inside me

This may not be what evolution had "in mind" when it created us. But couldn't we copy something like this into a machine so that it "thinks" of us (and our descendants) as its "fellow humans" who should "get nice stuff"? I understand that we don't know how to do that yet. But the fact that Eliezer has some kind of "don't destroy the world from a fellow human perspective" goal function inside his brain seems to mean a) that such a function exists and b) that it can be coded into a neuronal network, right?

I was also thinking about the specific way we humans weigh competing goals and values against each other. So while for instance we do destroy much of the biosphere by blindly pursuing our misaligned goals, some of us still care about nature and animal welfare and rain forests, and we may even be able to prevent total destruction of them. 

Replies from: ESRogs
comment by ESRogs · 2022-03-01T18:49:20.503Z · LW(p) · GW(p)

I think we (mostly) all agree that we want to somehow encode human values into AGIs. That's not a new idea. The devil is in the details.

Replies from: Karl von Wendt, Karl von Wendt
comment by Karl von Wendt · 2022-03-02T08:02:03.914Z · LW(p) · GW(p)

I see how my above question seems naive. Maybe it is. But if one potential answer to the alignment problem lies in the way our brains work, maybe we should try to understand that better, instead of (or in addition to) letting a machine figure it out for us through some kind of "value learning". (Copied from my answer to AprilSR:) I stumbled across two papers from a few years ago by a psychologist, Mark Muraven, who thinks that the way humans deal with conflicting goals could be important for AI alignment (https://arxiv.org/abs/1701.01487 and https://arxiv.org/abs/1703.06354). They appear a bit shallow to me and don't contain any specific ideas on how to implement this. But maybe Muraven has a point here.

Replies from: TurnTrout, ESRogs
comment by TurnTrout · 2022-07-14T02:32:40.345Z · LW(p) · GW(p)

I think your question is excellent. "How does the single existing kind of generally intelligent agent form its values?" is one of the most important and neglected questions in all of alignment [LW · GW], I think. 

comment by ESRogs · 2022-03-02T19:24:02.358Z · LW(p) · GW(p)

But if one potential answer to the alignment problem lies in the way our brains work, maybe we should try to understand that better, instead of (or in addition to) letting a machine figure it out for us through some kind of "value learning".

Ah, I see. You might be interested in this sequence [? · GW] then!

Replies from: Karl von Wendt
comment by Karl von Wendt · 2022-03-03T05:47:33.015Z · LW(p) · GW(p)

Yes, thank you!

comment by Karl von Wendt · 2022-03-01T19:02:47.878Z · LW(p) · GW(p)

Yes. But my impression so far is that anything we can even imagine in terms of a goal function will go badly wrong somehow. So I find it a bit reassuring that at least one such function that will not necessarily lead to doom seems to exist, even if we don't know how to encode it yet.

comment by Vaniver · 2022-03-01T19:50:13.099Z · LW(p) · GW(p)

I guess there's some meta-level question here that I'm interested in, as a sort of elaboration, which is something like: how do you go about balancing which meta-levels of the world to satisfy and which to destroy? [I kind of have a sense that Eliezer's answer can be guessed as an extension of the meta-ethics sequence, and so am interested both in his actual answer and other people's answers.]

For example, one might imagine a mostly-upload situation like The Metamorphosis of Prime Intellect / Friendship is Optimal / Second Life / etc., wherein everyone gets a materially abundant digital life in their shard of the metaverse, with communication heavily constrained (if nothing else, by requiring mutual consent). This, of course, discards as no-longer-relevant entities that exist on higher meta-levels; nations will be mostly irrelevant in such a world, companies will mostly stop existing, and so on.

But one could also apply the same logic a level lower. If you take Internal Family Systems / mental modules seriously, humans don't look like atomic objects, they look like a collection of simpler subagents balanced together in a sort of precarious way. (One part of you wants to accumulate lots of fat to survive the winter, another part of you wants to not accumulate lots of fat to look attractive to mates, the thing the 'human' is doing is balancing between those parts.) And so you can imagine a superintelligent system out to do right by the mental modules 'splitting them apart' in order to satisfy them separately, with one part swimming in a vat of glucose and the other inhabiting a beautiful statue, and discarding the 'balancing between the parts' system as no-longer-relevant.

Of course, applying this logic a level higher--the things to preserve are communities/nations/corporations/etc.--seems like it can quite easily be terrible for the people involved, and feels like it's preserving problems in order to maintain the relevance of traditional solutions.

comment by AprilSR · 2022-03-01T17:49:54.485Z · LW(p) · GW(p)

My nonexpert guess would be that the "superintelligent brain emulation" class of solutions has potential to work. But we'd still need to figure out how to prevent an AGI from being made until we're ready to actually implement that solution.

Replies from: Karl von Wendt
comment by Karl von Wendt · 2022-03-01T18:16:00.209Z · LW(p) · GW(p)

My hope was that maybe we can recreate the way we humans make beneficial decisions for fellow beings without simulating a complete brain. But I agree that AGI might be built before we have solved this.

Replies from: AprilSR
comment by AprilSR · 2022-03-01T20:37:53.431Z · LW(p) · GW(p)

I think doing that via, like, reinforcement learning, is well-established as a possible strategy and discarded because it probably won't generate the properties we want.

Maybe we could solve legibility enough to extract the value-assessing part out of a human brain, and then put it on a computer? This doesn't strike me as a solution but it might be a useful idea.

Replies from: Karl von Wendt
comment by Karl von Wendt · 2022-03-02T07:50:24.283Z · LW(p) · GW(p)

I was thinking more about the way psychologists try to understand the way we make decisions. I stumbled across two papers from a few years ago by such a psychologist, Mark Muraven, who thinks that the way humans deal with conflicting goals could be important for AI alignment (https://arxiv.org/abs/1701.01487 and https://arxiv.org/abs/1703.06354). They appear a bit shallow to me and don't contain any specific ideas on how to implement this. But maybe Muraven has a point here. Maybe we should put more effort into understanding the way we humans deal with goals, instead of letting an AI figure it out for itself through RL or IRL.

comment by ClipMonger · 2022-12-02T09:21:01.121Z · LW(p) · GW(p)

Will there be one of these for 2022?

Replies from: rohinmshah
comment by Rohin Shah (rohinmshah) · 2022-12-02T12:33:30.004Z · LW(p) · GW(p)

If you post questions here, there's a decent chance I'll respond, though I'm not promising to.

comment by Signer · 2022-03-02T09:37:22.904Z · LW(p) · GW(p)

It was all very interesting, but what was the goal of these discussions? I mean, I had the impression that pretty much everyone assigned >5% probability to "if we scale we all die", so that's already enough reason to work on global coordination on safety. Is the reasoning that the same mental process that assigned too low a probability would not be able to come up with an actual solution? Or something like "at the time they think their solution reduced the probability of failure from 5% to 0.1%, it would still be much higher"? That seems possible only if people don't understand arguments about inner optimisers or what not, as opposed to disagreeing with them.

Replies from: rohinmshah, RobbBB, None
comment by Rohin Shah (rohinmshah) · 2022-03-02T19:18:26.824Z · LW(p) · GW(p)

Changing one's mind on P(doom) can be useful for people comparing across cause areas (e.g. Open Phil), but it's not all that important for me and was not one of my goals.

Generally when people have big disagreements about some high-level question like P(doom), it means that they have very different underlying models that drive their reasoning within that domain. The main goal (for me) is to acquire underlying models that I can then use in the future.

Acquiring a new underlying model that I actually believe would probably be more important than the rest of my work in a full year combined. It would typically have significant implications on what sorts of proposals can and cannot work, and would influence what research I do for years to come. In the case of Eliezer's model specifically, it would completely change what research I do, since Eliezer's model specifically predicts that the research I do is useless (I think).

I didn't particularly expect to actually acquire a new model that I believed from these conversations, but there was some probability of that, and I did expect that I would learn at least a few new things I hadn't previously considered. I'm unfortunately quite bad at noticing my own "updates", so I can't easily point to examples. That being said, I'm confident that I would now be significantly better at Eliezer's ITT than before the conversations.

comment by Rob Bensinger (RobbBB) · 2022-03-03T03:17:21.249Z · LW(p) · GW(p)

I mean I had an impression that pretty much everyone assigned >5% probability to "if we scale we all die" so it's already enough reason to work on global coordination on safety.

What specific actions do you have in mind when you say "global coordination on safety", and how much of the problem do you think these actions solve?

My own view is that 'caring about AI x-risk at all' is a pretty small (albeit indispensable) step. There are lots of decisions that hinge on things other than 'is AGI risky at all'.

I agree with Rohin that the useful thing is trying to understand each other's overall models of the world and try to converge on them, not p(doom) per se. I gave some examples here [LW(p) · GW(p)] of some important implications of having more Paul-ish models versus more Eliezer-ish models.

More broadly, examples of important questions people in the field seem to disagree a lot about:

  • Alignment
    • How hard is alignment? What are the central obstacles? What kind of difficulty is it? (Is it hard like 'building a secure OS that works on the first try'? Hard like 'the engineering/logistics/implementation portion of the Manhattan Project'? Both? Some other option? Etc.)
    • What alignment research directions are potentially useful, and what plans for developing aligned AGI have a chance of working?
  • Deployment
    • What should the first AGI systems be aligned to do?
    • To what extent should we be thinking of "large disruptive act that upends the gameboard", versus "slow moderate roll-out of regulations and agreements across a few large actors"?
  • Information spread
    • How important is research closure and opsec for capabilities-synergistic ideas? (Now, later, in the endgame, etc.)
  • Path to AGI
    • Is AGI just "current SotA systems like GPT-3, but scaled up", or are we missing key insights?
    • More broadly, what's the relationship between current approaches and AGI?
    • How software- and/or hardware-bottlenecked are we on AGI?
    • How compute- and/or data-efficient will AGI systems be?
    • How far off is AGI? How possible is it to time future tech developments? How continuous is progress likely to be?
    • How likely is it that AGI is in-paradigm for deep learning?
    • If AGI comes from a new paradigm, how likely is it that it arises late in the paradigm (when the relevant approach is deployed at scale in large corporations) versus early (when a few fringe people are playing with the idea)?
    • Should we expect warning shots? Would warning shots make a difference, and if so, would they be helpful or harmful?
    • To what extent are there meaningfully different paths to AGI, versus just one path? How possible (and how desirable) is it to change which path humanity follows to get to AGI?
  • Actors
    • How likely is it that AGI is first developed by a large established org, versus a small startup-y org, versus an academic group, versus a government?
    • How likely is it that governments play a role at all? What role would be desirable, if any? How tractable is it to try to get governments to play a good role (rather than a bad role), and/or to try to get governments to play a role at all (rather than no role)?
Replies from: Signer
comment by Signer · 2022-03-03T11:14:15.035Z · LW(p) · GW(p)

Specific actions like not scaling systems with a 5% probability of catastrophe if they have control over it, and explaining to everyone else why they shouldn't do it either. It's just that my first reaction is that indispensable steps should be a priority. And so even though reconciliation of models is certainly useful for a future solution, it seemed to me less cost effective than spreading a less pessimistic model, for example. Again, this is just an initial feeling and I can come up with scenarios where it makes sense to focus on model convergence, but I am not sure how you are weighting these scenarios. Is it that just making everyone think like Paul is impossible, or that a civilisation of Pauls would end anyway, or are you already trying to spread awareness via other channels and this discussion was supposed to be solution-focused... I guess at least the last is true, because of https://www.lesswrong.com/posts/CpvyhFy9WvCNsifkY/discussion-with-eliezer-yudkowsky-on-agi-interventions [LW · GW], but then this discussion felt like too much about P(doom). My guess is it's something like "models that assign wrong probabilities may not destroy the world themselves, but would be too slow to solve alignment before someone creates AGI on a desktop"? And so discussing models is not much less useful, because all known actions are unlikely to help. But I would like to hear what the plan is/was anyway.

comment by [deleted] · 2022-03-02T11:48:46.922Z · LW(p) · GW(p)
comment by Yitz (yitz) · 2022-03-01T20:59:10.693Z · LW(p) · GW(p)

Am I correct in assuming that your baseline belief right now is that alignment will not be solved before the first AGI is created? As a tangentially-related question, do you believe there is any significant likelihood that we could create a “semi-aligned” AGI (which would optimize for “less bad,” but still potentially dystopian futures) more easily than solving for full alignment? If so, how much energy should we be putting into exploring that possibility space? (Latter question adapted from the discussion around https://www.lesswrong.com/posts/wRq6cwtHpXB9zF9gj/better-a-brave-new-world-than-a-dead-one [LW · GW])

Replies from: Eliezer_Yudkowsky
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2022-03-02T04:43:17.041Z · LW(p) · GW(p)

Nope.  It's just as hard as full alignment, and harder than aligning on some more limited pivotal task.  This is Sacrifice to the Gods; you imagine accepting some big downside, but the big downside doesn't actually buy you anything.

Replies from: yitz, rank-biserial
comment by Yitz (yitz) · 2022-03-02T12:49:35.257Z · LW(p) · GW(p)

What do you mean by a “more limited pivotal task”? Trying to align an AGI towards “be nice to humans in this complex manner we have trouble defining ourselves” seems more limited than “merely” aligning for a semi-static dystopia which I would imagine occupies more phase space in the range of possible future worlds.

comment by rank-biserial · 2022-03-02T07:51:15.762Z · LW(p) · GW(p)

If you're sure alignment won't work....

(ctrl+f "172")

Replies from: Eliezer_Yudkowsky, None
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2022-03-02T08:47:00.688Z · LW(p) · GW(p)

Also futile. Sacrificing your ethics doesn't necessarily buy you anything just because you feel like you paid extra.

comment by [deleted] · 2022-03-02T11:45:34.819Z · LW(p) · GW(p)

Replies from: matthew-barnett
comment by Matthew Barnett (matthew-barnett) · 2022-03-02T20:55:42.020Z · LW(p) · GW(p)

I'm a little worried about being downvoted due to this being off-topic, but I have two things to share on this question. First, the philosopher Ole Martin Moen wrote a paper responding to claims in Ted Kaczynski's famous manifesto, citing Nick Bostrom and the arguments for taking existential risk seriously. Since this paper may be the only serious academic response to Kaczynski's manifesto, I'd bet that Kaczynski has read it. Second, Kaczynski has been diagnosed with terminal cancer, and will probably soon die.

comment by Zack_M_Davis · 2022-03-01T04:28:49.078Z · LW(p) · GW(p)

likeliest to be answering questions is Wednesday May 2

(Presumably, Wednesday March 2?)

Replies from: RobbBB
comment by Rob Bensinger (RobbBB) · 2022-03-01T04:36:21.503Z · LW(p) · GW(p)

Yep, fixed!

comment by brb243 · 2022-03-02T18:52:35.854Z · LW(p) · GW(p)

During this weekend's SERI Conference, as I understood it, Paul Christiano specified that his work focuses on preventing AI from disempowering humans and sets aside externalities. Whose work focuses on understanding those externalities, such as the wellbeing and freedom experienced by humans and other sentient beings, including AIs and animals? Is it possible to safely employ the AI with the best total externalities, measured across time under the veil of ignorance? Is it necessary that overall beneficial systems be developed before AGI exists, so that it does not make decisions unfavorable to some entities? Or should we instead strive for an overall favorable situation with an AGI safely governed by humans, since otherwise dystopic scenarios could occur for some individuals?

Replies from: rohinmshah
comment by Rohin Shah (rohinmshah) · 2022-03-02T19:26:50.567Z · LW(p) · GW(p)

A bunch of people work on a bunch of different aspects of AI. You might be interested in AI governance and/or FATE* research (Fairness, Accountability, Transparency, Ethics). You can see some examples of the former in this spreadsheet under the AI governance category.

I don't know of anyone who works specifically on the questions you listed (but as mentioned above there are many who work on subquestions).

Replies from: brb243
comment by brb243 · 2022-03-03T19:53:31.951Z · LW(p) · GW(p)

Thank you so much!

comment by Sam Clarke · 2022-03-02T14:33:33.857Z · LW(p) · GW(p)

One argument for alignment difficulty is that corrigibility is "anti-natural" in a certain sense. I've tried to write out my understanding of this argument [LW(p) · GW(p)], and would be curious if anyone could add or improve anything about it.

I'd be equally interested in any attempts at succinctly stating other arguments for/against alignment difficulty.

comment by [deleted] · 2022-03-09T15:46:09.867Z · LW(p) · GW(p)

1. Year with 10% chance of AGI?
2. P(doom|AGI in that year)?

Replies from: rohinmshah
comment by seed · 2022-03-03T03:26:45.348Z · LW(p) · GW(p)

Will MIRI want to hire programmers once the pandemic is over? What kind of programmers? What other kinds of people do you seek to hire?

Replies from: RobbBB
comment by Rob Bensinger (RobbBB) · 2022-03-04T07:57:42.625Z · LW(p) · GW(p)

For practical purposes, I'd say the pandemic is already over. MIRI isn't doing much hiring, though it's doing a little. The two big things we feel bottlenecked on are:

  • (1) people who can generate promising new alignment ideas. (By far the top priority, but seems empirically rare.)
  • (2) competent executives who are unusually good at understanding the kinds of things MIRI is trying to do, and who can run their own large alignment projects mostly-independently.

For 2, I think the best way to get hired by MIRI is to prove your abilities via the Visible Thoughts Project [LW · GW]. The post there says a bit more about the kind of skills we're looking for:

Eliezer has a handful of ideas that seem to me worth pursuing, but for all of them to be pursued, we need people who can not only lead those projects themselves, but who can understand the hope-containing heart of the idea with relatively little Eliezer-interaction, and develop a vision around it that retains the shred of hope and doesn’t require constant interaction and course-correction on our part. (This is, as far as I can tell, a version of the Hard Problem of finding good founders, but with an additional constraint of filtering for people who have affinity for a particular project, rather than people who have affinity for some project of their own devising.)

For 1, I suggest initially posting your research ideas to LessWrong, in line with John Wentworth's advice [LW · GW]. New ideas and approaches are desperately needed, and we would consider it crazy to not fund anyone whose ideas or ways-of-thinking-about-the-problem we think have a shred of hope in them. We may fund them via working at MIRI, or via putting them in touch with external funders; the important thing is just that the research happens.

If you want to work on alignment but you don't fall under category 1 or 2, you might consider applying to work at Redwood Research (https://www.redwoodresearch.org/jobs), which is a group doing alignment research we like. They're much more hungry for engineers right now than we are.

Replies from: ViktoriaMalyasova
comment by ViktoriaMalyasova · 2022-04-25T19:09:56.608Z · LW(p) · GW(p)

Thank you. Did you know that the software engineer job posting is still accessible on your website, from the https://intelligence.org/research-guide/ page, though not from the https://intelligence.org/get-involved/#careers page? And your careers page says the pandemic is still on.

Replies from: RobbBB
comment by Rob Bensinger (RobbBB) · 2022-06-01T07:17:27.086Z · LW(p) · GW(p)

I did not! Thanks for letting me know. :) Those pages are updated now.

comment by Signer · 2022-03-02T09:15:59.630Z · LW(p) · GW(p)

So, about that "any future details would make me update in one direction, so I may as well update now" move: I think it would be helpful to have a description of how it can possibly be a correct thing to do at all from Bayesian standpoint. Like, is situation supposed to be that you already have a general mechanism generating these details and just don't realise it? But then you need reasons to believe that general mechanism. Or is it just "I did such update bunch of times and it usually worked"? Or what?

Replies from: RobbBB, lhc
comment by Rob Bensinger (RobbBB) · 2022-03-02T13:01:01.451Z · LW(p) · GW(p)

Suppose there are four possibilities you're uncertain between: A, B, C, and D. A implies X, B implies X, C implies X, and D implies X. So if A/B/C/D genuinely exhaust the options, then we can conclude X without needing to know which of the four is correct.

A simple concrete example might be: 'When I'm in wishful-thinking Abstract Mode, I imagine that one of my siblings will take out the trash today. But when I actually think about each individual sibling, I remember that Alice is sleeping over at a friend's house today, and Bob never takes out the trash, and Carol is really busy with her class project, and...'
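
Spelled out as a worked instance of the law of total probability (a minimal formalization, assuming A, B, C, and D are mutually exclusive and jointly exhaustive, and each implies X):

$$P(X) \;=\; \sum_{H \in \{A, B, C, D\}} P(X \mid H)\, P(H) \;=\; P(A) + P(B) + P(C) + P(D) \;=\; 1,$$

so you can be certain of X even while staying uncertain about which of A, B, C, or D actually holds.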

Replies from: Signer
comment by Signer · 2022-03-02T13:43:21.677Z · LW(p) · GW(p)

Thanks. Then I think what people have trouble with is constraining the space of options, not the update itself, and I don't see the point of framing it like this.

comment by lhc · 2022-03-03T04:31:28.672Z · LW(p) · GW(p)

I wrote a quick draft about this a month ago, working through the simple math of this kind of update. I've just put it up [LW · GW].

Replies from: Signer
comment by Signer · 2022-03-03T05:48:18.590Z · LW(p) · GW(p)

Awesome!

comment by Ofer (ofer) · 2022-04-04T16:31:44.737Z · LW(p) · GW(p)

I'm late to the party by a month, but I'm interested in your take (especially Rohin's) on the following:

Conditional on an existential catastrophe happening due to AI systems, what is your credence that the catastrophe will occur only after the involved systems are deployed?

Replies from: rohinmshah
comment by Rohin Shah (rohinmshah) · 2022-04-05T11:22:27.806Z · LW(p) · GW(p)

Idk, 95%? Probably I should push that down a bit because I haven't thought about it very hard.

It's a bit fuzzy what "deployed" means, but for now I'm going to assume that we mean that we put inputs into the AI system for the primary purpose of getting useful outputs, rather than for seeing what the AI did so that we can make it better.

Any existential catastrophe that didn't involve a failure of alignment seems like it had to involve a deployed system.

For failures of alignment, I'd expect that before you get an AI system that can break out of the training process and kill you, you get an AI system that can break out of deployment and kill you, because there's (probably) less monitoring during deployment. You're also just running much longer during deployment -- if an AI system is waiting for the right opportunity, then even if it is equally likely to happen for a training vs deployment input (i.e. ignoring the greater monitoring during training), you'd still expect to see it happen at deployment since >99% of the inputs happen at deployment.
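
As a rough back-of-the-envelope version of that last point (under the simplifying assumptions that monitoring is ignored and the triggering input is equally likely to be any single input): with $N_{\text{train}}$ training inputs and $N_{\text{deploy}}$ deployment inputs,

$$P(\text{the opportunity arises at deployment}) \approx \frac{N_{\text{deploy}}}{N_{\text{train}} + N_{\text{deploy}}} > 0.99 \quad \text{whenever } N_{\text{deploy}} > 99 \cdot N_{\text{train}}.$$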

comment by Ben Livengood (ben-livengood) · 2022-03-03T01:05:22.854Z · LW(p) · GW(p)

I have a question for the folks who think AGI alignment is achievable in the near term in small steps or by limiting AGI behavior to make it safe. How hard will it be to achieve alignment for simple organisms as a proof of concept for human value alignment? How hard would it be to put effective limits or guardrails on the resulting AGI if we let the organisms interact directly with the AGI while still preserving their values? Imagine a setup where interactions by the organism must be interpreted as requests for food, shelter, entertainment, uplift, etc. and where not responding at all is also a failure of alignment because the tool is useless to the organism.

Consider a planaria with relatively simple behaviors and well-known neural structure. What protocols or tests can be used to demonstrate that an AGI makes decisions aligned with planaria values?

Do we need to go simpler and achieve proof-of-concept alignment with virtual life? Can we prove glider alignment by demonstrating an optimization process that will generate a Game of Life starting position where the inferred values of gliders are respected and fulfilled throughout the evolution of the game? This isn't a straw man; a calculus for values has to handle the edge-cases too. There may be a very simple answer of moral indifference in the case of gliders but I want to be shown why the reasoning is coherent when the same calculus will be applied to other organisms.

As an important aside, will these procedures essentially reverse-engineer values by subjecting organisms to every possible input, seeing how they respond, and trying to interpret those responses, or is there truly a calculus of values we expect to discover that correctly infers values from the nature of organisms without using/simulating torture?

I have no concrete idea how to accomplish the preceding things and don't expect that anyone else does either. Maybe I'll be pleasantly surprised.

Barring this kind of fundamental accomplishment for alignment, I think it's foolhardy to assume ML procedures will be found to convert human values into AGI optimization goals. We can't ask planaria or gliders what they value, so we will have to reason it out from first principles, and an AGI will have to do the same for us with very limited help from us if we can't even align for planaria. Claiming that planaria or gliders don't have values, or that they are not complex enough to effectively communicate their values, are both cop-outs. From the perspective of an AGI, we humans will be just as inscrutable, if not more so. If values are not unambiguously well-defined for gliders or planaria, then what hope do we have of stumbling onto well-defined human values at the granularity of AGI optimization processes? In the best case I can imagine a distribution of value calculi with different answers for these simple organisms but almost identical answers for more complex organisms, but if we don't get that kind of convergence, we had better be able to rigorously tell the difference before we send an AGI hunting in that space for one to apply to us.

Replies from: rohinmshah, Charlie Steiner
comment by Rohin Shah (rohinmshah) · 2022-03-03T10:45:24.042Z · LW(p) · GW(p)

Can we prove glider alignment by demonstrating an optimization process that will generate a Game of Life starting position where the inferred values of gliders are respected and fulfilled throughout the evolution of the game?

Here's a scheme for glider alignment. Train your AGI using deep RL in an episodic Game of Life. At each step, it gets +1 reward every time the glider moves forward, and -1 reward if not (i.e. it was disrupted somehow).
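
To make the shape of this concrete, here is a minimal sketch of such a reward signal (an illustration only, not anyone's actual setup; it assumes a toroidal Game of Life grid stored as a 0/1 NumPy array and a single glider travelling toward increasing row and column indices):

```python
import numpy as np

def step_life(grid: np.ndarray) -> np.ndarray:
    """One synchronous Game of Life update on a toroidal (wrap-around) grid."""
    neighbors = sum(
        np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
        for dy in (-1, 0, 1) for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    return ((neighbors == 3) | ((grid == 1) & (neighbors == 2))).astype(grid.dtype)

def centroid(grid: np.ndarray):
    """Centroid of live cells, or None if the pattern has died out."""
    ys, xs = np.nonzero(grid)
    if len(xs) == 0:
        return None
    return float(ys.mean()), float(xs.mean())

def glider_reward(prev_grid: np.ndarray, next_grid: np.ndarray) -> int:
    """+1 if the presumed glider advanced along its diagonal this step, else -1.

    Naive on purpose: ignores centroid wrap-around at the grid edges.
    """
    a, b = centroid(prev_grid), centroid(next_grid)
    if a is None or b is None:
        return -1  # pattern disrupted or destroyed
    advanced = (b[0] >= a[0]) and (b[1] >= a[1]) and (b != a)
    return 1 if advanced else -1
```

An episodic RL loop would then let the agent edit some cells each step, call step_life, and feed glider_reward back as the training signal. The relevant feature is that the reward is specified entirely by us; the glider cannot specify it itself.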

I assume you are not happy with this scheme because we arbitrarily picked out values for gliders?

Good news! While gliders can't provide the rewards to the AGI (in order to define their values themselves), humans can provide rewards to the AI systems they are training (in order to define their values themselves). This is the basic (relevant) difference between humans and gliders, and is why I think primarily about alignment to humans and not alignment to simpler systems.

(This doesn't get into potential problems with inner alignment and informed oversight but I don't think those change the basic point.)

comment by Charlie Steiner · 2022-03-07T15:28:33.597Z · LW(p) · GW(p)

Hm, upon more thought I actually kind of endorse this as a demo. I think we should be able to run an alignment scheme on C. elegans and get out a universe full of well-fed worms, and that's a decent sign that we didn't screw up, despite the fact that it doesn't engage with several key problems that arise in humans because we're more complicated, have preferences about our preferences, etc. No weird worm-stimulation should be needed. But we do have to accept that we're not getting some notion of values independent of an act of interpretation.

comment by [deleted] · 2022-03-02T07:58:01.573Z · LW(p) · GW(p)

Replies from: rohinmshah
comment by Rohin Shah (rohinmshah) · 2022-03-02T21:20:35.331Z · LW(p) · GW(p)

I feel a bit confused by these questions, but to take a stab at them anyway:

Are there projects (or even megaprojects [EA · GW]) that we can undertake that would increase the probability of a noticeable GDP impact before we run AI in dangerous regimes?

Sure. Encourage people to apply the best current AI techniques in actual applications. This leads to more applications before the dangerous regime. It also plausibly hastens the dangerous regime (because the dangerous thing gets deployed as an application more quickly).

Does getting a noticeable GDP impact before we run AI in dangerous regimes help to reduce the actual likelihood of us running AI in a dangerous regime?

One useful effect is that it increases the probability of a warning shot, and the severity of the warning shots that we do get. One harmful effect (of the action I suggested, not necessarily of the overall goal) is that it normalizes "deploy AI as fast as possible", including risky / dangerous AI.

I'd guess that the harms exceed the benefits.

Replies from: None
comment by [deleted] · 2022-03-03T04:29:28.118Z · LW(p) · GW(p)

Replies from: rohinmshah
comment by Rohin Shah (rohinmshah) · 2022-03-03T10:47:34.329Z · LW(p) · GW(p)

Yeah, I think you could do that, though I'd guess it isn't as efficient as encouraging other people to do this (that has significantly more leverage, since you're affecting many companies, not just one you make yourself).

Replies from: None
comment by [deleted] · 2022-03-03T15:15:26.436Z · LW(p) · GW(p)