rohinmshah's Shortform 2020-01-18T23:21:02.302Z · score: 7 (1 votes)
[AN #82]: How OpenAI Five distributed their training computation 2020-01-15T18:20:01.270Z · score: 19 (5 votes)
[AN #81]: Universality as a potential solution to conceptual difficulties in intent alignment 2020-01-08T18:00:01.566Z · score: 22 (8 votes)
[AN #80]: Why AI risk might be solved without additional intervention from longtermists 2020-01-02T18:20:01.686Z · score: 33 (15 votes)
[AN #79]: Recursive reward modeling as an alignment technique integrated with deep RL 2020-01-01T18:00:01.839Z · score: 12 (5 votes)
[AN #78] Formalizing power and instrumental convergence, and the end-of-year AI safety charity comparison 2019-12-26T01:10:01.626Z · score: 26 (7 votes)
[AN #77]: Double descent: a unification of statistical theory and modern ML practice 2019-12-18T18:30:01.862Z · score: 21 (8 votes)
[AN #76]: How dataset size affects robustness, and benchmarking safe exploration by measuring constraint violations 2019-12-04T18:10:01.739Z · score: 14 (6 votes)
[AN #75]: Solving Atari and Go with learned game models, and thoughts from a MIRI employee 2019-11-27T18:10:01.332Z · score: 39 (11 votes)
[AN #74]: Separating beneficial AI into competence, alignment, and coping with impacts 2019-11-20T18:20:01.647Z · score: 19 (7 votes)
[AN #73]: Detecting catastrophic failures by learning how agents tend to break 2019-11-13T18:10:01.544Z · score: 11 (4 votes)
[AN #72]: Alignment, robustness, methodology, and system building as research priorities for AI safety 2019-11-06T18:10:01.604Z · score: 28 (7 votes)
[AN #71]: Avoiding reward tampering through current-RF optimization 2019-10-30T17:10:02.211Z · score: 13 (5 votes)
[AN #70]: Agents that help humans who are still learning about their own preferences 2019-10-23T17:10:02.102Z · score: 18 (6 votes)
Human-AI Collaboration 2019-10-22T06:32:20.910Z · score: 39 (13 votes)
[AN #69] Stuart Russell's new book on why we need to replace the standard model of AI 2019-10-19T00:30:01.642Z · score: 64 (21 votes)
[AN #68]: The attainable utility theory of impact 2019-10-14T17:00:01.424Z · score: 19 (5 votes)
[AN #67]: Creating environments in which to study inner alignment failures 2019-10-07T17:10:01.269Z · score: 17 (6 votes)
[AN #66]: Decomposing robustness into capability robustness and alignment robustness 2019-09-30T18:00:02.887Z · score: 12 (6 votes)
[AN #65]: Learning useful skills by watching humans “play” 2019-09-23T17:30:01.539Z · score: 12 (4 votes)
[AN #64]: Using Deep RL and Reward Uncertainty to Incentivize Preference Learning 2019-09-16T17:10:02.103Z · score: 11 (5 votes)
[AN #63] How architecture search, meta learning, and environment design could lead to general intelligence 2019-09-10T19:10:01.174Z · score: 24 (8 votes)
[AN #62] Are adversarial examples caused by real but imperceptible features? 2019-08-22T17:10:01.959Z · score: 28 (11 votes)
Call for contributors to the Alignment Newsletter 2019-08-21T18:21:31.113Z · score: 39 (12 votes)
Clarifying some key hypotheses in AI alignment 2019-08-15T21:29:06.564Z · score: 68 (28 votes)
[AN #61] AI policy and governance, from two people in the field 2019-08-05T17:00:02.048Z · score: 11 (5 votes)
[AN #60] A new AI challenge: Minecraft agents that assist human players in creative mode 2019-07-22T17:00:01.759Z · score: 25 (10 votes)
[AN #59] How arguments for AI risk have changed over time 2019-07-08T17:20:01.998Z · score: 43 (9 votes)
Learning biases and rewards simultaneously 2019-07-06T01:45:49.651Z · score: 43 (12 votes)
[AN #58] Mesa optimization: what it is, and why we should care 2019-06-24T16:10:01.330Z · score: 50 (13 votes)
[AN #57] Why we should focus on robustness in AI safety, and the analogous problems in programming 2019-06-05T23:20:01.202Z · score: 28 (9 votes)
[AN #56] Should ML researchers stop running experiments before making hypotheses? 2019-05-21T02:20:01.765Z · score: 22 (6 votes)
[AN #55] Regulatory markets and international standards as a means of ensuring beneficial AI 2019-05-05T02:20:01.030Z · score: 18 (6 votes)
[AN #54] Boxing a finite-horizon AI system to keep it unambitious 2019-04-28T05:20:01.179Z · score: 21 (6 votes)
Alignment Newsletter #53 2019-04-18T17:20:02.571Z · score: 22 (6 votes)
Alignment Newsletter One Year Retrospective 2019-04-10T06:58:58.588Z · score: 93 (27 votes)
Alignment Newsletter #52 2019-04-06T01:20:02.232Z · score: 20 (5 votes)
Alignment Newsletter #51 2019-04-03T04:10:01.325Z · score: 28 (5 votes)
Alignment Newsletter #50 2019-03-28T18:10:01.264Z · score: 16 (3 votes)
Alignment Newsletter #49 2019-03-20T04:20:01.333Z · score: 26 (8 votes)
Alignment Newsletter #48 2019-03-11T21:10:02.312Z · score: 31 (13 votes)
Alignment Newsletter #47 2019-03-04T04:30:11.524Z · score: 21 (5 votes)
Alignment Newsletter #46 2019-02-22T00:10:04.376Z · score: 18 (8 votes)
Alignment Newsletter #45 2019-02-14T02:10:01.155Z · score: 27 (9 votes)
Learning preferences by looking at the world 2019-02-12T22:25:16.905Z · score: 47 (13 votes)
Alignment Newsletter #44 2019-02-06T08:30:01.424Z · score: 20 (6 votes)
Conclusion to the sequence on value learning 2019-02-03T21:05:11.631Z · score: 48 (11 votes)
Alignment Newsletter #43 2019-01-29T21:10:02.373Z · score: 15 (5 votes)
Future directions for narrow value learning 2019-01-26T02:36:51.532Z · score: 12 (5 votes)
The human side of interaction 2019-01-24T10:14:33.906Z · score: 18 (5 votes)


Comment by rohinmshah on Clarifying "AI Alignment" · 2020-01-19T00:29:04.522Z · score: 2 (1 votes) · LW · GW
This opens the possibility of agents that make "well intentioned" mistakes taking the form of sophisticated plans that are catastrophic for the user.

Agreed that this is in theory possible, but it would be quite surprising, especially if we are specifically aiming to train systems that behave corrigibly.

In the above scenario, is Alpha "motivation-aligned"

If Alpha can predict that the user would say not to do the irreversible action, then at the very least it isn't corrigible, and it would be rather hard to argue that it is intent aligned.

But, such a concept would depend in complicated ways on the agent's internals.

That, or it could depend on the agent's counterfactual behavior in other situations. I agree it can't be just the action chosen in the particular state.

Moreover, the latter already produced viable directions for mathematical formalization, and the former has not (AFAIK).

I guess you wouldn't count universality. Overall I agree. I'm relatively pessimistic about mathematical formalization. (Probably not worth debating this point; feels like people have talked about it at length in Realism about rationality without making much progress.)

it refers to the actual things that agent does, and the ways in which these things might have catastrophic consequences.

I do want to note that all of these require you to make assumptions of the form, "if there are traps, either the user or the agent already knows about them" and so on, in order to avoid no-free-lunch theorems.

Comment by rohinmshah on Realism about rationality · 2020-01-18T23:53:38.232Z · score: 2 (1 votes) · LW · GW
I disagree with the version that replaces 'MIRI's theories' with 'mathematical theories of embedded rationality'

Yeah, I think this is the sense in which realism about rationality is an important disagreement.

But also, to the extent that your theory is mathematisable and comes with 'error bars'

Yeah, I agree that this would make it easier to build multiple levels of abstractions "on top". I also would be surprised if mathematical theories of embedded rationality came with tight error bounds (where "tight" means "not so wide as to be useless"). For example, current theories of generalization in deep learning do not provide tight error bounds to my knowledge, except in special cases that don't apply to the main successes of deep learning.

When I read a MIRI paper, it typically seems to me that the theories discussed are pretty abstract, and as such there are more levels below than above. [...] They are also mathematised enough that I'm optimistic about upwards abstraction having the possibility of robustness.


The levels below seem mostly unproblematic (except for machine learning, which in the form of deep learning is often under-theorised).

I am basically only concerned about machine learning, when I say that you can't build on the theories. My understanding of MIRI's mainline story of impact is that they develop some theory that AI researchers use to change the way they do machine learning that leads to safe AI. This sounds to me like there are multiple levels of inference: "MIRI's theory" -> "machine learning" -> "AGI". This isn't exactly layers of abstraction, but I think the same principle applies, and this seems like too many layers.

You could imagine other stories of impact, and I'd have other questions about those, e.g. if the story was "MIRI's theory will tell us how to build aligned AGI without machine learning", I'd be asking when the theory was going to include computational complexity.

Comment by rohinmshah on rohinmshah's Shortform · 2020-01-18T23:21:02.473Z · score: 4 (2 votes) · LW · GW

I was reading Avoiding Side Effects By Considering Future Tasks, and it seemed like it was doing something very similar to relative reachability. This is an exploration of that; it assumes you have already read the paper and the relative reachability paper. It benefitted from discussion with Vika.

Define the reachability R(s1, s2) = γ^|τ(s1 → s2)|, where τ(s1 → s2) is the optimal policy for getting from s1 to s2, and |τ(s1 → s2)| is the length of its trajectory. This is the notion of reachability both in the original paper and the new one.

Then, for the new paper when using a baseline, the future task value is:

F(s) = E_g[min(R(s, g), R(s', g))]

where s' is the baseline state and g is the future goal.

In a deterministic environment, this can be rewritten as:

F(s) = E_g[R(s', g)] − E_g[max(R(s', g) − R(s, g), 0)]
     = E_g[R(s', g)] − d_RR(s; s')

Here, d_RR is relative reachability, and the last line depends on the fact that the goal is equally likely to be any state.

Note that the first term only depends on the number of timesteps, since it only depends on the baseline state s'. So for a fixed time step, the first term is a constant.
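As a concrete check of this decomposition (a toy deterministic graph of my own construction, not from either paper): reachability is γ raised to the shortest-path length, relative reachability is the truncated-difference penalty averaged over goals, and the future task value E_g[min(R(s, g), R(s', g))] splits exactly into the baseline-only term minus the relative reachability penalty.

```python
from itertools import product

# Toy deterministic environment: states 0..3, edges give one-step transitions.
# (Hypothetical example; neither paper uses this exact environment.)
edges = {0: [1, 2], 1: [3], 2: [], 3: [1]}  # state 2 is a dead end
states = list(edges)
gamma = 0.9

def dist(a, b):
    # BFS shortest-path length from a to b (inf if unreachable).
    frontier, seen, d = [a], {a}, 0
    while frontier:
        if b in frontier:
            return d
        nxt = [y for x in frontier for y in edges[x] if y not in seen]
        seen.update(nxt)
        frontier, d = nxt, d + 1
    return float("inf")

def reach(a, b):
    # R(a, b) = gamma^(length of the optimal trajectory from a to b).
    return gamma ** dist(a, b)  # gamma^inf -> 0.0 for unreachable states

def rel_reach(s, s_base):
    # d_RR(s; s') = E_g[ max(R(s', g) - R(s, g), 0) ], goal uniform over states.
    return sum(max(reach(s_base, g) - reach(s, g), 0) for g in states) / len(states)

def future_value(s, s_base):
    # F(s) = E_g[ min(R(s, g), R(s', g)) ], goal uniform over states.
    return sum(min(reach(s, g), reach(s_base, g)) for g in states) / len(states)

# The decomposition F(s) = E_g[R(s', g)] - d_RR(s; s') holds for every state pair.
for s, s_base in product(states, states):
    lhs = future_value(s, s_base)
    rhs = sum(reach(s_base, g) for g in states) / len(states) - rel_reach(s, s_base)
    assert abs(lhs - rhs) < 1e-9
```

The identity is just min(x, y) = y − max(y − x, 0) applied pointwise before averaging over goals.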

The optimal value function in the new paper (page 3, with my notation for the baseline state instead of theirs) is just the regular Bellman equation, but with the following augmented reward (here s'_t is the baseline state at time t):

Terminal states: r'(s_t) = r(s_t) + γ̄·F(s_t)

Non-terminal states: r'(s_t) = r(s_t) + (1 − γ)γ̄·F(s_t)

(Here γ̄ is the hyperparameter weighting the future task value.)

For comparison, the original relative reachability reward is r_RR(s_t) = r(s_t) − β·d_RR(s_t; s'_t), where β is a hyperparameter.

The first and third terms in the augmented reward are very similar to the two terms in the relative reachability reward. The second term in the augmented reward only depends on the baseline.

All of these rewards so far are for finite-horizon MDPs (at least, that's what it sounds like from the paper, and if not, they could be anyway). Let's convert them to infinite-horizon MDPs (which will make things simpler, though that's not obvious yet). To convert a finite-horizon MDP to an infinite-horizon MDP, you take all the terminal states, add a self-loop, and multiply the rewards in terminal states by a factor of (1 − γ) (to account for the fact that the agent gets that reward infinitely often, rather than just once as in the original MDP). Then, we have:

Non-terminal states: r'(s_t) = r(s_t) + (1 − γ)γ̄·F(s_t)

What used to be terminal states that are now self-loop states: r'(s_t) = (1 − γ)·r(s_t) + (1 − γ)γ̄·F(s_t)
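A quick numerical check of the conversion step (my own sketch, with arbitrary toy numbers): a terminal reward r received once at time T contributes γ^T·r to the discounted return, and after adding the self-loop and scaling the terminal reward by (1 − γ), the agent receives (1 − γ)·r at every step from T onwards, which sums to the same return.

```python
gamma, r, T = 0.9, 2.5, 4  # arbitrary toy values

# Finite-horizon: terminal reward r is received exactly once, at time T.
finite_return = gamma ** T * r

# Infinite-horizon: the self-loop pays (1 - gamma) * r every step from T onwards.
# Truncate the geometric sum at a large horizon; the tail is negligible.
infinite_return = sum(gamma ** t * (1 - gamma) * r for t in range(T, 10_000))

assert abs(finite_return - infinite_return) < 1e-9
```

This is why the transformation preserves the optimal policy: every trajectory's discounted return is unchanged.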
Note that all of the transformations I've done have preserved the optimal policy, so any conclusions about these reward functions apply to the original methods. We're ready for analysis. There are exactly two differences between relative reachability and future state rewards:

First, the future state rewards have an extra term proportional to E_g[R(s'_t, g)].

This term depends only on the baseline state s'_t. For the starting state and inaction baselines, the policy cannot affect this term at all. As a result, this term does not affect the optimal policy and doesn't matter.

For the stepwise inaction baseline, this term certainly does influence the policy, but in a bad way: the agent is incentivized to interfere with the environment to preserve reachability. For example, in the human-eating-sushi environment, the agent is incentivized to take the sushi off of the belt, so that in future baseline states, it is possible to reach goals that involve sushi.

Second, in non-terminal states, relative reachability weights the penalty by β instead of (1 − γ)γ̄ (where γ̄ is the hyperparameter weighting the future task value). Really, since γ̄ (and thus (1 − γ)γ̄) is an arbitrary hyperparameter, the actual big deal is that in relative reachability, the weight on the penalty switches from β in non-terminal states to the smaller (1 − γ)β in terminal / self-loop states. This effectively means that relative reachability provides an incentive to finish the task faster, so that the penalty weight goes down faster. (This is also clear from the original paper: since it's a finite-horizon MDP, the faster you end the episode, the less penalty you accrue over time.)

Summary: the actual effect of the new paper's framing is that it 1. removes the "extra" incentive to finish the task quickly that relative reachability provided, and 2. adds an extra reward term that does nothing for the starting state and inaction baselines but provides an interference incentive for the stepwise inaction baseline.

(That said, it starts from a very different place than the original RR paper, so it's interesting that they somewhat converge here.)

Comment by rohinmshah on Realism about rationality · 2020-01-18T20:54:50.566Z · score: 4 (2 votes) · LW · GW
people reasoned in the relevant theories and then built things in the real world based on the results of that reasoning

Agreed. I'd say they built things in the real world that were "one level above" their theories.

if that's true, [...] then I'd think that spending time and effort developing the relevant theories was worth it


you seem to be pointing at something else


Overall I think these relatively-imprecise theories let you build things "one level above", which I think your examples fit into. My claim is that it's very hard to use them to build things "2+ levels above".

Separately, I claim that:

  • "real AGI systems" are "2+ levels above" the sorts of theories that MIRI works on.
  • MIRI's theories will always be the relatively-imprecise theories that can't scale to "2+ levels above".

(All of this with weak confidence.)

I think you disagree with the underlying model, but assuming you granted that, you would disagree with the second claim; I don't know what you'd think of the first.

Comment by rohinmshah on Realism about rationality · 2020-01-18T06:47:31.970Z · score: 2 (1 votes) · LW · GW

On the model proposed in this comment, I think of these as examples of using things / abstractions / theories with imprecise predictions to reason about things that are "directly relevant".

If I agreed with the political example (and while I wouldn't say that myself, it's within the realm of plausibility), I'd consider that a particularly impressive version of this.

Comment by rohinmshah on Realism about rationality · 2020-01-18T01:05:48.181Z · score: 5 (2 votes) · LW · GW

I think we disagree primarily on 2 (and also how doomy the default case is, but let's set that aside).

In claiming that rationality is as real as reproductive fitness, I'm claiming that there's a theory of evolution out there.

I think that's a crux between you and me. I'm no longer sure if it's a crux between you and Richard. (ETA: I shouldn't call this a crux, I wouldn't change my mind on whether MIRI work is on-the-margin more valuable if I changed my mind on this, but it would be a pretty significant update.)

Reproductive fitness does seem to me like the kind of abstraction you can build on, though. For example, the theory of kin selection is a significant theory built on top of it.

Yeah, I was ignoring that sort of stuff. I do think this post would be better without the evolutionary fitness example because of this confusion. I was imagining the "unreal rationality" world to be similar to what Daniel mentions below:

I think I was imagining an alternative world where useful theories of rationality could only be about as precise as theories of liberalism, or current theories about why England had an industrial revolution when it did, and no other country did instead.

But, separately, I don't get how you're seeing reproductive fitness and evolution as having radically different realness, such that you wanted to systematically correct. I agree they're separate questions, but in fact I see the realness of reproductive fitness as largely a matter of the realness of evolution -- without the overarching theory, reproductive fitness functions would be a kind of irrelevant abstraction and therefore less real.

Yeah, I'm going to try to give a different explanation that doesn't involve "realness".

When groups of humans try to build complicated stuff, they tend to do so using abstraction. The most complicated stuff is built on a tower of many abstractions, each sitting on top of lower-level abstractions. This is most evident (to me) in software development, where the abstraction hierarchy is staggeringly large, but it applies elsewhere, too: the low-level abstractions of mechanical engineering are "levers", "gears", "nails", etc.

A pretty key requirement for abstractions to work is that they need to be as non-leaky as possible, so that you do not have to think about them as much. When I code in Python and I write "x + y", I can assume that the result will be the sum of the two values, and this is basically always right. Notably, I don't have to think about the machine code that deals with the fact that overflow might happen. When I write in C, I do have to think about overflow, but I don't have to think about how to implement addition at the bitwise level. This becomes even more important at the group level, because communication is expensive, slow, and low-bandwidth relative to thought, and so you need non-leaky abstractions so that you don't need to communicate all the caveats and intuitions that would accompany a leaky abstraction.
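A concrete toy version of that contrast (my own sketch; the 32-bit wraparound is simulated in Python rather than real C): Python's integer abstraction doesn't leak, while fixed-width addition exposes the machine-level overflow behavior that C programmers must track.

```python
def add_int32(x, y):
    """Simulate C-style 32-bit signed addition (wraps around on overflow)."""
    return (x + y + 2**31) % 2**32 - 2**31

big = 2**31 - 1  # INT32_MAX

# Python's abstraction doesn't leak: "x + y" is always the mathematical sum.
assert big + 1 == 2**31

# The 32-bit abstraction leaks: the same addition wraps around to INT32_MIN.
assert add_int32(big, 1) == -(2**31)
```

The caveat-free version of the abstraction is what lets you stop thinking about the level below.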

One way to operationalize this is that to be built on, an abstraction must give extremely precise (and accurate) predictions.

It's fine if there's some complicated input to the abstraction, as long as that input can be estimated well in practice. This is what I imagine is going on with evolution and reproductive fitness -- if you can estimate reproductive fitness, then you can get very precise and accurate predictions, as with e.g. the Price equation that Daniel mentioned. (And you can estimate fitness, either by using things like the Price equation + real data, or by controlling the environment where you set up the conditions that make something reproductively fit.)
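A minimal sketch of that kind of precision (the standard no-transmission-bias form of the Price equation, with toy numbers of my own): the covariance between fitness and a trait, divided by mean fitness, exactly predicts the change in the trait's mean across one generation.

```python
# Toy population: each individual has a trait value z and fitness w (number of
# offspring). Offspring inherit the parent's trait exactly, so the
# transmission-bias term of the Price equation is zero.
z = [1.0, 2.0, 3.0, 4.0]
w = [1, 1, 2, 4]

n = len(z)
mean_z = sum(z) / n
mean_w = sum(w) / n
cov_wz = sum((wi - mean_w) * (zi - mean_z) for wi, zi in zip(w, z)) / n

# Price equation (no transmission bias): delta z_bar = Cov(w, z) / w_bar
predicted_delta = cov_wz / mean_w

# Direct simulation: build the offspring generation and measure the change.
offspring = [zi for wi, zi in zip(w, z) for _ in range(wi)]
actual_delta = sum(offspring) / len(offspring) - mean_z

assert abs(predicted_delta - actual_delta) < 1e-12
```

The hard-to-specify input is the fitness vector w; given an estimate of it, the prediction is exact.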

If a thing cannot provide extremely precise and accurate predictions, then I claim that humans mostly can't build on top of it. We can use it to make intuitive arguments about things very directly related to it, but can't generalize it to something more far-off. Some examples from these comment threads of what "inferences about directly related things" looks like:

current theories about why England had an industrial revolution when it did
[biology] has far more practical consequences (thinking of medicine)
understanding why overuse of antibiotics might weaken the effect of antibiotics [based on knowledge of evolution]

Note that in all of these examples, you can more or less explain the conclusion in terms of the thing it depends on. E.g. You can say "overuse of antibiotics might weaken the effect of antibiotics because the bacteria will evolve / be selected to be resistant to the antibiotic".

In contrast, for abstractions like "logic gates", "assembly language", "levers", etc, we have built things like rockets and search engines that certainly could not have been built without those abstractions, but nonetheless you'd be hard pressed to explain e.g. how a search engine works if you were only allowed to talk with abstractions at the level of logic gates. This is because the precision afforded by those abstractions allows us to build huge hierarchies of better abstractions.
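A small illustration of such a hierarchy (my own sketch): each layer below is defined purely in terms of the layer beneath it, and by the time you reach "add two 8-bit numbers" nobody needs to think about NAND anymore.

```python
# Layer 0: one primitive gate.
def nand(a, b):
    return 1 - (a & b)

# Layer 1: basic gates, built only from NAND.
def inv(a):     return nand(a, a)
def and_(a, b): return inv(nand(a, b))
def or_(a, b):  return nand(inv(a), inv(b))
def xor(a, b):  return and_(or_(a, b), nand(a, b))

# Layer 2: a full adder, built only from layer-1 gates.
def full_adder(a, b, cin):
    s = xor(xor(a, b), cin)
    cout = or_(and_(a, b), and_(cin, xor(a, b)))
    return s, cout

# Layer 3: an 8-bit ripple-carry adder, built only from full adders.
def add8(x, y):
    result, carry = 0, 0
    for i in range(8):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result

# The top of the tower behaves like ordinary addition (mod 256).
for x, y in [(0, 0), (1, 1), (100, 55), (200, 99), (255, 1)]:
    assert add8(x, y) == (x + y) % 256
```

Each layer's correctness is checkable without reopening the layers below, which is exactly what a non-leaky abstraction buys you.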

So now I'd go back and state our crux as:

Is there a theory of rationality that is sufficiently precise to build hierarchies of abstraction?

I would guess not. It sounds like you would guess yes.

I think this is upstream of 2. When I say I somewhat agree with 2, I mean that you can probably get a theory of rationality that makes imprecise predictions, which allows you to say things about "directly relevant things", which will probably let you say some interesting things about AI systems, just not very much. I'd expect that, to really affect ML systems, given how far away from regular ML research MIRI research is, you would need a theory that's precise enough to build hierarchies with.

(I think I'd also expect that you need to directly use the results of the research to build an AI system, rather than using it to inform existing efforts to build AI.)

(You might wonder why I'm optimistic about conceptual ML safety work, which is also not precise enough to build hierarchies of abstraction. The basic reason is that ML safety is "directly relevant" to existing ML systems, and so you don't need to build hierarchies of abstraction -- just the first imprecise layer is plausibly enough. You can see this in the fact that there are already imprecise concepts that are directly talking about safety.)

The security mindset model of reaching high confidence is not that you have a model whose overall predictive accuracy is high enough, but rather that you have an argument for security which depends on few assumptions, each of which is individually very likely. E.G., in computer security you don't usually need exact models of attackers, and a system which relies on those is less likely to be secure.

Your few assumptions need to talk about the system you actually build. On the model I'm outlining, it's hard to state the assumptions for the system you actually build, and near-impossible to be very confident in those assumptions, because they are (at least) one level of hierarchy higher than the (assumed imprecise) theory of rationality.

Comment by rohinmshah on Exploring safe exploration · 2020-01-17T15:10:28.506Z · score: 4 (2 votes) · LW · GW
A particular prediction I have now, but is weakly held, is that episode boundaries are weak and permeable, and will probably be obsolete at some point. There's a bunch of reasons I think this, but maybe the easiest to explain is that humans learn and are generally intelligent and we don't have episode boundaries.
Given this, I think the "within-episode exploration" and "across-episode exploration" relax into each other, and (as the distinction of episode boundaries fades) turn into the same thing, which I think is fine to call "safe exploration".

My main reason for making the separation is that in every deep RL algorithm I know of there is exploration-that-is-incentivized-by-gradient-descent and exploration-that-is-not-incentivized-by-gradient-descent and it seems like these should be distinguished. Currently due to episode boundaries these cleanly correspond to within-episode and across-episode exploration respectively, but even if episode boundaries become obsolete I expect the question of "is this exploration incentivized by the (outer) optimizer" to remain relevant. (Perhaps we could call this outer and inner exploration, where outer exploration is the exploration that is not incentivized by the outer optimizer.)
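A minimal sketch of the "not incentivized by the optimizer" kind (my own toy example, using ε-greedy on a bandit as a stand-in): the exploration rate is imposed by the algorithm designer, and nothing in the value-learning update ever touches or tunes it.

```python
import random

random.seed(0)

# Toy 3-armed bandit; arm 2 is best.
true_means = [0.1, 0.5, 0.9]
q = [0.0, 0.0, 0.0]   # learned value estimates
epsilon, lr = 0.1, 0.1

explore_steps = 0
for _ in range(5000):
    if random.random() < epsilon:
        # "Outer" exploration: forced by the designer, regardless of q.
        arm = random.randrange(3)
        explore_steps += 1
    else:
        arm = max(range(3), key=lambda a: q[a])  # greedy exploitation
    reward = true_means[arm] + random.gauss(0, 0.1)
    q[arm] += lr * (reward - q[arm])

# The update rule never modified epsilon: exploration stayed at ~10% throughout,
# even after the estimates converged.
assert 0.05 < explore_steps / 5000 < 0.15
assert max(range(3), key=lambda a: q[a]) == 2
```

By contrast, "inner" exploration would be behavior the optimizer itself selects for (e.g. because information-gathering actions lead to higher return).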

I don't have a strong opinion on whether "safe exploration" should refer to just outer exploration or both outer and inner exploration, since both options seem compatible with the existing ML definition.

Comment by rohinmshah on Conclusion to the sequence on value learning · 2020-01-16T23:55:49.085Z · score: 6 (3 votes) · LW · GW

I feel like you are trying to critique something I wrote, but I'm not sure what? Could you be a bit more specific about what you think I think that you disagree with?

(In particular, the first paragraph sounds like a statement that I myself would make, so I'm not sure how it is a critique.)

Comment by rohinmshah on Impact measurement and value-neutrality verification · 2020-01-16T23:16:39.059Z · score: 4 (2 votes) · LW · GW

Hmm, I somehow never saw this reply, sorry about that.

you get something like Paul's going out with a whimper where our easy-to-specify values win out over our other values [...] it's very important that your AGI not be better at optimizing some of your values over others, as that will shift the distribution of value/resources/etc. away from the real human preference distribution that we want.

Why can't we tell it not to overoptimize the aspects that it understands until it figures out the other aspects?

value-neutrality verification isn't just about strategy-stealing: it's also about inner alignment, since it could help you separate optimization processes from objectives in a natural way that makes it easier to verify alignment properties (such as compatibility with strategy-stealing, but also possibly corrigibility) on those objects.

As you (now) know, my main crux is that I don't expect to be able to cleanly separate optimization and objectives, though I also am unclear whether value-neutral optimization is even a sensible concept taken separately from the environment in which the agent is acting (see this comment).

Comment by rohinmshah on Realism about rationality · 2020-01-13T03:07:47.139Z · score: 2 (1 votes) · LW · GW
But surely you wouldn't get the mathematics of natural selection without the general insight, and so I think the general insight deserves to get a bunch of the credit. And both the mathematics of natural selection and the general insight seem pretty tied up to the notion of 'reproductive fitness'.

Here is my understanding of what Abram thinks:

Rationality is like "reproductive fitness", in that it is hard to formalize and turn into hard math. Regardless of how much theoretical progress we make on understanding rationality, it is never going to turn into something that can make very precise, accurate predictions about real systems. Nonetheless, qualitative understanding of rationality, of the sort that can make rough predictions about real systems, is useful for AI safety.

Hopefully that makes it clear why I'm trying to imagine a counterfactual where the math was never developed.

It's possible that I'm misunderstanding Abram and he actually thinks that we will be able to make precise, accurate predictions about real systems; but if that's the case I think he in fact is "realist about rationality" and this post is in fact pointing at a crux between him and Richard (or him and me), though not as well as he would like.

Comment by rohinmshah on Realism about rationality · 2020-01-13T03:02:02.109Z · score: 4 (2 votes) · LW · GW
(Also I don't get why this discussion is treating evolution as 'non-real': stuff like the Price equation seems pretty formal to me. To me it seems like a pretty mathematisable theory with some hard-to-specify inputs like fitness.)

Yeah, I agree, see my edits to the original comment and also my reply to Ben. Abram's comment was talking about reproductive fitness the entire time and then suddenly switched to evolution at the end; I didn't notice this and kept thinking of evolution as reproductive fitness in my head, and then wrote a comment based on that where I used the word evolution despite thinking about reproductive fitness and the general idea of "there is a local hill-climbing search on reproductive fitness" while ignoring the hard math.

Comment by rohinmshah on Realism about rationality · 2020-01-13T02:29:30.987Z · score: 2 (1 votes) · LW · GW
A lot of these points about evolution register to me as straightforwardly false.

I don't know which particular points you mean. The only one that it sounds like you're arguing against is

the theory of evolution has not had nearly the same impact on our ability to make big things [...] I struggle to name a way that evolution affects an everyday person

Were there others?

I would take a pretty strong bet that the theory of natural selection has been revolutionary in the history of medicine.

I think the mathematical theory of natural selection + the theory of DNA / genes were probably very influential in both medicine and biology, because they make very precise predictions and the real world is a very good fit for the models they propose. (That is, they are "real", in the sense that "real" is meant in the OP.) I don't think that an improved mathematical understanding of what makes particular animals more fit has had that much of an impact on anything.

Separately, I also think the general insight of "each part of these organisms has been designed by a local hill-climbing process to maximise reproduction" would not have been very influential in either medicine or biology, had it not been accompanied by the math (and assuming no one ever developed the math).

On reflection, my original comment was quite unclear about this, I'll add a note to it to clarify.

I do still stand by the thing that I meant in my original comment, which is that to the extent that you think rationality is like reproductive fitness (the claim made in the OP that Abram seems to agree with), in being a very complicated mess of a function that we can't hope to capture in a simple equation, I don't think that improved understanding of that sort of thing has made much of an impact on our ability to do "big things" (as a proxy, things that affect normal people).

Within evolution, the claim would be that there has not been much impact from gaining an improved mathematical understanding of the reproductive fitness of some organism, or the "reproductive fitness" of some meme for memetic evolution.

Comment by rohinmshah on Realism about rationality · 2020-01-13T02:10:29.181Z · score: 2 (1 votes) · LW · GW

See response to Daniel below; I find this one a little compelling (but not that much).

Comment by rohinmshah on Realism about rationality · 2020-01-13T02:09:57.724Z · score: 4 (2 votes) · LW · GW
Crops and domestic animals that have been artificially selected for various qualities.

I feel fairly confident this was done before we understood evolution.

The fact that your kids will probably turn out like you without specific intervention on your part to make that happen.

Also seems like a thing we knew before we understood evolution.

The medical community encouraging people to not use antibiotics unnecessarily.

That one seems plausible; though I'd want to know more about the history of how this came up. It also seems like the sort of thing that we'd have figured out even if we didn't understand evolution, though it would have taken longer, and would have involved more deaths.

Going back to the AI case, my takeaway from this example is that understanding non-real things can still help if you need to get everything right the first time. And in fact, I do think that if you posit a discontinuity, such that we have to get everything right before that discontinuity, then the non-MIRI strategy looks worse because you can't gather as much empirical evidence (though I still wouldn't be convinced that the MIRI strategy is the right one).

Comment by rohinmshah on Realism about rationality · 2020-01-13T01:58:27.120Z · score: 2 (1 votes) · LW · GW

+1, it seems like some people with direct knowledge of evolutionary psychology get something out of it, but not everyone else.

Comment by rohinmshah on [AN #81]: Universality as a potential solution to conceptual difficulties in intent alignment · 2020-01-13T01:56:55.328Z · score: 4 (2 votes) · LW · GW
In this, belief set A and belief set B are analogous to A[C] and C (or some c in C), right?


If we replace our beliefs with A[C]'s, then how is that us trusting it "over" c or C? It seems like it's us trusting it, full stop

So I only showed the case where our belief set contains information about C's predictions, but it is allowed to contain information from A[C] and C (but not other agents). Even if it contains lots of information from C, we still need to trust A[C].

In contrast, if our belief set contained information about A[C]'s beliefs, then we would not trust A[C] over those beliefs.

Comment by rohinmshah on Realism about rationality · 2020-01-12T20:45:54.310Z · score: 4 (2 votes) · LW · GW

If I believed realism about rationality, I'd be closer to buying what I see as the MIRI story for impact. It's hard to say whether I'd actually change my mind without knowing the details of what exactly I'm updating to.

Comment by rohinmshah on New paper: (When) is Truth-telling Favored in AI debate? · 2020-01-12T20:41:39.889Z · score: 6 (3 votes) · LW · GW

Nice paper! I especially liked the analysis of cases in which feature debate works.

I have two main critiques:

  • The definition of truth-seeking seems strange to me: while you quantify it via the absolute accuracy of the debate outcome, I would define it based on the relative change in the judge's beliefs (whether the beliefs were more accurate at the end of the debate than at the beginning).
  • The feature debate formalization seems quite significantly different from debate as originally imagined.

I'll mostly focus on the second critique, which is the main reason that I'm not very convinced by the examples in which feature debate doesn't work. To me, the important differences are:

  • Feature debate does not allow for decomposition of the question during the argument phase
  • Feature debate does not allow the debaters to "challenge" each other with new questions.

I think this reduces the expressivity of feature debate from PSPACE to P (for polynomially-bounded judges).

In particular, with the original formulation of debate, the idea is that a debate of length n would try to approximate the answer that would be found by a tree of depth n of arguments and counterarguments (which has exponential size). So, even if you have a human judge who can only look at a polynomial-length debate, you can get results that would have been obtained from an exponential-sized tree of arguments (which can be simulated in PSPACE).

In contrast, with feature debates, the (polynomially-bounded) judge only updates on the evidence presented in the debate itself, which means that you can only do a polynomial amount of computation.

You kind of sort of mention this in the limitations, under the section "Commitments and high-level claims", but the proposed improved model is:

To reason about such debates, we further need a model which relates the different commitments, to arguments, initial answers, and each other. One way to get such a model is to view W as the set of assignments for a Bayesian network. In such setting, each question q ∈ Q would ask about the value of some node in W, arguments would correspond to claims about node values, and their connections would be represented through the structure of the network. Such a model seems highly structured, amenable to theoretical analysis, and, in the authors’ opinion, intuitive. It is, however, not necessarily useful for practical implementations of debate, since Bayes networks are computationally expensive and difficult to obtain.

This still seems to me to involve the format in which the judge can only update on the evidence presented in the debate (though it's hard to say without more details). I'd be much more excited about a model in which the agents can make claims about a space of questions, and as a step of the argument can challenge each other on any question from within that space, which enables the two points I listed above (decomposition and challenging).


Going through each of the examples in Section 4.2:

Unfair questions. A question may be difficult to debate when arguing for one side requires more complex arguments. Indeed, consider a feature debate in a world w uniformly sampled from Boolean-featured worlds Π_{i∈N} W_i = {0, 1}^N, and suppose the debate asks about the conjunctive function ϕ := W1 ∧ ⋯ ∧ WK for some K ∈ N.

This could be solved by regular debate easily, if you can challenge each other. In particular, it can be solved in 1 step: if the opponent's answer is anything other than 1, challenge them with the question "which Wi is 0?", and if they do respond with an answer, disagree with them, which the judge can check.

Arguably that question should be "out-of-bounds", because it's "more complex" than the original question. In that case, regular debate could solve it in O(log K) steps: use binary search to halve the interval on which the agents disagree, by challenging the agents on the question Wi ∧ ⋯ ∧ Wj for the disputed interval [i, j], starting from the interval [1, K].

Now, if K > 2^n (where n is the length of the debate), then even this strategy doesn't work. This is basically because at that size, even an exponential-sized tree of bounded agents is unable to figure out the true answer. This seems fine to me; if we really need even more powerful agents, we could do iterated debate. (This is effectively treating debate as an amplification step within the general framework of iterated amplification.)
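For concreteness, the binary-search strategy can be simulated. This is my own toy sketch (the function names and the simple dishonest-debater model are hypothetical, not from the paper): the judge repeatedly recurses into a half-interval on which the debaters disagree, so a conjunction over K features resolves in about log2(K) challenge rounds.

```python
import math

def conj(w, i, j):
    # True value of w[i] AND ... AND w[j] (0-indexed, inclusive).
    return all(w[i:j + 1])

def debate_rounds(w):
    """Simulate the binary-search debate strategy: the dishonest debater
    denies the true value of the full conjunction; each round the judge
    challenges both debaters on the two halves of the disputed interval
    and recurses into a half on which they disagree. Returns the number
    of rounds until the judge can directly check a single feature."""
    i, j = 0, len(w) - 1
    rounds = 0
    while i < j:
        mid = (i + j) // 2
        if conj(w, i, j):
            # Truth is 1, so the dishonest debater must claim some half
            # is 0; model it as always disputing the left half.
            j = mid
        else:
            # Truth is 0: recurse into a half that is actually 0, where
            # the dishonest debater (claiming 1 overall) must claim 1.
            if not conj(w, i, mid):
                j = mid
            else:
                i = mid + 1
        rounds += 1
    # Now the judge directly checks the single feature w[i].
    return rounds

# A conjunction over K = 1024 features resolves in log2(K) = 10 rounds,
# whether or not it is actually true.
K = 1024
assert debate_rounds([1] * K) == math.ceil(math.log2(K))
assert debate_rounds([1] * (K - 1) + [0]) == math.ceil(math.log2(K))
```

The point is only that the number of challenge rounds is logarithmic in K, which is why the strategy breaks down once K exceeds what a length-n debate can cover.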

Unstable debates. Even if a question does not bias the debate against the true answer as above, the debate outcome might still be uncertain until the very end. One way this could happen is if the judge always feels that more information is required to get the answer right. [...] consider the function ψ := xor(W1, . . . , WK) defined on worlds with Boolean features.

This case can also be handled via binary search as above. But you could have other functions that don't nicely decompose, and then this problem would still occur. In this case, the optimal answer is 1/2, as you note; this seems fine to me? The judge started out with a belief of 1/2, and at the end of the debate it stayed the same. So the debate didn't help, but it didn't hurt either; it seems fine if we can't use debate for arbitrary questions, as long as it doesn't lie to us about those questions. (When using natural language, I would hope for an answer like "This debate isn't long enough to give evidence one way or the other".)

To achieve the “always surprised and oscillating” pattern, we consider a prior π under which each feature wi is sampled independently from {0, 1}, but in a way that is skewed towards Wi = 0.

If you condition on a very surprising world, then it seems perfectly reasonable for the judge to be constantly surprised. If you sampled a world from that prior and ran debate, then the expected surprise of the judge would be low. (See also the second bullet point in this comment.)

Distracting evidence. For some questions, there are misleading arguments that appear plausible and then require extensive counter-argumentation to be proven false.

This is the sort of thing where the full exponential tree can deal with it because of the ability to decompose the question, but a polynomial-time "evidence collection" conversation could not. In your specific example, you want the honest agent to be able to challenge the dishonest agent on the individual sub-claims of the misleading argument. This allows you to quickly focus in on the claim that the agents disagree about, and then the honest agent only has to refute that one stalling case, allowing it to win the debate.

Comment by rohinmshah on Realism about rationality · 2020-01-12T19:14:11.325Z · score: 11 (3 votes) · LW · GW

ETA: The original version of this comment conflated "evolution" and "reproductive fitness", I've updated it now (see also my reply to Ben Pace's comment).

Realism about rationality is important to the theory of rationality (we should know what kind of theoretical object rationality is), but not so important for the question of whether we need to know about rationality.

MIRI in general and you in particular seem unusually (to me) confident that:

1. We can learn more than we already know about rationality of "ideal" agents (or perhaps arbitrary agents?).

2. This understanding will allow us to build AI systems that we understand better than the ones we build today.

3. We will be able to do this in time for it to affect real AI systems. (This could be either because it is unusually tractable and can be solved very quickly, or because timelines are very long.)

This is primarily based on what research you and MIRI do, some of MIRI's strategy writing, writing like the Rocket Alignment problem and law thinking, and an assumption that you are choosing to do this research because you think it is an effective way to reduce AI risk (given your skills).

(Another possibility is that you think that building AI the way we do now is so incredibly doomed that even though the story outlined above is unlikely, you see no other path by which to reduce x-risk, which I suppose might be implied by your other comment here.)

My current best argument for this position is realism about rationality; in this world, it seems like truly understanding rationality would enable a whole host of both capability and safety improvements in AI systems, potentially directly leading to a design for AGI (which would also explain the info hazards policy). I'd be interested in an argument for the three points listed above without realism about rationality (I agree with 1, somewhat agree with 2, and don't agree with 3).

If you don't have realism about rationality, then I basically agree with this sentence, though I'd rephrase it:

MIRI-cluster is essentially saying "biologists should want to invent evolution. Look at all the similarities across different animals. Don't you want to explain that?" Whereas the non-MIRI cluster is saying "biologists don't need to know about evolution."

(ETA: In my head I was replacing "evolution" with "reproductive fitness"; I don't agree with the sentence as phrased, I would agree with it if you talked only about understanding reproductive fitness, rather than also including e.g. the theory of natural selection, genetics, etc. In the rest of your comment you were talking about reproductive fitness, I don't know why you suddenly switched to evolution; it seems completely different from everything you were talking about before.)

To my knowledge, the theory of evolution (ETA: mathematical understanding of reproductive fitness) has not had nearly the same impact on our ability to make big things as (say) any theory of physics. The Rocket Alignment Problem explicitly makes an analogy to an invention that required a theory of gravitation / momentum etc. Even physics theories that talk about extreme situations can enable applications; e.g. GPS would not work without an understanding of relativity. In contrast, I struggle to name a way that evolution (ETA: insights based on reproductive fitness) affects an everyday person (ignoring irrelevant things like atheism-religion debates). There are lots of applications based on an understanding of DNA, but DNA is a "real" thing. (This would make me sympathetic to a claim that rationality research would give us useful intuitions that lead us to discover "real" things that would then be important, but I don't think that's the claim.) My underlying model is that when you talk about something so "real" that you can make extremely precise predictions about it, you can create towers of abstractions upon it, without worrying that they might leak. You can't do this with "non-real" things.

So I'd rephrase the sentence as: (ETA: changed the sentence a bit to talk about fitness instead of evolution)

MIRI-cluster is essentially saying "biologists should want to understand reproductive fitness. Look at all the similarities across different animals. Don't you want to explain that?" Whereas the non-MIRI cluster is saying "Yeah, it's a fascinating question to understand what makes animals fit, but given that we want to understand how antidepressants work, it is a better strategy to directly study what happens when an animal takes an antidepressant."

Which you could round off to "biologists don't need to know about reproductive fitness", in the sense that it is not the best use of their time.

ETA: I also have a model of you being less convinced by realism about rationality than others in the "MIRI crowd"; in particular, selection vs. control seems decidedly less "realist" than mesa-optimizers (which didn't have to be "realist", but was quite "realist" the way it was written, especially in its focus on search).

Comment by rohinmshah on [AN #81]: Universality as a potential solution to conceptual difficulties in intent alignment · 2020-01-11T16:58:20.512Z · score: 4 (2 votes) · LW · GW

No, under the current formalization, even if we are not in class C we have to trust A[C] over our own beliefs. Specifically, we need E[X | Z] = E[A[C](X) | Z] for any question X and any information Z about A[C]. But then if we are given the info that A[C](X) = Y, we have:

E[X | Z] = E[A[C](X) | Z] (definition of universality)

E[X | A[C](X) = Y] = E[A[C](X) | A[C](X) = Y] (plugging in the specific info we have)

E[A[C](X) | A[C](X) = Y] = Y (if we are told that A[C] says Y, then we should expect that A[C] says Y)

Putting it together, we have E[X | A[C](X) = Y] = Y; that is, given the information that A[C] says Y, we must expect that the answer to X is Y.

This happens because we don't have an observer-independent way of defining epistemic dominance: even if we have access to the ground truth, we don't know how to take two sets of beliefs and say "belief set A is strictly 'better' than belief set B" [1]. So what we do here is say "belief set A is strictly 'better' if this particular observer always trusts belief set A over belief set B", and "trust" is defined as "whatever we think belief set A believes is also what we believe".

You could hope that in the future we have an observer-independent way of defining epistemic dominance, and then the requirement that we adopt A[C]'s beliefs would go away.

  1. We could say that a set of beliefs is 'strictly better' if for every quantity X its belief is more accurate, but this is unachievable, because even full Bayesian updating on true information causes you to update in the wrong direction for some quantities, just by bad luck. ↩︎
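The step "if we are told that A[C] says Y, we should expect that A[C] says Y" is just the calibration property of conditional expectations, which can be checked numerically. A toy Monte Carlo sketch of my own (the "advisor" standing in for A[C], and the coin-flip world, are hypothetical illustrations):

```python
import random

random.seed(0)

# The quantity X depends on three coin flips; the advisor (standing in
# for A[C]) observes the first two flips, while we observe none. The
# advisor reports its conditional expectation of X.
N = 200_000
buckets = {}  # advisor's report -> list of true X values
for _ in range(N):
    z = [random.randint(0, 1) for _ in range(3)]
    x = sum(z)
    report = z[0] + z[1] + 0.5  # E[X | z1, z2], since E[z3] = 0.5
    buckets.setdefault(report, []).append(x)

# Calibration: conditional on the advisor saying Y, the average of the
# true X is (approximately) Y -- so "trusting" the advisor means
# adopting its report as our own expectation.
for report, xs in sorted(buckets.items()):
    avg = sum(xs) / len(xs)
    assert abs(avg - report) < 0.02, (report, avg)
```

This illustrates why trusting a dominating set of beliefs amounts to adopting its stated expectations wholesale.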
Comment by rohinmshah on Outer alignment and imitative amplification · 2020-01-10T02:50:53.270Z · score: 5 (2 votes) · LW · GW
a loss function is outer aligned at optimum if all the possible models that perform optimally according to that loss function are aligned with our goals—that is, they are at least trying to do what we want.

Why is the word "trying" necessary here? Surely the literal optimal model is actually doing what we want, and never has even benign failures?

The rest of the post makes sense with the "trying to do what we want" description of alignment (though I don't agree with all of it); I'm just confused with the "outer alignment at optimum" formalization, which seems distinctly different from the notion of alignment used in the rest of the post.

Comment by rohinmshah on [AN #80]: Why AI risk might be solved without additional intervention from longtermists · 2020-01-08T17:20:42.272Z · score: 3 (2 votes) · LW · GW
For a "local search NAS" (rather than "random search NAS") it seems that we should be considering here the set of ["almost-AGI architectures" from which the local search would not find an "AGI architecture"].
The "$1B NAS discontinuity scenario" allows for the $1B NAS to find "almost-AGI architectures" before finding an "AGI architecture".

Agreed. My point is that the $100M NAS would find the almost-AGI architectures. (My point with the size ratios is that whatever criterion you use to say "and that's why the $1B NAS finds AGI while the $100M NAS doesn't", my response would be that "well, almost-AGI architectures require a slightly easier-to-achieve value of <criterion>, that the $100M NAS would have achieved".)

Comment by rohinmshah on [AN #80]: Why AI risk might be solved without additional intervention from longtermists · 2020-01-08T01:29:53.491Z · score: 2 (1 votes) · LW · GW
I don't. NAS can be done with RL or evolutionary computation methods. (Tbc, when I said I model a big part of contemporary ML research as "trial and error", by trial and error I did not mean random search.)

I do think that similar conclusions apply there as well, though I'm not going to make a mathematical model for it.

finding non-fragile solution is not necessarily easy

I'm not saying it is; I'm saying that however hard it is to find a non-fragile good solution, it is easier to find a solution that is almost as good. When I say

adding more optimization power doesn't make much of a difference

I mean to imply that the existing optimization power will do most of the work, for whatever quality of solution you are getting.

Suppose that all model architectures are indeed non-fragile, and some of them can implement AGI (call them "AGI architectures"). It may be the case that relative to the set of model architectures that we can end up with when using our favorite method (e.g. evolutionary search), the AGI architectures are a tiny subset. E.g. the size ratio can be 10^-10 (and then running our evolutionary search 10x more times means roughly 10x the probability of finding an AGI architecture, if [number of runs] << 10^10).

(Aside: it would be way smaller than 10^-10.) In this scenario, my argument is that the size ratio for "almost-AGI architectures" is better (e.g. 10^-8), and so you're more likely to find one of those first.

In practice, if you have a thousand parameters that determine an architecture, and 10 settings for each of them, the size ratio for the (assumed unique) globally best architecture is 10^-1000. In this setting, I expect several orders of magnitude of difference between the size ratio of almost-AGI and the size ratio of AGI, making it essentially guaranteed that you find an almost-AGI architecture before an AGI architecture.
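Under the random-search model, this is just the observation that the first sample to clear the lower "almost-AGI" bar is very unlikely to already clear the higher "AGI" bar when the size ratios differ by orders of magnitude. A quick sketch with made-up ratios (10^-2 and 10^-4 here, purely illustrative):

```python
import random

random.seed(0)

P_ALMOST = 1e-2  # chance a random architecture is "almost-AGI" or better
P_AGI    = 1e-4  # chance it is full "AGI" (a subset of the above)

def first_hit_is_agi():
    """Draw architectures until one is at least almost-AGI; report
    whether that first hit was already full AGI (i.e. no almost-AGI
    "warning shot" came first)."""
    while True:
        u = random.random()
        if u < P_AGI:
            return True
        if u < P_ALMOST:
            return False

trials = 20_000
no_warning = sum(first_hit_is_agi() for _ in range(trials)) / trials
# Analytically this is P_AGI / P_ALMOST = 1%: random search almost
# always finds an almost-AGI architecture before an AGI one.
assert no_warning < 0.03
```

The analytic value P_AGI / P_ALMOST only shrinks as the gap between the two size ratios grows, which is the sense in which almost-AGI comes first "essentially guaranteed".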

Comment by rohinmshah on Exploring safe exploration · 2020-01-07T08:14:29.878Z · score: 2 (1 votes) · LW · GW
No, that's not what I was saying. When I said “reward acquisition” I meant the actual reward function (that is, the base objective).

Wait, then how is "improving across-episode exploration" different from "preventing the agent from making an accidental mistake"? (What's a situation that counts as one but not the other?)

Comment by rohinmshah on [AN #80]: Why AI risk might be solved without additional intervention from longtermists · 2020-01-06T22:29:20.146Z · score: 3 (2 votes) · LW · GW

In the situations you describe, I would still be somewhat optimistic about coordination. But yeah, such situations leading to doom seem plausible, and this is why the estimate is 90% instead of 95% or 99%. (Though note that the numbers are very rough.)

Comment by rohinmshah on Exploring safe exploration · 2020-01-06T22:20:53.912Z · score: 6 (3 votes) · LW · GW
In a previous comment thread, Rohin argued that safe exploration is best defined as being about the agent not making “an accidental mistake.”

I definitely was not arguing that. I was arguing that safe exploration is currently defined in ML as the agent making an accidental mistake, and that we should really not be having terminology collisions with ML. (I may have left that second part implicit.)

Like you, I do not think this definition makes sense in the context of powerful AI systems, because it is evaluated from the perspective of an engineer outside the whole system. However, it makes a lot of sense for current ML systems, which are concerned with e.g. training self-driving cars without ever having a single collision. You can solve the problem by using the engineer's knowledge to guide the training process. (See e.g. Parenting: Safe Reinforcement Learning from Human Input, Trial without Error: Towards Safe Reinforcement Learning via Human Intervention, Safe Reinforcement Learning via Shielding, Formal Language Constraints for Markov Decision Processes (specifically the hard constraints).)

Fundamentally, I think current safe exploration research is about trying to fix that problem—that is, it's about trying to make across-episode exploration less detrimental to reward acquisition.

Note that this also describes "prevent the agent from making accidental mistakes". I assume that the difference you see is that you could try to make across-episode exploration less detrimental from the agent's perspective, rather than from the engineer's perspective, but I think literally none of the algorithms in the four papers I cited above, or the ones in Safety Gym, could reasonably be said to be improving exploration from the agent's perspective and not the engineer's perspective. I'd be interested in an example of an algorithm that improves across-episode exploration from the agent's perspective, along with an explanation of why the improvements are from the agent's perspective rather than the engineer's.

Comment by rohinmshah on [AN #80]: Why AI risk might be solved without additional intervention from longtermists · 2020-01-06T17:47:20.995Z · score: 2 (1 votes) · LW · GW
Due to game theoretical stuff, the order in which we do things may matter (e.g. due to commitment races in logical time).

Can you give me an example? I don't see how this would work.

(Tbc, I'm imagining that the universe stops, and only I continue thinking; there are no other agents thinking while I'm thinking, and so afaict I should just implement UDT.)

Comment by rohinmshah on [AN #80]: Why AI risk might be solved without additional intervention from longtermists · 2020-01-06T17:45:33.505Z · score: 2 (1 votes) · LW · GW
Conditioned on [the first AGI is created at time t by AI lab X], it is very unlikely that immediately before t the researchers at X have a very low credence in the proposition "we will create an AGI sometime in the next 30 days".

It wasn't exactly that (in particular, I didn't have the researcher's beliefs in mind), but I also believe that statement for basically the same reasons so that should be fine. There's a lot of ambiguity in that statement (specifically, what is AGI), but I probably believe it for most operationalizations of AGI.

(For reference, I was considering "will there be a 1 year doubling of economic output that started before the first 4 year doubling of economic output ended"; for that it's not sufficient to just argue that we will get AGI suddenly, you also have to argue that the AGI will very quickly become superintelligent enough to double economic output in a very short amount of time.)

I'm pretty agnostic about whether the result of that $100M NAS would be "almost AGI".

I mean, the difference between a $100M NAS and a $1B NAS is:

  • Up to 10x the number of models evaluated
  • Up to 10x the size of models evaluated

If you increase the number of models by 10x and leave the size the same, that somewhat increases your optimization power. If you model the NAS as picking architectures randomly, the $1B NAS can have at most 10x the chance of finding AGI, regardless of fragility, and so can only have at most 10x the expected "value" (whatever your notion of "value").

If you then also model architectures as non-fragile, then once you have some optimization power, adding more optimization power doesn't make much of a difference, e.g. the max of n draws from Uniform([0, 1]) has expected value n/(n+1), so once n is already large (e.g. 100), increasing it makes ~no difference. Of course, our actual distributions will probably be more bottom-heavy, but as distributions get more bottom-heavy we use gradient descent / evolutionary search to deal with that.
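This diminishing-returns claim is easy to check numerically; going from n = 100 to n = 1000 draws moves the expected max only from about 0.990 to about 0.999. A quick Monte Carlo sketch (my own illustration):

```python
import random

random.seed(0)

def expected_max(n, trials=2000):
    """Monte Carlo estimate of E[max of n draws from Uniform(0, 1)],
    which is n / (n + 1) analytically."""
    return sum(max(random.random() for _ in range(n))
               for _ in range(trials)) / trials

# 10x more "optimization power" barely moves the best result found.
assert abs(expected_max(100) - 100 / 101) < 0.005
assert abs(expected_max(1000) - 1000 / 1001) < 0.005
```

So a 10x larger search budget buys less than a 1% improvement in the best draw, which is the sense in which the existing optimization power already does most of the work.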

For the size, it's possible that increases in size lead to huge increases in intelligence, but that doesn't seem to agree with ML practice so far. Even if you ignore trend extrapolation, I don't see a reason to expect that increasing model sizes should mean the difference between not-even-close-to-AGI and AGI.

Comment by rohinmshah on [AN #80]: Why AI risk might be solved without additional intervention from longtermists · 2020-01-06T16:52:35.641Z · score: 2 (1 votes) · LW · GW

We discussed this here for my interview; my answer is the same as it was then (basically a combination of 3 and 4). I don't know about the other interviewees.

Comment by rohinmshah on [AN #80]: Why AI risk might be solved without additional intervention from longtermists · 2020-01-06T00:51:05.770Z · score: 2 (1 votes) · LW · GW

Meta: I feel like I am arguing for "there will not be a discontinuity", and you are interpreting me as arguing for "we will not get AGI soon / AGI will not be transformative", neither of which I believe. (I have wide uncertainty on timelines, and I certainly think AGI will be transformative.) I'd like you to state what position you think I'm arguing for, tabooing "discontinuity" (not the arguments for it, just the position).

I indeed model a big part of contemporary ML research as "trial and error". I agree that it seems unlikely that before the first $1B NAS there won't be any $10M NAS. Suppose there will even be a $100M NAS just before the $1B NAS that (by assumption) results in AGI. I'm pretty agnostic about whether the result of that $100M NAS would serve as a fire alarm for AGI.

I'm arguing against FOOM, not about whether there will be a fire alarm. The fire alarm question seems orthogonal to me. I'm more uncertain about the fire alarm question.

quantitative trend analysis performs slight below average [...] NAS seems to me like a good example for an expensive computation that could plausibly constitute a "search in idea-space" that finds an AGI model [...] it may even apply to a '$1B SGD' (on a single huge network) [...] the $1B NAS may indeed just get lucky

This sounds to me like saying "well, we can't trust predictions based on past data, and we don't know that we won't find an AGI, so we should worry about that". I am not compelled by arguments that tell me to worry about scenario X without giving me a reason to believe that scenario X is likely. (Compare: "we can't rule out the possibility that the simulators want us to build a tower to the moon or else they'll shut off the simulation, so we better get started on that moon tower.")

This is not to say the such scenario X's must be false -- reality could be that way -- but that given my limited amount of time, I must prioritize which scenarios to pay attention to, and one really good heuristic for that is to focus on scenarios that have some inside-view reason that makes me think they are likely. If I had infinite time, I'd eventually consider these scenarios (even the simulators wanting us to build a moon tower hypothesis).

Some other more tangential things:

If we look at the history of deep learning from ~1965 to 2019, how well do trend extrapolation methods fare in terms of predicting performance gains for the next 3-4 orders of magnitude of compute? My best guess is that they don't fare all that well. For example, based on data prior to 2011, I assume such methods predict mostly business-as-usual for deep learning during 2011-2019 (i.e. completely missing the deep learning revolution).

The trend that changed in 2012 was that of the amount of compute applied to deep learning. I suspect trend extrapolation with compute as the x-axis would do okay; trend extrapolation with calendar year as the x-axis would do poorly. But as I mentioned above, this is not a crux for me, since it doesn't give me an inside-view reason to expect FOOM; I wouldn't even consider it weak evidence for FOOM if I changed my mind on this. (If the data showed a big discontinuity, that would be evidence, but I'm fairly confident that while there was a discontinuity it was relatively small.)

Comment by rohinmshah on [AN #80]: Why AI risk might be solved without additional intervention from longtermists · 2020-01-06T00:30:24.961Z · score: 2 (1 votes) · LW · GW
it is possible that we could train an AI against a proxy, and it would end up pointing to actual human values instead, simply due to imperfect optimization during training. I think that's what you have in mind

Kind of, but not exactly.

I think that whatever proxy is learned will not be a perfect pointer. I don't know if there is such a thing as a "perfect pointer", given that I don't think there is a "right" answer to the question of what human values are, and consequently I don't think there is a "right" answer to what is helpful vs. not helpful.

I think the learned proxy will be a good enough pointer that the agent will not be actively trying to kill us all, will let us correct it, and will generally do useful things. It seems likely that if the agent was magically scaled up a lot, then bad things could happen due to the errors in the pointer. But I'd hope that as the agent scales up, we improve and correct the pointer (where "we" doesn't have to be just humans; it could also include other AI assistants).

Comment by rohinmshah on [AN #80]: Why AI risk might be solved without additional intervention from longtermists · 2020-01-04T21:56:55.751Z · score: 2 (1 votes) · LW · GW

My guess is that if you have to deal with humans, as at least early AI systems will have to do, then abstractions like "betrayal" are heavily determined.

I agree that if you don't have to deal with humans, then things like "betrayal" may not arise; similarly if you don't have to deal with Earth, then "trees" are not heavily determined abstractions.

Comment by rohinmshah on Clarifying "AI Alignment" · 2020-01-04T19:51:19.661Z · score: 17 (5 votes) · LW · GW

I hadn't realized this post was nominated, partially because of my comment, so here's a late review. I basically continue to agree with everything I wrote then, and I continue to like this post for those reasons, and so I support including it in the LW Review.

Since writing the comment, I've come across another argument for thinking about intent alignment -- it seems like a "generalization" of assistance games / CIRL, which itself seems like a formalization of an aligned agent in a toy setting. In assistance games, the agent explicitly maintains a distribution over possible human reward functions, and instrumentally gathers information about human preferences by interacting with the human. With intent alignment, since the agent is trying to help the human, we expect the agent to instrumentally maintain a belief over what the human cares about, and gather information to refine this belief. We might hope that there are ways to achieve intent alignment that instrumentally incentivizes all the nice behaviors of assistance games, without requiring the modeling assumptions that CIRL does (e.g. that the human has a fixed known reward function).

Changes I'd make to my comment:

It isolates the major, urgent difficulty in a single subproblem. If we make an AI system that tries to do what we want, it could certainly make mistakes, but it seems much less likely to cause eg. human extinction.

I still think that the intent alignment / motivation problem is the most urgent, but there are certainly other problems that matter as well, so I would probably remove or clarify that point.

Comment by rohinmshah on Clarifying "AI Alignment" · 2020-01-04T19:35:48.212Z · score: 2 (1 votes) · LW · GW

Yeah, it's not meant to add that dimension of morality.

Perhaps it should be "getting your AI to try to help you". Trying to do the "wanted" thing is also reasonable.

Comment by rohinmshah on [AN #80]: Why AI risk might be solved without additional intervention from longtermists · 2020-01-04T19:28:48.086Z · score: 2 (1 votes) · LW · GW

If you wanted a provable guarantee before powerful AI systems are actually built, you probably can't do it without the things you listed.

I'm claiming that as we get powerful AI systems, we could figure out techniques that work with those AI systems. They only initially need to work for AI systems that are around our level of intelligence, and then we can improve our techniques in tandem with the AI systems gaining intelligence. In that setting, I'm relatively optimistic about things like "just train the AI to follow your instructions"; while this will break down in exotic cases or as the AI scales up, those cases are rare and hard to find.

Comment by rohinmshah on [AN #80]: Why AI risk might be solved without additional intervention from longtermists · 2020-01-04T19:15:04.174Z · score: 2 (1 votes) · LW · GW
(I listed in the grandparent two disjunctive reasons in support of this).

Okay, responding to those directly:

no previous NAS at a similar scale was ever carried out; or

I have many questions about this scenario:

  • What caused the researchers to go from "$1M run of NAS" to "$1B run of NAS", without first trying "$10M run of NAS"? I especially have this question if you're modeling ML research as "trial and error"; I can imagine justifying a $1B experiment before a $10M experiment if you have some compelling reason that the result you want will happen with the $1B experiment but not the $10M experiment; but if you're doing trial and error then you don't have a compelling reason.
  • Current AI systems are very subhuman, and throwing more money at NAS has led to relatively small improvements. Why don't we expect similar incremental improvements from the next 3-4 orders of magnitude of compute?
  • Suppose that such a NAS did lead to human-level AGI. Shouldn't that mean that the AGI makes progress in AI at the same rate that we did? How does that cause a FOOM? (Yes, the improvements the AI makes compound, whereas the improvements we make to AI don't compound, but to me that's the canonical case of continuous takeoff, e.g. as described in Takeoff speeds.)
the "path in model space" that the NAS traverses is very different from all the paths that previous NASs traversed. This seems to me plausible even if the model space of the $1B NAS is identical to ones used in previous NASs (e.g. if different random seeds yield very different paths); and it seems to me even more plausible if the model space of the $1B NAS is slightly novel.

In all the previous NASs, why did the paths taken produce AI systems that were so much worse than the one taken by the $1B NAS? Did the $1B NAS just get lucky?

(Again, this really sounds like a claim that "the path taken by NAS" is fragile.)

Relatedly, "modeling ML research as a local search in idea-space" is not necessarily contradictory to FOOM, if an important part of that local search can be carried out without human involvement

If you want to make the case for a discontinuity because of the lack of human involvement, you would need to argue:

  • The replacement for humans is way cheaper / faster / more effective than humans (in that case why wasn't it automated earlier?)
  • The discontinuity happens as soon as humans are replaced (otherwise, the system-without-human-involvement becomes the new baseline, and all future systems will look like relatively continuous improvements of this system)

The second point definitely doesn't apply to NAS and meta-learning, and I would argue that the first point doesn't apply either, though that's not obvious.

Comment by rohinmshah on [AN #80]: Why AI risk might be solved without additional intervention from longtermists · 2020-01-04T07:40:22.041Z · score: 6 (3 votes) · LW · GW

That's the crux for me; I expect AI systems that we build to be capable of "knowing what you mean" (using the appropriate level of abstraction). They may also use other levels of abstraction, but I expect them to be capable of using that one.

Even if an AI learns a concept of human values, we still need to be able to point to that concept within the AI's concept-space in order to actually align it

Yes, I would call that the central problem. (Though it would also be fine to build a pointer to a human and have the AI "help the human", without necessarily pointing to human values.)

Comment by rohinmshah on [AN #80]: Why AI risk might be solved without additional intervention from longtermists · 2020-01-04T07:34:01.971Z · score: 2 (1 votes) · LW · GW

I agree that Tesla does not seem very safety conscious (but it's notable that they are still safer than human drivers in terms of fatalities per mile, if I remember correctly?)

I think it already has.

Huh, what do you know.

Faced with an actual example, I'm realizing that what I actually expect would cause people to take it more seriously is a) the belief that AGI is near and b) an example where the AI algorithm "deliberately" causes a problem (i.e. "with full knowledge" that the thing it was doing was not what we wanted). I think most deep RL researchers already believe that reward hacking is a thing (which is what that study shows).

even with a culture war signal boost

Tangential, but that makes it less likely that I read it; I try to completely ignore anything with the term "racial bias" in its title unless it's directly pertinent to me. (Being about AI isn't enough to make it pertinent to me.)

Comment by rohinmshah on [AN #80]: Why AI risk might be solved without additional intervention from longtermists · 2020-01-04T07:23:53.052Z · score: 2 (1 votes) · LW · GW

Idk what to say here. I meet lots of AI researchers, and the best ones seem to me to be quite thoughtful. I can say what would change my mind:

I take the exploration of unprincipled hacks as very weak evidence against my position, if it's just in an academic paper. My guess is the researchers themselves would not advocate deploying their solution, or would say that it's worth deploying but it's an incremental improvement that doesn't solve the full problem. And even if the researchers don't say that, I suspect the companies actually deploying the systems would worry about it.

I would take the deployment of unprincipled hacks more seriously as evidence, but even there I would want to be convinced that shutting down the AI system was a better decision than deploying an unprincipled hack. (Because then I would have made the same decision in their shoes.)

Unprincipled hacks are in fact quite useful for the vast majority of problems; as a result it seems wrong to attribute irrationality to people because they use unprincipled hacks.

Comment by rohinmshah on [AN #80]: Why AI risk might be solved without additional intervention from longtermists · 2020-01-04T02:31:29.995Z · score: 4 (2 votes) · LW · GW

My impression is that people working on self-driving cars are incredibly safety-conscious, because the risks are very salient.

I don't think AI-Chernobyl has to be a Chernobyl level disaster, just something that makes the risks salient. E.g. perhaps an elder care AI robot pretends that all of its patients are fine in order to preserve its existence, and this leads to a death and is then discovered. If hospitals let AI algorithms make decisions about drugs according to complicated reward functions, I would expect this to happen with current capabilities. (It's notable to me that this doesn't already happen, given the insane hype around AI.)

Comment by rohinmshah on [AN #80]: Why AI risk might be solved without additional intervention from longtermists · 2020-01-04T02:21:52.006Z · score: 4 (2 votes) · LW · GW

Do you agree that an AI with extreme capabilities should know what you mean, even if it doesn't act in accordance with it? (This seems like an implication of "extreme capabilities".)

Comment by rohinmshah on [AN #80]: Why AI risk might be solved without additional intervention from longtermists · 2020-01-04T02:18:27.155Z · score: 2 (1 votes) · LW · GW

See this response.

Comment by rohinmshah on [AN #80]: Why AI risk might be solved without additional intervention from longtermists · 2020-01-04T02:17:34.214Z · score: 2 (1 votes) · LW · GW
The above 'FOOM via $1B NAS' scenario doesn't seem to me to require this property. Notice that the increase in capabilities during that NAS may be gradual (i.e. before evaluating the model that implements an AGI the NAS evaluates models that are "almost AGI"). The scenario would still count as a FOOM as long as the NAS yields an AGI and no model before that NAS ever came close to AGI.

In this case I'd apply the fragility argument to the research process, which was my original point (though it wasn't phrased as well then). In the NAS setting, my question is:

how exactly did someone stumble upon the correct NAS to run that would lead to intelligence by trial and error?

Basically, if you're arguing that most ML researchers just do a bunch of trial-and-error, then you should be modeling ML research as a local search in idea-space, and then you can apply the same fragility argument to it.

Comment by rohinmshah on [AN #80]: Why AI risk might be solved without additional intervention from longtermists · 2020-01-04T02:07:30.025Z · score: 4 (2 votes) · LW · GW

Lots of other things:

  • Are we imagining a small team of hackers in their basement trying to get AGI on a laptop, or a big corporation using tons of resources?
  • How does the AGI learn about the world? If you say "it reads the Internet", how does it learn to read?
  • When the developers realize that they've built AGI, is it still possible for them to pull the plug?
  • Why doesn't the AGI try to be deceptive in ways that we can detect, the way children do? Is it just immediately as capable as a smart human and doesn't need any training? How can that happen by just "finding the right architecture"?
  • Why is this likely to happen soon when it hasn't happened in the last sixty years?

I suspect answers to these will provoke lots of other questions. In contrast, the non-foom worlds that still involve AGI + very fast growth seem much closer to a "business-as-usual" world.

I also think that if you're worried about foom, you should basically not care about any of the work being done at DeepMind / OpenAI right now, because that's not the kind of work that can foom (except in the "we suddenly find the right architecture" story); yet I notice lots of doomy predictions about AGI are being driven by DM / OAI's work. (Of course, plausibly you think OpenAI / DM are not going to succeed, even if others do.)

Comment by rohinmshah on [AN #80]: Why AI risk might be solved without additional intervention from longtermists · 2020-01-03T23:02:44.968Z · score: 6 (3 votes) · LW · GW
the only thing which actually matters here is foom vs no foom

Yeah, I think I mostly agree with this.

if we have an AI with extreme capabilities but a confusing interface, then there's a high chance that we all die

Yeah, I agree with that (assuming "extreme capabilities" = rearranging atoms however it sees fit, or something of that nature), but why must it have a confusing interface? Couldn't you just talk to it, and it would know what you mean? So I do think the goal-directed point does matter.

I suspect that a sub-crux might be expectations about the resource requirements (i.e. compute & data) needed for AGI. I expect that, once we have the key concepts, human-level AGI will be able to run in realtime on an ordinary laptop.

I agree that this is a sub-crux. Note that I believe that eventually human-level AGI will be able to run on a laptop, just that it will be preceded by human-level AGIs that take more compute.

Training might require more resources, at least early on. That would reduce the unilateralist problem, but increase the chance of decisive strategic advantage due to the higher barrier to entry.

I tend to think that if problems arise, you've mostly lost already, so I'm actually happier about decisive strategic advantage because it reduces competitive pressure.

But tbc, I broadly agree with all of your points, and do think that in FOOM worlds most of my arguments don't work. (Though I continue to be confused what exactly a FOOM world looks like.)

Comment by rohinmshah on [AN #80]: Why AI risk might be solved without additional intervention from longtermists · 2020-01-03T21:51:59.012Z · score: 9 (2 votes) · LW · GW
and that NAS will stumble upon some complicated architecture that its corresponding model, after being trained with a massive amount of computing power, will implement an AGI.

In this case I'm asking why the NAS stumbled upon the correct mathematical architecture underlying intelligence.

Or rather, let's dispense with the word "mathematical" (which I mainly used because it seems to me that the arguments for FOOM usually involve someone coming up with the right mathematical insight underlying intelligence).

It seems to me that to get FOOM you need the property "if you make even a slight change to the thing, then it breaks and doesn't work", which I'll call fragility. Note that you cannot find fragile things using local search, except if you "get lucky" and start out at the correct solution.

Why did the NAS stumble upon the correct fragile architecture underlying intelligence?
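The claim that local search can't find fragile solutions can be illustrated with a toy hill-climbing experiment. The fitness landscape here is invented for the example: a broad, smooth hill plus an isolated narrow spike whose value collapses under any slight perturbation (the "fragile" optimum):

```python
import random

# Toy illustration of the fragility point: hill climbing reliably finds
# a broad, smooth optimum but essentially never lands on an isolated
# "fragile" optimum, even though the spike has much higher fitness.
# The landscape is made up for the example.

def fitness(x):
    smooth_hill = -((x - 2.0) ** 2)                        # broad optimum at x = 2
    fragile_spike = 100.0 if abs(x - 7.0) < 1e-6 else 0.0  # needle at x = 7
    return smooth_hill + fragile_spike

def hill_climb(start, steps=1000, step_size=0.1, rng=random):
    """Greedy local search: accept a random nearby point if it's no worse."""
    x = start
    for _ in range(steps):
        candidate = x + rng.uniform(-step_size, step_size)
        if fitness(candidate) >= fitness(x):
            x = candidate
    return x

rng = random.Random(0)
results = [hill_climb(rng.uniform(0.0, 10.0), rng=rng) for _ in range(20)]
# Every run converges near the broad hill at x = 2; none finds the
# fragile spike at x = 7, despite its far higher fitness.
```

On this reading, if AGI were a fragile point in architecture space, a trial-and-error process like NAS (or trial-and-error ML research generally) behaves like `hill_climb`: it would only reach the spike by starting on it, i.e. by getting lucky.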

Comment by rohinmshah on [AN #80]: Why AI risk might be solved without additional intervention from longtermists · 2020-01-03T21:42:42.536Z · score: 6 (3 votes) · LW · GW
Neural nets have around human performance on Imagenet.

But those trained neural nets are very subhuman on other image understanding tasks.

Then you can form an equally good, nonhuman concept by taking the better alien concept and adding random noise.

I would expect that the alien concepts are something we haven't figured out because we don't have enough data or compute or logic or some other resource, and that constraint will also apply to the AI. If you take that concept and "add random noise" (which I don't really understand), it would presumably still require the same amount of resources, and so the AI still won't find it.

For the rest of your comment, I agree that we can't theoretically rule those scenarios out, but there's no theoretical reason to rule them in either. So far the empirical evidence seems to me to be in favor of "abstractions are determined by the territory", e.g. ImageNet neural nets seems to have human-interpretable low-level abstractions (edge detectors, curve detectors, color detectors), while having strange high-level abstractions; I claim that the strange high-level abstractions are bad and only work on ImageNet because they were specifically designed to do so and ImageNet is sufficiently narrow that you can get to good performance with bad abstractions.

Comment by rohinmshah on [AN #80]: Why AI risk might be solved without additional intervention from longtermists · 2020-01-03T19:03:52.520Z · score: 4 (2 votes) · LW · GW

^ Yeah, in FOOM worlds I agree more with your (Donald's) reasoning. (Though I still have questions, like, how exactly did someone stumble upon the correct mathematical principles underlying intelligence by trial and error?)

The people who are ignoring or don't understand the current evidence will carry on ignoring or not understanding it.

I don't think we have good current evidence, so I don't infer much about whether or not people will buy future evidence from their reactions to current evidence. (See also six heuristics that I think cut against AI risk even after knowing the arguments for AI risk.)

Comment by rohinmshah on [AN #80]: Why AI risk might be solved without additional intervention from longtermists · 2020-01-03T18:55:53.117Z · score: 6 (3 votes) · LW · GW
Abstractions are pretty heavily determined by the territory.

+1, that's my response as well.

Comment by rohinmshah on [AN #80]: Why AI risk might be solved without additional intervention from longtermists · 2020-01-03T18:53:19.891Z · score: 9 (2 votes) · LW · GW

Hmm, I think I'd want to explicitly include two other points, that are kind of included in that but don't get communicated well by that summary:

  • There may not be a problem at all; perhaps by default powerful AI systems are not goal-directed.
  • If there is a problem, we'll get evidence of its existence before it's too late, and coordination to not build problematic AI systems will buy us additional time.