Distinguishing claims about training vs deployment 2021-02-03T11:30:06.636Z
Deutsch and Yudkowsky on scientific explanation 2021-01-20T01:00:04.235Z
Some thoughts on risks from narrow, non-agentic AI 2021-01-19T00:04:10.108Z
Excerpt from Arbital Solomonoff induction dialogue 2021-01-17T03:49:47.405Z
Why I'm excited about Debate 2021-01-15T23:37:53.861Z
Meditations on faith 2021-01-15T22:20:02.651Z
Eight claims about multi-agent AGI safety 2021-01-07T13:34:55.041Z
Commentary on AGI Safety from First Principles 2020-11-23T21:37:31.214Z
Continuing the takeoffs debate 2020-11-23T15:58:48.189Z
My intellectual influences 2020-11-22T18:00:04.648Z
Why philosophy of science? 2020-11-07T11:10:02.273Z
Responses to Christiano on takeoff speeds? 2020-10-30T15:16:02.898Z
Reply to Jebari and Lundborg on Artificial Superintelligence 2020-10-25T13:50:23.601Z
AGI safety from first principles: Conclusion 2020-10-04T23:06:58.975Z
AGI safety from first principles: Control 2020-10-02T21:51:20.649Z
AGI safety from first principles: Alignment 2020-10-01T03:13:46.491Z
AGI safety from first principles: Goals and Agency 2020-09-29T19:06:30.352Z
AGI safety from first principles: Superintelligence 2020-09-28T19:53:40.888Z
AGI safety from first principles: Introduction 2020-09-28T19:53:22.849Z
Safety via selection for obedience 2020-09-10T10:04:50.283Z
Safer sandboxing via collective separation 2020-09-09T19:49:13.692Z
The Future of Science 2020-07-28T02:43:37.503Z
Thiel on Progress and Stagnation 2020-07-20T20:27:59.112Z
Environments as a bottleneck in AGI development 2020-07-17T05:02:56.843Z
A space of proposals for building safe advanced AI 2020-07-10T16:58:33.566Z
Arguments against myopic training 2020-07-09T16:07:27.681Z
AGIs as collectives 2020-05-22T20:36:52.843Z
Multi-agent safety 2020-05-16T01:59:05.250Z
Competitive safety via gradated curricula 2020-05-05T18:11:08.010Z
Against strong bayesianism 2020-04-30T10:48:07.678Z
What is the alternative to intent alignment called? 2020-04-30T02:16:02.661Z
Melting democracy 2020-04-29T20:10:01.470Z
Richard Ngo's Shortform 2020-04-26T10:42:18.494Z
What achievements have people claimed will be warning signs for AGI? 2020-04-01T10:24:12.332Z
What information, apart from the connectome, is necessary to simulate a brain? 2020-03-20T02:03:15.494Z
Characterising utopia 2020-01-02T00:00:01.268Z
Technical AGI safety research outside AI 2019-10-18T15:00:22.540Z
Seven habits towards highly effective minds 2019-09-05T23:10:01.020Z
What explanatory power does Kahneman's System 2 possess? 2019-08-12T15:23:20.197Z
Why do humans not have built-in neural i/o channels? 2019-08-08T13:09:54.072Z
Book review: The Technology Trap 2019-07-20T12:40:01.151Z
What are some of Robin Hanson's best posts? 2019-07-02T20:58:01.202Z
On alien science 2019-06-02T14:50:01.437Z
A shift in arguments for AI risk 2019-05-28T13:47:36.486Z
Would an option to publish to AF users only be a useful feature? 2019-05-20T11:04:26.150Z
Which scientific discovery was most ahead of its time? 2019-05-16T12:58:14.628Z
When is rationality useful? 2019-04-24T22:40:01.316Z
Book review: The Sleepwalkers by Arthur Koestler 2019-04-23T00:10:00.972Z
Arguments for moral indefinability 2019-02-12T10:40:01.226Z
Coherent behaviour in the real world is an incoherent concept 2019-02-11T17:00:25.665Z


Comment by Richard_Ngo (ricraz) on What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs) · 2021-06-18T03:46:24.591Z · LW · GW

These aren't complicated or borderline cases, they are central examples of what we are trying to avert with alignment research.

I'm wondering if the disagreement over the centrality of this example is downstream from a disagreement about how easy the "alignment check-ins" that Critch talks about are. If they are the sort of thing that can be done successfully in a couple of days by a single team of humans, then I share Critch's intuition that the system in question starts off only slightly misaligned. By contrast, if they require a significant proportion of the human time and effort that was put into originally training the system, then I am much more sympathetic to the idea that what's being described is a central example of misalignment.

My (unsubstantiated) guess is that Paul pictures alignment check-ins becoming much harder (i.e. closer to the latter case mentioned above) as capabilities increase? Whereas maybe Critch thinks that they remain fairly easy in terms of number of humans and time taken, but that over time even this becomes economically uncompetitive.

Comment by Richard_Ngo (ricraz) on Taboo "Outside View" · 2021-06-17T13:29:06.937Z · LW · GW

I really like this post; it feels like it draws attention to an important lack of clarity.

One thing I'd suggest changing: when introducing new terminology, I think it's much better to use terms that are already widely comprehensible if possible, than terms based on specific references which you'd need to explain to people who are unfamiliar in each case.

So I'd suggest renaming 'ass-number' to wild guess and 'foxy aggregation' to multiple models or similar.

Comment by Richard_Ngo (ricraz) on Challenge: know everything that the best go bot knows about go · 2021-06-03T23:48:25.742Z · LW · GW

I'm not sure what you mean by "actual computation rather than the algorithm as a whole". I thought that I was talking about the knowledge of the trained model which actually does the "computation" of which move to play, and you were talking about the knowledge of the algorithm as a whole (i.e. the trained model plus the optimising bot).

Comment by Richard_Ngo (ricraz) on What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs) · 2021-05-29T07:45:49.249Z · LW · GW

Rhymes with carp.

Comment by Richard_Ngo (ricraz) on Richard Ngo's Shortform · 2021-05-27T09:26:09.500Z · LW · GW

In the scenario governed by data, the part that counts as self-improvement is where the AI puts itself through a process of optimisation by stochastic gradient descent with respect to that data.

You don't need that much hardware for data to be a bottleneck. For example, I think that there are plenty of economically valuable tasks that are easier to learn than StarCraft. But we get StarCraft AIs instead because games are the only task where we can generate arbitrarily large amounts of data.

Comment by Richard_Ngo (ricraz) on Richard Ngo's Shortform · 2021-05-26T10:58:01.153Z · LW · GW

Yudkowsky mainly wrote about recursive self-improvement from a perspective in which algorithms were the most important factors in AI progress - e.g. the brain in a box in a basement which redesigns its way to superintelligence.

Sometimes when explaining the argument, though, he switched to a perspective in which compute was the main consideration - e.g. when he talked about getting "a hyperexponential explosion out of Moore’s Law once the researchers are running on computers".

What does recursive self-improvement look like when you think that data might be the limiting factor? It seems to me that it looks a lot like iterated amplification: using less intelligent AIs to provide a training signal for more intelligent AIs.

I don't consider this a good reason to worry about IA, though: in a world where data is the main limiting factor, recursive approaches to generating it still seem much safer than alternatives.

Comment by Richard_Ngo (ricraz) on Snyder-Beattie, Sandberg, Drexler & Bonsall (2020): The Timing of Evolutionary Transitions Suggests Intelligent Life Is Rare · 2021-05-20T12:18:11.047Z · LW · GW

Yeah, this seems like a reasonable argument. It feels like it really relies on this notion of "pretty smart" though, which is hard to pin down. There's a case for including all of the following in that category:

And yet I'd guess that none of these were/are on track to reach human-level intelligence. Agree/disagree?

Comment by Richard_Ngo (ricraz) on Snyder-Beattie, Sandberg, Drexler & Bonsall (2020): The Timing of Evolutionary Transitions Suggests Intelligent Life Is Rare · 2021-05-20T07:07:48.953Z · LW · GW

My argument is consistent with the time from dolphin- to human-level intelligence being short in our species, because for anthropic reasons we find ourselves with all the necessary features (dexterous fingers, sociality, vocal cords, etc).

The claim I'm making is more like: for every 1 species that reaches human-level intelligence, there will be N species that get pretty smart, then get stuck, where N is fairly large. (And this would still be true if neurons were, say, 10x smaller and 10x more energy efficient.)

Now there are anthropic issues with evaluating this argument by pegging "pretty smart" to whatever level the second-most-intelligent species happens to be at. But if we keep running evolution forward, I can imagine elephants, whales, corvids, octopuses, big cats, and maybe a few others reaching dolphin-level intelligence. But I have a hard time picturing any of them developing cultural evolution.

Comment by Richard_Ngo (ricraz) on Formal Inner Alignment, Prospectus · 2021-05-18T11:57:42.286Z · LW · GW

Mesa-optimizers are in the search space and would achieve high scores in the training set, so why wouldn't we expect to see them?

I like this as a statement of the core concern (modulo some worries about the concept of mesa-optimisation, which I'll save for another time).

With respect to formalization, I did say up front that less-formal work, and empirical work, is still valuable.

I missed this disclaimer, sorry. So that assuages some of my concerns about balancing types of work. I'm still not sure what intuitions or arguments underlie your optimism about formal work, though. I assume that this would be fairly time-consuming to spell out in detail - but given that the core point of this post is to encourage such work, it seems worth at least gesturing towards those intuitions, so that it's easier to tell where any disagreement lies.

Comment by Richard_Ngo (ricraz) on Formal Inner Alignment, Prospectus · 2021-05-17T14:45:03.714Z · LW · GW

I have fairly mixed feelings about this post. On one hand, I agree that it's easy to mistakenly address some plausibility arguments without grasping the full case for why misaligned mesa-optimisers might arise. On the other hand, there has to be some compelling (or at least plausible) case for why they'll arise, otherwise the argument that 'we can't yet rule them out, so we should prioritise trying to rule them out' is privileging the hypothesis. 

Secondly, it seems like you're heavily prioritising formal tools and methods for studying mesa-optimisation. But there are plenty of things that formal tools have not yet successfully analysed. For example, if I wanted to write a constitution for a new country, then formal methods would not be very useful; nor if I wanted to predict a given human's behaviour, or understand psychology more generally. So what's the positive case for studying mesa-optimisation in big neural networks using formal tools?

In particular, I'd say that the less we currently know about mesa-optimisation, the more we should focus on qualitative rather than quantitative understanding, since the latter needs to build on the former. And since we currently do know very little about mesa-optimisation, this seems like an important consideration.

Comment by Richard_Ngo (ricraz) on Challenge: know everything that the best go bot knows about go · 2021-05-16T14:25:40.859Z · LW · GW

The trained AlphaZero model knows lots of things about Go, in a comparable way to how a dog knows lots of things about running.

But the algorithm that gives rise to that model can know arbitrarily few things. (After all, the laws of physics gave rise to us, but they know nothing at all.)

Comment by Richard_Ngo (ricraz) on Challenge: know everything that the best go bot knows about go · 2021-05-16T14:20:51.627Z · LW · GW

I'd say that this is too simple and programmatic to be usefully described as a mental model. The amount of structure encoded in the computer program you describe is very small, compared with the amount of structure encoded in the neural networks themselves. (I agree that you can have arbitrarily simple models of very simple phenomena, but those aren't the types of models I'm interested in here. I care about models which have some level of flexibility and generality, otherwise you can come up with dumb counterexamples like rocks "knowing" the laws of physics.)

As another analogy: would you say that the quicksort algorithm "knows" how to sort lists? I wouldn't, because you can instead just say that the quicksort algorithm sorts lists, which conveys more information (because it avoids anthropomorphic implications). Similarly, the program you describe builds networks that are good at Go, and does so by making use of the rules of Go, but can't do the sort of additional processing with respect to those rules which would make me want to talk about its knowledge of Go.
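For concreteness, here is a minimal quicksort sketch in Python (an illustration added here, not something from the original thread; the function and variable names are hypothetical). The whole procedure is a few lines of fixed control flow, which is why "it sorts lists" seems like a more informative description than "it knows how to sort lists":

```python
def quicksort(items):
    """Return a new sorted list using a simple, non-in-place quicksort."""
    if len(items) <= 1:
        return list(items)
    pivot, rest = items[0], items[1:]
    # Partition the remaining elements around the pivot.
    smaller = [x for x in rest if x < pivot]
    larger_or_equal = [x for x in rest if x >= pivot]
    return quicksort(smaller) + [pivot] + quicksort(larger_or_equal)


print(quicksort([3, 1, 2]))  # prints [1, 2, 3]
```

Nothing in this code can be queried for facts about orderings or lists; it only transforms its input, which is the sense in which the analogy is meant.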

Comment by Richard_Ngo (ricraz) on Agency in Conway’s Game of Life · 2021-05-13T19:10:43.462Z · LW · GW

I don't think there is a fundamental difference in kind between trees, bacteria, humans, and hypothetical future AIs

There's at least one important difference: some of these are intelligent, and some of these aren't.

It does seem plausible that the category boundary you're describing is an interesting one. But when you indicate in your comment below that you see the "AI hypothesis" and the "life hypothesis" as very similar, then that mainly seems to indicate that you're using a highly nonstandard definition of AI, which I expect will lead to confusion.

Comment by Richard_Ngo (ricraz) on Agency in Conway’s Game of Life · 2021-05-13T15:41:18.131Z · LW · GW

It feels like this post pulls a sleight of hand. You suggest that it's hard to solve the control problem because of the randomness of the starting conditions. But this is exactly the reason why it's also difficult to construct an AI with a stable implementation. If you can do the latter, then you can probably also create a much simpler system which creates the smiley face.

Similarly, in the real world, there's a lot of randomness which makes it hard to carry out tasks. But there are a huge number of strategies for achieving things in the world which don't require instantiating an intelligent controller. For example, trees and bacteria started out small but have now radically reshaped the earth. Do they count as having "perception, cognition, and action that are recognizably AI-like"?

Comment by Richard_Ngo (ricraz) on Challenge: know everything that the best go bot knows about go · 2021-05-13T15:27:20.690Z · LW · GW

The human knows the rules and the win condition. The optimisation algorithm doesn't, for the same reason that evolution doesn't "know" what dying is: neither is the type of entity to which you should ascribe knowledge.

Comment by Richard_Ngo (ricraz) on Challenge: know everything that the best go bot knows about go · 2021-05-13T15:23:02.837Z · LW · GW

it's not obvious to me that this is a realistic target

Perhaps I should instead have said: it'd be good to explain to people why this might be a useful/realistic target. Because if you need propositions that cover all the instincts, then it seems like you're basically asking for people to revive GOFAI.

(I'm being unusually critical of your post because it seems that a number of safety research agendas lately have become very reliant on highly optimistic expectations about progress on interpretability, so I want to make sure that people are forced to defend that assumption rather than starting an information cascade.)

Comment by Richard_Ngo (ricraz) on Challenge: know everything that the best go bot knows about go · 2021-05-11T10:26:58.485Z · LW · GW

As an additional reason for the importance of tabooing "know", note that I disagree with all three of your claims about what the model "knows" in this comment and its parent.

(The definition of "know" I'm using is something like "knowing X means possessing a mental model which corresponds fairly well to reality, from which X can be fairly easily extracted".)

Comment by Richard_Ngo (ricraz) on Challenge: know everything that the best go bot knows about go · 2021-05-11T10:24:57.979Z · LW · GW

I think at this point you've pushed the word "know" to a point where it's not very well-defined; I'd encourage you to try to restate the original post while tabooing that word.

This seems particularly valuable because there are some versions of "know" for which the goal of knowing everything a complex model knows seems wildly unmanageable (for example, trying to convert a human athlete's ingrained instincts into a set of propositions). So before people start trying to do what you suggested, it'd be good to explain why it's actually a realistic target.

Comment by Richard_Ngo (ricraz) on Gradations of Inner Alignment Obstacles · 2021-05-06T16:00:24.440Z · LW · GW

I used to define "agent" as "both a searcher and a controller"

Oh, I really like this definition. Even if it's too restrictive, it seems like it gets at something important.

I'm not sure what you meant by "more compressed".

Sorry, that was quite opaque. I guess what I mean is that evolution is an optimiser but isn't an agent, and in part this has to do with how it's a very distributed process with no clear boundary around it. Whereas when you have the same problem being solved in a single human brain, then that compression makes it easier to point to the human as being an agent separate from its environment.

The rest of this comment is me thinking out loud in a somewhat incoherent way; no pressure to read/respond.

It seems like calling something a "searcher" describes only a very simple interface: at the end of the search, there needs to be some representation of the output which it has found. But that output may be very complex.

Whereas calling something a "controller" describes a much more complex interface between it and its environment: you need to be able to point not just to outcomes, but also to observations and actions. But each of those actions is usually fairly simple for a pure controller; if it's complex, then you need search to find which action to take at each step.
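The difference between the two interfaces can be sketched in code. This is only a toy illustration under stated assumptions: the function names, the scoring function, and the thermostat thresholds are all hypothetical, and real searchers and controllers are of course far richer than this.

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def search(candidates: Iterable[T], score: Callable[[T], float]) -> T:
    """Searcher interface: all that is exposed is the final output found."""
    return max(candidates, key=score)


def thermostat(observed_temp: float, target_temp: float = 20.0) -> str:
    """Controller interface: maps each observation to a simple action."""
    if observed_temp < target_temp - 0.5:
        return "heat"
    if observed_temp > target_temp + 0.5:
        return "cool"
    return "idle"


# The searcher is called once and returns one (possibly complex) output.
best = search(range(-10, 11), score=lambda x: -(x - 3) ** 2)

# The controller is called repeatedly, once per observation.
actions = [thermostat(t) for t in (17.0, 20.2, 23.5)]

print(best, actions)  # prints 3 ['heat', 'idle', 'cool']
```

The searcher's signature mentions only candidates and an output, whereas the controller's signature has to mention observations and actions, even though each individual action is trivial.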

Now, it seems useful to sometimes call evolution a controller. For example, suppose you're trying to wipe out a virus, but it keeps mutating. Then there's a straightforward sense in which evolution is "steering" the world towards states where the virus still exists, in the short term. You could also say that it's steering the world towards states where all organisms have high fitness in the long term, but organisms are so complex that it's easier to treat them as selected outcomes, and abstract away from the many "actions" by evolution which led to this point.

In other words, evolution searches using a process of iterative control. Whereas humans control using a process of iterative search.

(As a side note, I'm now thinking that "search" isn't quite the right word, because there are other ways to do selection than search. For example, if I construct a mathematical proof (or a poem) by writing it one line at a time, letting my intuition guide me, then it doesn't really seem accurate to say that I'm searching over the space of proofs/poems. Similarly, a chain of reasoning may not branch much, but still end up finding a highly specific conclusion. Yet "selection" also doesn't really seem like the right word either, because it's at odds with normal usage, which involves choosing from a preexisting set of options - e.g. you wouldn't say that a poet is "selecting" a poem. How about "design" as an alternative? Which allows us to be agnostic about how the design occurred - whether it be via a control process like evolution, or a process of search, or a process of reasoning.)

Comment by Richard_Ngo (ricraz) on What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs) · 2021-05-05T09:52:16.955Z · LW · GW

My default would be "raahp", which doesn't have any of the problems you mentioned.

Comment by Richard_Ngo (ricraz) on Why I Work on Ads · 2021-05-04T16:03:07.867Z · LW · GW

+1 for making the case for a side that's not the one your personal feelings lean towards.

Comment by Richard_Ngo (ricraz) on Gradations of Inner Alignment Obstacles · 2021-04-30T16:12:10.279Z · LW · GW

To me it sounds like you're describing (some version of) agency, and so the most natural term to use would be mesa-agent.

I'm a bit confused about the relationship between "optimiser" and "agent", but I tend to think of the latter as more compressed, and so insofar as we're talking about policies it seems like "agent" is appropriate. Also, mesa-optimiser is taken already (under a definition which assumes that optimisation is equivalent to some kind of internal search).

Comment by Richard_Ngo (ricraz) on Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More · 2021-04-30T09:47:55.697Z · LW · GW

Yann LeCun: ... instrumental subgoals are much weaker drives of behavior than hardwired objectives. Else, how could one explain the lack of domination behavior in non-social animals, such as orangutans?

What's your specific critique of this? I think it's an interesting and insightful point.

Comment by Richard_Ngo (ricraz) on Coherence arguments imply a force for goal-directed behavior · 2021-04-28T13:52:27.097Z · LW · GW

My internal model of you is that you believe this approach would not be enough because the utility would not be defined on the internal concepts of the agent. Yet I think it doesn't have so much to be defined on these internal concepts itself than to rely on some assumption about these internal concepts.

Yeah, this is an accurate portrayal of my views. I'd also note that the project of mapping internal concepts to mathematical formalisms was the main goal of the whole era of symbolic AI, and failed badly. (Although the analogy is a little loose, so I wouldn't take it as a decisive objection, but rather a nudge to formulate a good explanation of what they were doing wrong that you will do right.)

I agree more and more with you that the big mistake with using utility functions/reward for thinking about goal-directedness is not so much that they are a bad abstraction, but that they are often used as if every utility function is as meaningful as any other.

I don't think this is an accurate portrayal of my views. I am trying to say that utility functions are a bad abstraction for reasoning about AGI, for similar reasons to why health points are a bad abstraction for reasoning about livers. (I think I agree with the rest of the paragraph though.)

Comment by Richard_Ngo (ricraz) on Coherence arguments imply a force for goal-directed behavior · 2021-04-28T09:53:32.152Z · LW · GW

Wouldn't these coherence arguments be pretty awesome? Wouldn't this be a massive step forward in our understanding (both theoretical and practical) of health, damage, triage, and risk allocation?

Insofar as such a system could practically help doctors prioritise, then that would be great. (This seems analogous to how utilities are used in economics.)

But if doctors use this concept to figure out how to treat patients, or use it when designing prostheses for their patients, then I expect things to go badly. If you take HP as a guiding principle - for example, you say "our aim is to build an artificial liver with the most HP possible" - then I'm worried that this would harm your ability to understand what a healthy liver looks like on the level of cells, or tissues, or metabolic pathways, or roles within the digestive system. Because HP is just not a well-defined concept at that level of resolution.

Analogously, it seems very hard to have a good understanding of goals without talking about concepts, instincts, desires, etc, and the roles that all of these play within cognition as a whole - concepts which people just don't talk about much around here. I hypothesise that this is partly because they think they can talk about utilities instead. But when people reason about how to design AGIs in terms of utilities, on the basis of coherence theorems, then I think they're making a mistake very similar to that of a doctor who tries to design artificial livers based on the theoretical triage virtues of HP.

Comment by Richard_Ngo (ricraz) on Gradations of Inner Alignment Obstacles · 2021-04-26T09:16:20.388Z · LW · GW

Do you think that's a problem?

I'm inclined to think so, mostly because terms shouldn't be introduced unnecessarily. If we can already talk about systems that are capable/competent at certain tasks, then we should just do that directly.

I guess the mesa- prefix helps point towards the fact that we're talking about policies, not policies + optimisers.

Probably my preferred terminology would be:

  • Instead of mesa-controller, "competent policy".
  • And then we can say that competent policies sometimes implement search or learning (or both, or possibly neither).
  • And when we need to be clear, we can add the mesa- prefix to search or learning. (Although I'm not sure whether something like AlphaGo is a mesa-searcher - does the search need to be emergent?)

This helps make it clear that mesa-controller isn't a disjoint category from mesa-searcher, and also that mesa-controller is the default, rather than a special case.

Having written all this I'm now a little confused about the usefulness of the mesa-optimisation terminology at all, and I'll need to think about it more. In particular, it's currently unclear to me what the realistic alternative to mesa-optimisation is, which makes me wonder if it's actually carving off an important set of possibilities, or just reframing the whole space of possibilities. (If the policy receives a gradient update every minute, is it useful to call it a mesa-optimiser? Or every hour? Or...)

Comment by Richard_Ngo (ricraz) on Gradations of Inner Alignment Obstacles · 2021-04-23T09:55:06.270Z · LW · GW

Mesa-controller refers to any effective strategies, including mesa-searchers but also "dumber" strategies which nonetheless effectively steer toward a misaligned objective. For example, thermostat-like strategies, or strategies which have simply memorized a number of effective interventions.

I'm confused about what wouldn't qualify as a mesa-controller. In practice, is this not synonymous with "capable"?

Also, why include "misaligned" in this definition? If mesa-controller turns out to be a useful concept, then I'd want to talk about both aligned and misaligned mesa-controllers.

Comment by Richard_Ngo (ricraz) on Against "Context-Free Integrity" · 2021-04-15T19:16:45.377Z · LW · GW

You’re saying things like ‘provocative’ and ‘mindkilling’ and ‘invoking tribal loyalties’, but you’ve not made any arguments relating that to my writing

I should be clear here that I'm talking about a broader phenomenon, not specifically your writing. As I noted above, your post isn't actually a central example of the phenomenon. The "tribal loyalties" thing was primarily referring to people's reactions to the SSC/NYT thing. Apologies if it seemed like I was accusing you personally of all of these things. (The bits that were specific to your post were mentions of "evil" and "disgust".)

Nor am I saying that we should never talk about emotions; I do think that's important. But we should try to also provide argumentative content which isn't reliant on the emotional content. If we make strong claims driven by emotions, then we should make sure to also defend them in less emotionally loaded ways, in a way which makes them compelling to someone who doesn't share these particular emotions. For example, in the quotation you gave, what makes science's principles "fake" just because they failed in psychology? Is that person applying an isolated demand for rigour because they used to revere science? I can only evaluate this if they defend their claims more extensively elsewhere.

On the specific example of Facebook, I disagree that you're using evil in a central way. I think the central examples of evil are probably mass-murdering dictators. My guess is that opinions would be pretty divided about whether to call drug dealers evil (versus, say, amoral); and the same for soldiers, even when they end up causing a lot of collateral damage.

Your conclusion that Facebook is evil seems particularly and unusually strong because your arguments are also applicable to many TV shows, game producers, fast food companies, and so on. Which doesn't make those arguments wrong, but it means that they need to meet a pretty high bar, since either Facebook is significantly more evil than all these other groups, or else we'll need to expand the scope of words like "evil" until they refer to a significant chunk of society (which would be quite different from how most people use it).

(This is not to over-focus on the specific word "evil", it's just the one you happened to use here. I have similar complaints about other people using the word "insane" gratuitously; about people casually comparing current society to Stalinist Russia or the Cultural Revolution; and so on.)

Comment by Richard_Ngo (ricraz) on Against "Context-Free Integrity" · 2021-04-15T09:19:41.424Z · LW · GW

Whether I agree with this point or not depends on whether you're using Ben's framing of the costs and benefits, or the framing I intended; I can't tell.

Comment by Richard_Ngo (ricraz) on Against "Context-Free Integrity" · 2021-04-15T09:06:27.999Z · LW · GW

I think we're talking past each other a little, because we're using "careful" in two different senses. Let's say careful1 is being careful to avoid reputational damage or harassment. Careful2 is being careful not to phrase claims in ways that make it harder for you or your readers to be rational about the topic (even assuming a smart, good-faith audience).

It seems like you're mainly talking about careful1. In the current context, I am not worried about backlash or other consequences from failure to be careful1. I'm talking about careful2. When you "aim to say valuable truths that aren't said elsewhere", you can either do so in a way that is careful2 to be nuanced and precise, or you can do so in a way that is tribalist and emotionally provocative and mindkilling. From my perspective, the ability to do the former is one of the core skills of rationality.

In other words, it's not just a question of the "worst" interpretation of what you write; rather, I think that very few people (even here) are able to dispassionately evaluate arguments which call things "evil" and "disgusting", or which invoke tribal loyalties. Moreover, such arguments are often vague because they appeal to personal standards of "evil" or "insane" without forcing people to be precise about what they mean by it (e.g. I really don't know what you actually mean when you say Facebook is evil). So even if your only goal is to improve your personal understanding of what you're writing about, I would recommend being more careful2.

Comment by Richard_Ngo (ricraz) on Against "Context-Free Integrity" · 2021-04-14T21:03:51.048Z · LW · GW

I'm saying we should strive to do better than Twitter on the metric of "being careful with strongly valenced terminology", i.e. being more careful. I'm not quite sure what point you're making - it seems like you think it'd be better to be less careful?

In any case, the reference to Twitter was just a throwaway example; my main argument is that our standards for longer form discussions on LessWrong should involve being more careful with strongly valenced terminology than people currently are.

Comment by Richard_Ngo (ricraz) on Against "Context-Free Integrity" · 2021-04-14T17:34:01.233Z · LW · GW

There's something that's been bugging me lately about the rationalist discourse on moral mazes, political power structures, the NYT/SSC kerfuffle, etc. People are making unusually strong non-consequentialist moral claims without providing concomitantly strong arguments, or acknowledging the ways in which this is a judgement-warping move.

I don't think that being non-consequentialist is always wrong. But I do think that we have lots of examples of people being blinded by non-consequentialist moral intuitions, and it seems like rationalists around me are deliberately invoking risk factors. Some of the risk factors: strong language, tribalism, deontological rules, judgements about the virtue of people or organisations, and not even trying to tell a story about specific harms.

Your post isn't a central example of this, but it seems like your argument is closely related to this phenomenon, and there are also a few quotes from your post which directly showcase the thing I'm criticising:

they would be notably more disgusted with the parts of that system they interacted with


They'd rather believe the things around them are pretty good rather than kinda evil. Evil means accounting, and accounting is boooring.


The first time was with Facebook, where he was way in advance of me coming to realize what was evil about it.

"Evil" is one of the most emotionally loaded words in the english language. Disgust is one of the most visceral and powerful emotions. Neither you nor I nor other readers are immune to having our judgement impaired by these types of triggers, especially when they're used regularly. (Edit: to clarify, I'm not primarily worried about worst-case interpretations; I'm worried about basically everyone involved.)

Now, I'm aware of major downsides of being too critical of strong language and bold claims. But being careful about gratuitously using words like "evil" and "insane" and "Stalinist" isn't an unusually high bar; even most people on Twitter manage it.

Other closely-related examples: people invoking anti-media tribalism in defence of SSC; various criticisms of EA for not meeting highly scrupulous standards of honesty (using words like "lying" and "scam"); talking about "attacks" and "wars"; taking hard-line views on privacy and the right not to be doxxed; etc.

Oh, and I should also acknowledge that my calls for higher epistemic standards are driven to a significant extent by epistemically-deontological intuitions. And I do think this has warped my judgement somewhat, because those intuitions lead to strong reactions to people breaking the "rules". I think the effect is likely to be much stronger when driven by moral (not just epistemic) intuitions, as in the cases discussed above.

Comment by Richard_Ngo (ricraz) on The Counterfactual Prisoner's Dilemma · 2021-04-06T09:49:34.019Z · LW · GW

Someone might say, well I understand that if I don't pay, then it means I would have lost out if it had come up heads, but since I know it didn't come up heads, I don't care. Making this more precise, when constructing counterfactuals for a decision, if we know fact F about the world before we've made our decision, F must be true in every counterfactual we construct (call this Principle F).

The problem is that principle F elides the difference between facts which are logically caused by your decision, and facts which aren't. For example, in Parfit's hitchhiker, my decision not to pay after being picked up logically causes me not to be picked up. The result of that decision would be a counterpossible world: a world in which the same decision algorithm outputs one thing at one point, and a different thing at another point. But in counterfactual mugging, if you choose not to pay, then this doesn't result in a counterpossible world.

I think we should construct counterfactuals where the agent's TAILS policy is independent of its HEADS policy, whilst you think we should construct counterfactuals where they are linked.

The whole point of functional decision theory is that it's very unlikely for these two policies to differ. For example, consider the Twin Prisoner's Dilemma, but where the walls of one room are green, and the walls of the other are blue. This shouldn't make any difference to the outcome: we should still expect both agents to cooperate, or both agents to defect. But the same is true for heads vs tails in Counterfactual Prisoner's Dilemma - they're specific details which distinguish you from your counterfactual self, but don't actually influence any decisions.

Comment by Richard_Ngo (ricraz) on The Counterfactual Prisoner's Dilemma · 2021-04-05T11:50:41.383Z · LW · GW

by only considering the branches of reality that are consistent with our knowledge

I know that, in the branch of reality which actually happened, Omega predicted my counterfactual behaviour. I know that my current behaviour is heavily correlated with my counterfactual behaviour. So I know that I can logically cause Omega to give me $10,000. This seems exactly equivalent to Newcomb's problem, where I can also logically cause Omega to give me a lot of money.

So if by "considering [other branches of reality]" you mean "taking predicted counterfactuals into account when reasoning about logical causation", then Counterfactual Prisoner's Dilemma doesn't give us anything new.

If by "considering [other branches of reality]" you instead mean "acting to benefit my counterfactual self", then I deny that this is what is happening in CPD. You're acting to benefit your current self, via logical causation, just like in the Twin Prisoner's Dilemma. You don't need to care about your counterfactual self at all. So it's disanalogous to Counterfactual Mugging, where the only reason to pay is to help your counterfactual self.

Comment by Richard_Ngo (ricraz) on The Counterfactual Prisoner's Dilemma · 2021-04-04T14:10:53.715Z · LW · GW

I don't see why the Counterfactual Prisoner's Dilemma persuades you to pay in the Counterfactual Mugging case. In the counterfactual prisoner's dilemma, I pay because that action logically causes Omega to give me $10,000 in the real world (via influencing the counterfactual). This doesn't require shifting the locus of evaluation to policies, as long as we have a good theory of which actions are correlated with which other actions (e.g. paying in heads-world and paying in tails-world).

In the counterfactual mugging, by contrast, the whole point is that paying doesn't cause any positive effects in the real world. So it seems perfectly consistent to pay in the counterfactual prisoner's dilemma, but not in the counterfactual mugging.
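
To make the contrast concrete, here's a minimal sketch of the payoffs in the branch that actually happens. It assumes the usual setup, which isn't spelled out in this comment: Omega asks for $100 (the `COST` value is my assumption), and the function names and the representation of a policy as a pair of booleans are purely illustrative.

```python
COST = 100       # assumed price Omega asks for (not stated in this comment)
PRIZE = 10_000   # the reward mentioned above


def cpd_payoff(pays_heads, pays_tails, coin):
    """Real-world payoff in the Counterfactual Prisoner's Dilemma.

    Omega asks for COST whichever way the coin lands, and pays PRIZE
    if I would have paid had the coin landed the other way.
    """
    pays_here = pays_heads if coin == "heads" else pays_tails
    pays_other = pays_tails if coin == "heads" else pays_heads
    return (PRIZE if pays_other else 0) - (COST if pays_here else 0)


def mugging_payoff(pays_tails, coin):
    """Real-world payoff in the Counterfactual Mugging.

    I'm only asked to pay on tails, and only rewarded on heads
    (if I would have paid on tails).
    """
    if coin == "heads":
        return PRIZE if pays_tails else 0
    return -COST if pays_tails else 0


# Linked policies: I do the same thing in both branches.
cpd_gain_from_paying = cpd_payoff(True, True, "tails") - cpd_payoff(False, False, "tails")
mugging_gain_from_paying = mugging_payoff(True, "tails") - mugging_payoff(False, "tails")
```

If my heads-policy and tails-policy are linked, paying in CPD leaves me $9,900 better off in the real world, whereas paying in the counterfactual mugging only ever costs me $100 in the branch where I'm asked.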

Comment by Richard_Ngo (ricraz) on Coherence arguments imply a force for goal-directed behavior · 2021-03-30T15:01:55.394Z · LW · GW

Thanks for writing this post, Katja; I'm very glad to see more engagement with these arguments. However, I don't think the post addresses my main concern about the original coherence arguments for goal-directedness, which I'd frame as follows:

There's some intuitive conception of goal-directedness, which is worrying in the context of AI. The old coherence arguments implicitly used the concept of EU-maximisation as a way of understanding goal-directedness. But Rohin demonstrated that the most straightforward conception of EU-maximisation (which I'll call behavioural EU-maximisation) is inadequate as a theory of goal-directedness, because it applies to any agent. In order to fix this problem, the main missing link is not a stronger (probabilistic) argument for why AGIs will be coherent EU-maximisers, but rather an explanation of what it even means for a real-world agent to be a coherent EU-maximiser, which we don't currently have.

By "behavioural EU-maximisation", I mean thinking of a utility function as something that we define purely in terms of an agent's behaviour. In response to this, you identify an alternative definition of expected utility maximisation which isn't purely behavioural, but also refers to an agent's internal features:

An outside observer being able to rationalize a sequence of observed behavior as coherent doesn’t mean that the behavior is actually coherent. Coherence arguments constrain combinations of external behavior and internal features—‘preferences’ and beliefs. So whether an actor is coherent depends on what preferences and beliefs it actually has.

But you don't characterise those internal features in a satisfactory way, or point to anyone else who does. The closest you get is in your footnote, where you fall back on a behavioural definition of preferences:

When exactly an aspect of these should be considered a ‘preference’ for the sake of this argument isn’t entirely clear to me, but would seem to depend on something like whether it tends to produce actions favoring certain outcomes over other outcomes across a range of circumstances

I'm sympathetic to this, because it's hard to define preferences without reference to behaviour. We just don't know enough about cognitive science yet to do so. But it means that your conception of EU-maximisation is still vulnerable to Rohin's criticisms of behavioural EU-maximisation, because you still have to extract preferences from behaviour.
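
As a reminder of why the purely behavioural version is vacuous, here's a toy sketch in the spirit of Rohin's argument (my own construction, not his). It assumes a deterministic setting, so "expected" utility is just utility; the action set, horizon and policy are all hypothetical. For any policy whatsoever, we can write down a utility function over trajectories which that policy maximises.

```python
from itertools import product

ACTIONS = ["left", "right"]  # hypothetical toy action set
HORIZON = 3                  # hypothetical episode length


def rationalising_utility(policy):
    """Return a utility function over action sequences that `policy` maximises.

    It assigns 1 to the exact sequence the policy produces and 0 to every other.
    """
    history = []
    for _ in range(HORIZON):
        history.append(policy(tuple(history)))
    chosen = tuple(history)
    return lambda trajectory: 1.0 if tuple(trajectory) == chosen else 0.0


def erratic_policy(history):
    # An arbitrary behaviour with no obvious goal: alternate, starting with "right".
    return "right" if len(history) % 2 == 0 else "left"


utility = rationalising_utility(erratic_policy)
# The utility-maximising trajectory is exactly what the policy does anyway.
best = max(product(ACTIONS, repeat=HORIZON), key=utility)
```

Since this works for any policy, knowing that an agent "maximises some utility function" in this sense tells us nothing about whether it's goal-directed; any definition which extracts preferences from behaviour needs to rule out constructions like this one.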

From my perspective, then, claims like "Anything that weakly has goals has reason to reform to become an EU maximizer" (as made in this comment) miss the crux of the disagreement. It's not that I believe the claim is false; I just don't know what it means, and I don't think anyone else does either. Unfortunately, the fact that there are theorems about EU maximisation in some restricted formalisms makes people think that it's a concept which is well-defined in real-world agents to a much greater extent than it actually is.

Here's an exaggerated analogy to help convey what I mean by "well-defined concept". Characters in games often have an attribute called health points (HP), and die when their health points drop to 0. Conceivably you could prove a bunch of theorems about health points in a certain class of games, e.g. that having more is always good. Okay, so is having more health points always good for real-world humans (or AIs)? I mean, we must have something like the health point formalism used in games, because if we take too much damage, we die! Sure, some critics say that defining health points in terms of external behaviour (like dying) is vacuous - but health points aren't just about behaviour, we can also define them in terms of an agent's internal features (like the tendency to die in a range of circumstances).

I would say that EU is like "health points": a concept which is interesting to reason about in some formalisms, and which is clearly related to an important real-world concept, but whose relationship to that non-formal real-world concept we don't yet understand well. Perhaps continued investigation can fix this; I certainly hope so! But in the meantime, using "EU-maximisation" instead of "goal-directedness" feels similar to using "health points" as a substitute for "health" - its main effect is to obscure our conceptual confusion under a misleading layer of formalism, thereby making the associated arguments seem stronger than they actually are.

Comment by Richard_Ngo (ricraz) on Against evolution as an analogy for how humans will create AGI · 2021-03-25T13:58:51.995Z · LW · GW

I personally found this post valuable and thought-provoking. Sure, there's plenty that it doesn't cover, but it's already pretty long, so that seems perfectly reasonable.

I particularly dislike your criticism of it as strawmanish. Perhaps that would be fair if the analogy between RL and evolution were a standard principle in ML. Instead, it's a vague idea that is often left implicit, or else formulated in idiosyncratic ways. So posts like this one have to do double duty in both outlining and explaining the mainstream viewpoint (often a major task in its own right!) and then criticising it. This is most important precisely in the cases where the defenders of an implicit paradigm don't have solid articulations of it, making it particularly difficult to understand what they're actually defending. I think this is such a case.

If you disagree, I'd be curious what you consider a non-strawmanish summary of the RL-evolution analogy. Perhaps Clune's AI-GA paper? But from what I can tell opinions of it are rather mixed, and the AI-GA terminology hasn't caught on.

Comment by Richard_Ngo (ricraz) on Against evolution as an analogy for how humans will create AGI · 2021-03-25T13:23:52.874Z · LW · GW

there’s a “solving the problem twice” issue. As mentioned above, in Case 5 we need both the outer and the inner algorithm to be able to do open-ended construction of an ever-better understanding of the world—i.e., we need to solve the core problem of AGI twice with two totally different algorithms! (The first is a human-programmed learning algorithm, perhaps SGD, while the second is an incomprehensible-to-humans learning algorithm. The first stores information in weights, while the second stores information in activations, assuming a GPT-like architecture.)

Cross-posting a (slightly updated) comment I left on a draft of this document:

I suspect that this is indexed too closely to what current neural networks look like. I see no good reason why the inner algorithm won't eventually be able to change the weights as well, as in human brains. (In fact, this might be a crux for me - I agree that the inner algorithm having no ability to edit the weights seems far-fetched).

So then you might say that we've introduced a disanalogy to evolution, because humans can't edit our genome.

But the key reason I think that RL is roughly analogous to evolution is because it shapes the high-level internal structure of a neural network in roughly the same way that evolution shapes the high-level internal structure of the human brain, not because there's a totally strict distinction between levels.

E.g. the thing RL currently does, which I don't expect the inner algorithm to be able to do, is make the first three layers of the network vision layers, and then a big region over on the other side the language submodule, and so on. And eventually I expect RL to shape the way the inner algorithm does weight updates, via meta-learning.

You seem to expect that humans will be responsible for this sort of high-level design. I can see the case for that, and maybe humans will put in some modular structure, but the trend has been pushing the other way. And even if humans encode a few big modules (analogous to, say, the distinction between the neocortex and the subcortex), I expect there to be much more complexity in how those actually work which is determined by the outer algorithm (analogous to the hundreds of regions which appear across most human brains).

Comment by Richard_Ngo (ricraz) on Against evolution as an analogy for how humans will create AGI · 2021-03-25T12:57:12.796Z · LW · GW

It seems totally plausible to give AI systems an external memory that they can read to / write from, and then you learn linear algebra without editing weights but with editing memory. Alternatively, you could have a recurrent neural net with a really big hidden state, and then that hidden state could be the equivalent of what you're calling "synapses".

I agree with Steve that it seems really weird to have these two parallel systems of knowledge encoding the same types of things. If an AGI learned the skill of speaking english during training, but then learned the skill of speaking french during deployment, then your hypotheses imply that the implementations of those two language skills will be totally different. And it then gets weirder if they overlap - e.g. if an AGI learns a fact during training which gets stored in its weights, and then reads a correction later on during deployment, do those original weights just stay there?

I do expect that we will continue to update AGI systems via editing weights in training loops, even after deployment. But this will be more like an iterative train-deploy-train-deploy cycle where each deploy step lasts e.g. days or more, rather than editing weights all the time (as with humans).

Based on this I guess your answer to my question above is "no": the original fact will get overridden a few days later, and also the knowledge of french will be transferred into the weights eventually. But if those updates occur via self-supervised learning, then I'd count that as "autonomously edit[ing] its weights after training". And with self-supervised learning, you don't need to wait long for feedback, so why wouldn't you use it to edit weights all the time? At the very least, that would free up space in the short-term memory/hidden state.

For my own part I'm happy to concede that AGIs will need some way of editing their weights during deployment. The big question for me is how continuous this is with the rest of the training process. E.g. do you just keep doing SGD, but with a smaller learning rate? Or will there be a different (meta-learned) weight update mechanism? My money's on the latter. If it's the former, then that would update me a bit towards Steve's view, but I think I'd still expect evolution to be a good analogy for the earlier phases of SGD.  

Maybe we just won't have AGI that learns by reading books, and instead it will be more useful to have a lot of task-specific AI systems with a huge amount of "built-in" knowledge, similarly to GPT-3.

If this is the case, then that would shift me away from thinking of evolution as a good analogy for AGI, because the training process would then look more like the type of skill acquisition that happens during human lifetimes. In fact, this seems like the most likely way in which Steve is right that evolution is a bad analogy.

Comment by Richard_Ngo (ricraz) on The case for aligning narrowly superhuman models · 2021-03-06T14:43:22.890Z · LW · GW

Nice post. The one thing I'm confused about is:

Institutionally, we are very uncertain whether to prioritize this (and if we do where it should be housed and how our giving should be structured).

It seems to me that the type of research you're discussing here is already seen as a standard way to make progress on the full alignment problem - e.g. the Stiennon et al. paper you cited, plus earlier work on reward modeling by Christiano, Leike, and others. Can you explain why you're institutionally uncertain whether to prioritise it - is it because of the objections you outlined? But your responses to them seem persuasive to me - and more generally, the objections don't seem to address the fact that a bunch of people who are trying to solve long-term alignment problems actually ended up doing this research. So I'd be interested to hear elaborations and defences of those objections from people who find them compelling.

Comment by Richard_Ngo (ricraz) on Book review: "A Thousand Brains" by Jeff Hawkins · 2021-03-05T17:34:22.338Z · LW · GW

Great post, and I'm glad to see the argument outlined in this way. One big disagreement, though:

the Judge box will house a relatively simple algorithm written by humans

I expect that, in this scenario, the Judge box would house a neural network which is still pretty complicated, but which has been trained primarily to recognise patterns, and therefore doesn't need "motivations" of its own.

This doesn't rebut all your arguments for risk, but it does reframe them somewhat. I'd be curious to hear about how likely you think my version of the judge is, and why.

Comment by Richard_Ngo (ricraz) on Takeaways from one year of lockdown · 2021-03-03T06:06:57.437Z · LW · GW

Thanks for the post; I think this type of reflection is very valuable. The main takeaway from this line of thought for me is that we're in a community which selects for scrupulosity and caution as character traits, which then have a big impact on how we think about risks. This has various implications for thinking about AI, which I won't get into here.

Comment by Richard_Ngo (ricraz) on The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables · 2021-03-02T09:49:41.469Z · LW · GW

Thanks for the reply. To check that I understand your position, would you agree that solving outer alignment plus solving reward tampering would solve the pointers problem in the context of machine learning?

Broadly speaking, I think our disagreement here is closely related to one we've discussed before, about how much sense it makes to talk about outer alignment in isolation (and also about your definition of inner alignment), so I probably won't pursue this further.

Comment by Richard_Ngo (ricraz) on The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables · 2021-02-28T20:38:52.789Z · LW · GW

Above you say:

Now, the basic problem: our agent’s utility function is mostly a function of latent variables. ... Those latent variables:

  • May not correspond to any particular variables in the AI’s world-model and/or the physical world
  • May not be estimated by the agent at all (because lazy evaluation)
  • May not be determined by the agent’s observed data

… and of course the agent’s model might just not be very good, in terms of predictive power.

And you also discuss how:

Human "values" are defined within the context of humans' world-models, and don't necessarily make any sense at all outside of the model.

My two concerns are as follows. Firstly, that the problems mentioned in these quotes above are quite different from the problem of constructing a feedback signal which points to a concept which we know an AI already possesses. Suppose that you meet an alien and you have a long conversation about the human concept of happiness, until you reach a shared understanding of the concept. In other words, you both agree on what "the referents of these pointers" are, and what "the real-world things (if any) to which they’re pointing" are. But let's say that the alien still doesn't care at all about human happiness. Would you say that we have a "pointer problem" with respect to this alien? If so, it's a very different type of pointer problem than the one you have with respect to a child who believes in ghosts. I guess you could say that there are two different but related parts of the pointer problem? But in that case it seems valuable to distinguish more clearly between them.

My second concern is that requiring pointers to be sufficient "to get the AI to do what we mean" means that they might differ wildly depending on the motivation system of that specific AI and the details of "what we mean". For example, imagine that alien A is already willing to obey any commands you give, as long as it understands them; alien B can be induced to do so via operant conditioning; alien C would only acquire human values via neurosurgery; alien D would only do so after millennia of artificial selection. So in the context of alien A, a precise english phrase is a sufficient pointer; for alien B, a few labeled examples qualify as a pointer; for alien C, identifying a specific cluster of neurons (and how it's related to surrounding neurons) serves as a pointer; for alien D, only a millennium of supervision is a sufficient pointer. And then these all might change when we're talking about pointing to a different concept.

And so adding the requirement that a pointer can "get the AI to do what we mean" makes it seem to me like the thing we're talking about is more like a whole alignment scheme than just a "pointer".

Comment by Richard_Ngo (ricraz) on The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables · 2021-02-26T15:18:45.039Z · LW · GW

The question then is, what would it mean for such an AI to pursue our values?

Why isn't the answer just that the AI should:
1. Figure out what concepts we have;
2. Adjust those concepts in ways that we'd reflectively endorse;
3. Use those concepts?

The idea that almost none of the things we care about could be adjusted to fit into a more accurate worldview seems like a very strongly skeptical hypothesis. Tables (or happiness) don't need to be "real in a reductionist sense" for me to want more of them.

Comment by Richard_Ngo (ricraz) on The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables · 2021-02-26T15:13:12.094Z · LW · GW

I agree with all the things you said. But you defined the pointer problem as: "what functions of what variables (if any) in the environment and/or another world-model correspond to the latent variables in the agent’s world-model?" In other words, how do we find the corresponding variables? I've given you an argument that the variables in an AGI's world-model which correspond to the ones in your world-model can be found by expressing your concept in english sentences.

The problem of determining how to construct a feedback signal which refers to those variables, once we've found them, seems like a different problem. Perhaps I'd call it the "motivation problem": given a function of variables in an agent's world-model, how do you make that agent care about that function? This is a different problem in part because, when addressing it, we don't need to worry about stuff like ghosts.

Using this terminology, it seems like the alignment problem reduces to the pointer problem plus the motivation problem.

Comment by Richard_Ngo (ricraz) on The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables · 2021-02-25T19:31:33.747Z · LW · GW

I need some way to say what the values-relevant pieces of my world model are "pointing to" in the real world. I think this problem - the “pointers to values” problem, and the “pointers” problem more generally - is the primary conceptual barrier to alignment right now.

It seems likely that an AGI will understand very well what I mean when I use english words to describe things, and also what a more intelligent version of me with more coherent concepts would want those words to actually refer to. Why does this not imply that the pointers problem will be solved?

I agree that there's something like what you're describing which is important, but I don't think your description pins it down.

Comment by Richard_Ngo (ricraz) on Distinguishing claims about training vs deployment · 2021-02-25T17:09:17.984Z · LW · GW

I think 'robust instrumentality' is basically correct for optimal actions, because there's no question of 'emergence': optimal actions just are.

If I were to put my objection another way: I usually interpret "robust" to mean something like "stable under perturbations". But the perturbation of "change the environment, and then see what the new optimal policy is" is a rather unnatural one to think about; most ML people would more naturally think about perturbing an agent's inputs, or its state, and seeing whether it still behaved instrumentally.

A more accurate description might be something like "ubiquitous instrumentality"? But this isn't a very aesthetically pleasing name.

Comment by Richard_Ngo (ricraz) on Distinguishing claims about training vs deployment · 2021-02-22T15:34:22.065Z · LW · GW

Can you elaborate? 'Robust' seems natural for talking about robustness to perturbation in the initial AI design (different objective functions, to the extent that that matters) and robustness against choice of environment.

The first ambiguity I dislike here is that you could either be describing the emergence of instrumentality as robust, or the trait of instrumentality as robust. It seems like you're trying to do the former, but because "robust" modifies "instrumentality", the latter is a more natural interpretation.

For example, if I said "life on earth is very robust", the natural interpretation is: given that life exists on earth, it'll be hard to wipe it out. Whereas an emergence-focused interpretation (like yours) would be: life would probably have emerged given a wide range of initial conditions on earth. But I imagine that very few people would interpret my original statement in that way.

The second ambiguity I dislike: even if we interpret "robust instrumentality" as the claim that "the emergence of instrumentality is robust", this still doesn't get us what we want. Bostrom's claim is not just that instrumental reasoning usually emerges; it's that specific instrumental goals usually emerge. But "instrumentality" is more naturally interpreted as the general tendency to do instrumental reasoning.

On switching costs: Bostrom has been very widely read, so changing one of his core terms will be much harder than changing a niche working handle like "optimisation daemon", and would probably leave a whole bunch of people confused for quite a while. I do agree the original term is flawed though, and will keep an eye out for potential alternatives - I just don't think robust instrumentality is clear enough to serve that role.

Comment by Richard_Ngo (ricraz) on Distinguishing claims about training vs deployment · 2021-02-11T05:54:39.194Z · LW · GW

Yepp, this is a good point. I agree that there won't be a sharp distinction, and that ML systems will continue to do online learning throughout deployment. Maybe I should edit the post to point this out. But three reasons why I think the training/deployment distinction is still underrated:

  1. In addition to the clarifications from this post, I think there are a bunch of other concepts (in particular recursive self-improvement and reward hacking) which weren't originally conceived in the context of modern ML, but which it's very important to understand in the context of ML.
  2. Most ML and safety research doesn't yet take transfer learning very seriously; that is, it's still in the paradigm where you train in (roughly) the environment that you measure performance on. Emphasising the difference between training and deployment helps address this. For example, I've pointed out in various places that there may be no clear concept of "good behaviour" during the vast majority of training, potentially undermining efforts to produce aligned reward functions during training.
  3. It seems reasonable to expect that early AGIs will become generally intelligent before being deployed on real-world tasks; and that their goals will also be largely determined before deployment. And therefore, insofar as what we care about is giving them the right underlying goals, the relatively small amount of additional supervision they'll gain during deployment isn't a primary concern.