Comment by paulfchristiano on Towards formalizing universality · 2019-01-17T20:34:02.668Z · score: 4 (2 votes) · LW · GW

Suppose that I convinced you "if you didn't know much chemistry, you would expect this AI to yield good outcomes." I think you should be pretty happy. It may be that the AI would predictably cause a chemistry-related disaster in a way that would be obvious to you if you knew chemistry, but overall I think you should expect not to have a safety problem.

This feels like an artifact of a deficient definition, I should never end up with a lemma like "if you didn't know much chemistry, you'd expect this AI to to yield good outcomes" rather than being able to directly say what we want to say.

That said, I do see some appeal in proving things like "I expect running this AI to be good," and if we are ever going to prove such statements they are probably going to need to be from some impoverished perspective (since it's too hard to bring all of the facts about our actual epistemic state into such a proof), so I don't think it's totally insane.

If we had a system that is ascription universal from some impoverished perspective, you may or may not be OK. I'm not really worrying about it; I expect this definition to change before the point where I literally end up with a system that is ascription universal from some impoverished perspective, and this definition seems good enough to guide next research steps.

Comment by paulfchristiano on Towards formalizing universality · 2019-01-16T21:28:44.260Z · score: 4 (2 votes) · LW · GW

In order to satisfy this definition, 𝔼¹ needs to know every particular fact 𝔼 knows. It would be nice to have a definition that got at the heart of the matter while relaxing this requirement.

I don't think your condition gets around this requirement. Suppose that Y is a bit that 𝔼 knows and 𝔼¹ does not, Z[0] and Z[1] are two hard-to-estimate quantities (that 𝔼¹ and 𝔼² know but 𝔼 does not), and that X=Z[Y].

Comment by paulfchristiano on The E-Coli Test for AI Alignment · 2019-01-16T20:10:53.117Z · score: 4 (2 votes) · LW · GW
Perhaps you say “these cells are too simple, they can’t learn/reflect/etc.” Well, chances are humans will have the same issue once the computational burden gets large enough.

I don't think the situations is symmetrical here.

Humans have easy-to-extract preferences over possible "wiser versions of ourselves." That is, you can give me a menu of slightly modified versions of myself, and I can try to figure out which of those best capture my real values (or over what kind of process should be used for picking which of those best capture my real values, or etc.). Those wiser versions of ourselves can in turn have preferences over even wiser/smarter versions of ourselves, and we can hope that the process might go on ad infinitum.

It may be that the process with humans eventually hits a ceiling---we prefer that we become smarter and wiser in some obvious ways, but then eventually we've picked the low hanging fruit and we are at a loss for thinking about how to change without compromising our values. Or it may be that we are wrong about our preferences, and that iterating this deliberative process goes somewhere crazy.

But those are pretty fundamentally different from the situation with E. coli, where we have no way to even get the process started. In particular, the difficulty of running the process with E. coli doesn't give us much information about whether the process with humans would top out or go off the rails, once we know that humans are able to get the process started.

The reward engineering problem

2019-01-16T18:47:24.075Z · score: 18 (3 votes)
Comment by paulfchristiano on Towards formalizing universality · 2019-01-16T18:16:39.658Z · score: 4 (2 votes) · LW · GW
What does "expect" mean here? Probability > .5?

We don't think that C has any expected edge in identifying an attack that A does not---if C is up to something, and A can't tell, then we think that the "being up to something" provides no evidence about whether something bad is about to happen. This is much stronger than a fixed probability of failure.

Also, can you walk through an example of how C might "trick" A[C] and how subjective dominance implies that we don't expect C to succeed?

Simple caricatured examples:

  • C might propose a design for a computer that has a backdoor that an attacker can use to take over the computer. But if this backdoor will actually be effective, then A[C] will know about it.
  • C might propose a design that exploits a predictable flaw in A's reasoning (e.g. overlooking consequences of a certain kind, being overly optimistic about some kinds of activities, incorrectly equating two importantly different quantities...). But then A[C] will know about it, and so if A[C] actually reasons in that way then (in some sense) it is endorsed.
I don't know how to make sense of this. If an impoverished perspective expects C not to be able to "trick" A, it seems kind of obvious that's not good enough for me to think the same?

If an impoverished perspective expects C not to be able to trick A, then either:

  • In fact C won't be able to trick A.
  • C will trick A, but the perspective is too weak to tell.

I think I don't quite understand what you are saying here, what exactly is obvious?

From a suitably advanced perspective it's obvious that C will be able to trick A sometimes---it will just get "epistemically lucky" and make an assumption that A regards as silly but turns out to be right.

Comment by paulfchristiano on Towards formalizing universality · 2019-01-16T08:24:49.486Z · score: 4 (2 votes) · LW · GW

I'm aiming for things like:

  • round debate dominating any fast computation with alternations (including an round debate)
  • max-HCH with budget dominating max-HCH with budget for some constant .
  • HCH with advice and budget dominating HCH with no advice and budget .

Comment by paulfchristiano on Towards formalizing universality · 2019-01-15T02:24:20.671Z · score: 4 (2 votes) · LW · GW

Yes, thanks.

Comment by paulfchristiano on Towards formalizing universality · 2019-01-14T03:50:22.981Z · score: 4 (2 votes) · LW · GW
does an algorithm that adds two numbers have a belief about the rules of addition? Does a GIF to JPEG converter have a belief about which image format is "better"?

I'm not assuming any fact of the matter about what beliefs an system has. I'm quantifying over all "reasonable" ways of ascribing beliefs. So the only question is which ascription procedures are reasonable.

I think the most natural definition is to allow an ascription procedure to ascribe arbitrary fixed beliefs. That is, we can say that an addition algorithm has beliefs about the rules of addition, or about what kinds of operations will please God, or about what kinds of triples of numbers are aesthetically appealing, or whatever you like.

Universality requires dominating the beliefs produced by any reasonable ascription procedure, and adding particular arbitrary beliefs doesn't make an ascription procedure harder to dominate (so it doesn't really matter if we allow the procedures in the last paragraph as reasonable). The only thing that makes it hard to dominate C is the fact that C can do actual work that causes its beliefs to be accurate.

their inner workings are not immediately obvious

OK, consider the theorem prover that randomly searches over proofs then?

Comment by paulfchristiano on Towards formalizing universality · 2019-01-14T02:43:32.833Z · score: 7 (3 votes) · LW · GW

C is an arbitrary computation, to be universal the humans must be better informed than *any* simple enough computation C.

Comment by paulfchristiano on Towards formalizing universality · 2019-01-13T23:36:18.510Z · score: 4 (2 votes) · LW · GW

The examples in the post are a chess-playing algorithm, image classification, and (more fleshed out) deduction, physics modeling, and the SDP solver

The deduction case is probably the simplest; our system is manipulating a bunch of explicitly-represented facts according to the normal rules of logic, we ascribe beliefs in the obvious way (i.e. if it deduces X, we say it believes X).

Towards formalizing universality

2019-01-13T20:39:21.726Z · score: 29 (6 votes)

Directions and desiderata for AI alignment

2019-01-13T07:47:13.581Z · score: 29 (6 votes)

Ambitious vs. narrow value learning

2019-01-12T06:18:21.747Z · score: 18 (4 votes)
Comment by paulfchristiano on AlphaGo Zero and capability amplification · 2019-01-12T02:12:29.473Z · score: 8 (4 votes) · LW · GW
I think all the three estimates mentioned there correspond to marginal probabilities (rather than probabilities conditioned on "no governance interventions"). So those estimates already account for scenarios in which governance interventions save the world. Therefore, it seems we should not strongly update against the necessity of governance interventions due to those estimates being optimistic

I normally give ~50% as my probability we'd be fine without any kind of coordination.

Comment by paulfchristiano on What is narrow value learning? · 2019-01-11T00:43:37.579Z · score: 4 (2 votes) · LW · GW
Why use IRL instead of behavioral cloning, where you mimic the actions that the demonstrator took?

IRL also can produce different actions at equilibrium (given finite capacity), it's not merely an inductive bias.

E.g. suppose the human does X half the time and Y half the time, and the agent can predict the details of X but not Y. Behavioral cloning then does X half the time, and half the time does some crazy thing where it's trying to predict Y but can't. IRL will just learn that it can get OK reward by outputting X (since otherwise the human wouldn't do it) and will avoid trying to do things it can't predict.

Comment by paulfchristiano on AlphaGo Zero and capability amplification · 2019-01-09T17:14:59.281Z · score: 4 (2 votes) · LW · GW

I agree there is a real sense in which AGZ is "better-grounded" (and more likely to be stable) than iterated amplification in general. (This was some of the motivation for the experiments here.)

AlphaGo Zero and capability amplification

2019-01-09T00:40:13.391Z · score: 25 (9 votes)

Supervising strong learners by amplifying weak experts

2019-01-06T07:00:58.680Z · score: 28 (7 votes)
Comment by paulfchristiano on Does anti-malaria charity destroy the local anti-malaria industry? · 2019-01-06T01:18:33.754Z · score: 9 (5 votes) · LW · GW

Uncertainty about future aid introduces a cost, and certainly recipients will be better off if aid is predictable.

But if there were no externalities from production, then I think the presence of variable aid still always makes you better off on average than no aid. Worst case, you need to invest in net-producing capacity anyway (in case aid disappears), which you can finance by charging higher prices if free nets disappear.

The main problem with that is that if aid disappears, there will be a wealth transfer from net consumers to net producers. Given risk aversion, that stochastic transfer is bad for everyone. So you'd either want to insure against aid variability, or purchase an option on nets in advance. If you can't do either of those things but nets can be stored, then you can literally manufacture the nets in advance and sell them to people who are concerned that net prices may go up, and that's still a Pareto improvement. If you can't do that either, then you could lose, but realistically I think rational expectations is the weaker link here :)

Comment by paulfchristiano on Does anti-malaria charity destroy the local anti-malaria industry? · 2019-01-05T20:44:26.531Z · score: 16 (5 votes) · LW · GW

I don't know anything about the particular case of net production. I think that the general argument against aid is similar to the typical argument for protectionism, which I think is something like:

  • Local production creates local infrastructure, know-how, human capital, etc.
  • Over the long run this benefits the region much more than it benefits the producers or consumers themselves.
  • So the state has reason to subsidize local production / tax imports.

If you have usual econ 101 models (including rational expectations), then variability itself doesn't cause any trouble, the only problem is from these positive externalities. These externalities could be pretty big, it's plausible to me that they are much larger than the direct benefits to producers and consumers.

Comment by paulfchristiano on Electrons don’t think (or suffer) · 2019-01-02T20:53:14.582Z · score: 16 (8 votes) · LW · GW
Replace "particle" with "collection of particles" and you get roughly the same argument.

Not really. A collection of particles can occupy any of an astronomical number of states, while two "distinct" electrons are demonstrably identical in almost every respect. So an electron really can't have an inner life. It's pretty surprising that physics answers this question as decisively as it does, it's only possible because exact identity has a distinguished role in quantum mechanics (basically, two sequences of events can interfere constructively or destructively only when they lead to exactly identical outcomes, so we can test that swapping the location of two electrons literally doesn't change the universe at all).

That said, I basically agree with habryka that the OP doesn't really address the possible view that simple physical operations are responsible for the vast majority of moral weight (expressed here).

Comment by paulfchristiano on New safety research agenda: scalable agent alignment via reward modeling · 2019-01-02T06:22:56.959Z · score: 21 (6 votes) · LW · GW

Iterated amplification is a very general framework, describing algorithms with two pieces:

  • An amplification procedure that increases an agent's capability. (The main candidates involve decomposing a task into pieces and invoking the agent to solve each piece separately, but there are lots of ways to do that.)
  • A distillation procedure that uses a strong expert to train an efficient agent. (I usually consider semi-supervised RL, as in our paper.)

Given these two pieces, we plug them into each other: the output of distillation becomes the input to amplification, the output of amplification becomes the input to distillation. You kick off the process with something aligned, or else design the amplification step so that it works from some arbitrary initialization.

The hope is that the result is aligned because:

  • Amplification preserves alignment (or benigness, corrigibility, or some similar invariant)
  • Distillation preserves alignment, as long the expert is "smart enough" (relative to the agent they are training)
  • Amplification produces "smart enough" agents.

My research is organized around this structure---thinking about how to fill in the various pieces, about how to analyze training procedures that have this shape, about what the most likely difficulties are. For me, the main appeal of this structure is that it breaks the full problem of training an aligned AI into two subproblems which are both superficially easier (though my expectation is that at least one of amplification or distillation will end up containing almost all of the difficulty).

Recursive reward modeling fits in this framework, though my understanding is that it was arrived at mostly independently. I hope that work on iterated-amplification-in-general will be useful for analyzing recursive reward modeling, and conversely expect that experience with recursive reward learning will be informative about the prospects for iterated-amplification-in-general.

It’s not obvious to me what isn't.

Iterated amplification is intended to describe the kind of training procedure that is most natural using contemporary ML techniques. I think it's quite likely that training strategies will have this form, even if people never read anything I write. (And indeed, AGZ and ExIt were published around the same time.)

Introducing this concept was mostly intended as an analysis tool rather than a flag planting exercise (and indeed I haven't done the kind of work that would be needed to plant a flag). From the prior position of "who knows how we might train aligned AI," iterated amplification really does narrow down the space of possibilities a lot, and I think it has allowed my research to get much more concrete much faster than it otherwise would have.

I think it was probably naive to hope to separate this kind of analysis from flag planting without being much more careful about it; I hope I haven't made it too much more difficult for others to get due credit for working on ideas that happen to fit in this framework.

Debate is an instance of amplification.

Debate isn't prima facie an instance of iterated amplification, i.e. it doesn't fit in the framework I outlined at the start of this comment.

Geoffrey and I both believe that debate is nearly equivalent to iterated amplification, in the sense that probably either they will both work or neither will. So the two approaches suggest very similar research questions. This makes us somewhat more optimistic that those research questions are good ones to work on.

Factored cognition is an instance of amplification

"Factored cognition" refers to mechanisms for decomposing sophisticated reasoning into smaller tasks (quoting the link). Such mechanisms could be used for amplification, though there are other reasons you might study factored cognition.

"amplification" which is a broad framework for training ML systems with a human in the loop

The human isn't really an essential part. I think it's reasonably likely that we will use iterated amplification starting from a simple "core" for corrigible reasoning rather than starting from a human. (Though the resulting systems will presumably interact extensively with humans.)

Comment by paulfchristiano on Three AI Safety Related Ideas · 2018-12-24T20:35:28.361Z · score: 4 (2 votes) · LW · GW
(Also, just in case, is there a difference between "corrigible to" and "corrigible by"?)

No. I was just saying "corrigible by" originally because that seems more grammatical, and sometimes saying "corrigible to" because it seems more natural. Probably "to" is better.

Comment by paulfchristiano on Three AI Safety Related Ideas · 2018-12-24T20:31:21.218Z · score: 6 (3 votes) · LW · GW
this seems closer to 'overseer training the AI to be corrigible to a generic user and then “plugging in” a real user into the system at a later time'.

The overseer asks the question "what should the agent do [to be corrigible to the Google customer Alice it is currently working for]?", and indeed even at training time the overseer is training the system to answer this question. There is no swapping out at test time. (The distributions at train and test time are identical, and I normally talk about the version where you keep training online.)

When the user asks a question to the agent it is being answered by indirection, by using the question-answering system to answer "what should the agent do [in the situation when it has been asked question Q by the user]?"

Comment by paulfchristiano on Corrigibility · 2018-12-20T21:38:56.158Z · score: 11 (3 votes) · LW · GW

If you were building a "treaty AI" tasked with enforcing an agreement between two agents, that AI could not be corrigible by either agent, and this is a big reason that such a treaty AI seem a bit scary. Similarly if I am trying to delegate power to an AI who will honor a treaty by construction.

I often imagine a treaty AI being corrigible by some judiciary (which need not be fast/cheap enough to act as an enforcer), but of course this leaves the question of how to construct that judiciary, and the same questions come up there.

But if corrigibility implies that humans are ultimately in control of resources and therefore can override any binding commitments that an AI may make

I view this as: the problem of making binding agreements is separate from the problem of delegating to an AI. We can split the two up, and ask separately: "can we delegate effectively to an AI?" and "can we use AI to make binding commitments?" The division seems clean: if we can make binding commitments by any mechanism than we can have the committed human delegate to a (likely corrigible) AI rather than having the original human so delegate.

Comment by paulfchristiano on Two Neglected Problems in Human-AI Safety · 2018-12-20T21:32:47.243Z · score: 2 (1 votes) · LW · GW
ETA: Do you think IDA can be used to produce an AI that is corrigible by some kind of idealized human? That might be another approach that's worth pursuing if it looks feasible.


Comment by paulfchristiano on Three AI Safety Related Ideas · 2018-12-20T21:31:47.457Z · score: 5 (2 votes) · LW · GW
This seems like a really important clarification, but in your article on corrigibility, you only ever talk about one human, the overseer, and the whole argument about "basin of attraction" seems to rely on having one human be both the trainer for corrigibility, target of corrigibility, and source of preferences:

Corrigibility plays a role both within amplification and in the final agent.

The post is mostly talking about the final agent without talking about IDA specifically.

The section titled Amplification is about the internal dynamics, where behavior is corrigible by the question-asker. It doesn't seem important to me that these be the same. Corrigibility to the overseer only leads to corrigibility to the end user if the overseer is appropriately motivated. I usually imagine the overseer as something like a Google engineer and the end user as something like a visitor to today. The resulting agent will likely be imperfectly corrigible because of the imperfect motives of Google engineers (this is pretty similar to human relationships around other technologies).

I'm no longer as convinced that corrigibility is the right abstraction for reasoning about internal behavior within amplification (but am still pretty convinced that it's a good way to reason about the external behavior, and I do think "corrigible" is closer to what we want than "benign" was). I've been thinking about these issues recently and it will be touched on in an upcoming post.

Is the overseer now still training the AI to be corrigible to herself, which produces an AI that's aligned to the overseer which then helps out the user because the overseer has a preference to help out the user?

This is basically right. I'm usually imagining the overseer training a general question-answering system, with the AI trained to be corrigible to the question-asker. We then use that question-answering system to implement a corrigible agent, by using it to answer questions like "What should the agent do next?" (with an appropriate specification of 'should'), which is where external corrigibility comes in.

Comment by paulfchristiano on Reasons compute may not drive AI capabilities growth · 2018-12-20T21:21:16.404Z · score: 11 (3 votes) · LW · GW

I think that fp32 -> fp16 should give a >5x boost on a V100, so this 5x improvement still probably hides some inefficiencies when running in fp16.

I suspect the initial 15 - > 6 hour improvement on TPUs was also mostly dealing with low hanging fruit and cleaning up various inefficiencies from porting older code to a TPU / larger batch size / etc.. It seems plausible the last factor of 2 is more of a steady state improvement, I don't know.

My take on this story would be: "Hardware has been changing rapidly, giving large speedups, and people at the same time people have been scaling up to larger batch sizes in order to spend more money. Each time hardware or scale changes, old software is poorly adapted, and it requires some engineering effort to make full use of the new setup." On this reading, these speedups don't provide as much insight into whether future progress will be driven by hardware.

Comment by paulfchristiano on Two Neglected Problems in Human-AI Safety · 2018-12-20T19:19:17.892Z · score: 9 (4 votes) · LW · GW
Can you clarify what kind of people you think should work on this class of problems, and why you personally are not in that group? (Without that explanation, it seems to imply that people like you shouldn't work on this class of problems, which might cover almost everyone who is potentially qualified to work on it. I would also be happy if you just stopped at the comma...)

I'm one of the main people pushing what I regard as the most plausible approach to intent alignment, and have done a lot of thinking about that approach / built up a lot of hard-to-transfer intuition and state. So it seems like I have a strong comparative advantage on that problem.

Comment by paulfchristiano on Reasons compute may not drive AI capabilities growth · 2018-12-20T01:45:40.564Z · score: 2 (1 votes) · LW · GW
Some of this was just using more and better hardware, the winning team used 128 V100 GPUs for 18 minutes and 64 for 19 minutes, versus eight K80 GPUs for the baseline. However, substantial improvements were made even on the same hardware. The training time on a p3.16xlarge AWS instance with eight V100 GPUs went down from 15 hours to 3 hours in 4 months.

Was the original 15 hour time for fp16 training, or fp32?

(A factor of 5 in a few months seems plausible, but before updating on that datapoint it would be good to know if it's just from switching to tensor cores which would be a rather different narrative.)

Comment by paulfchristiano on Three AI Safety Related Ideas · 2018-12-19T23:00:32.179Z · score: 23 (6 votes) · LW · GW
Overall, I'm hoping that we can solve "human safety problems" by training the humans supervising the AI to not have those problems, because it sure does make the technical problem of aligning AI seem a lot easier.

Note that humans play two distinct roles in IDA, and I think it's important to separate them:

1. They are used to train corrigible reasoning, because we don't have a sufficiently good explicit understanding of corrigible reasoning. This role could be removed if e.g. MIRI's work on agent foundations were sufficiently successful.

2. The AI that we've trained is then tasked with the job of helping the user get what they "really" want, which is indirectly encoded in the user.

Solving safety problems for humans in step #1 is necessary to solve intent alignment. This likely involves both training (whether to reduce failure probabilities or to reach appropriate universality thresholds), and using humans in a way that is robust to their remaining safety problems (since it seems clear that most of them cannot be removed).

Solving safety problems for humans in step #2 is something else altogether. At this point you have a bunch of humans in the world who want AIs that are going to help them get what they want, and I don't think it makes that much sense to talk about replacing those humans with highly-trained supervisors---the supervisors might play a role in step #1 as a way of getting an AI that is trying to help the user get what they want, but can't replace the user themselves in step #2 . I think relevant measures at this point are things like:

  • Learn more about how to deliberate "correctly," or about what kinds of pressures corrupt human values, or about how to avoid such corruption, or etc. If more such understanding is available, then both AIs and humans can use them to avoid corruption. In the long run AI systems will do much more work on this problem than we will, but a lot of damage could be done between now and the time when AI systems are powerful enough to obsolete all of the thinking that we do today on this topic.
  • Figure out how to build AIs that are better at tasks like "help humans clarify what they really want." Differential progress in this area could be a huge win. (Again, in the long run all of the AI-design work will itself be done by AI systems, but lots of damage could be dealt in the interim as we deploy human-desinged AIs that are particularly good at manipulation relative to helping humans clarify and realize their "real" values.)
  • Change institutions/policy/environment to reduce the risk of value corruption, especially for users that don't have strong short-term preferences about how their short-term preferences change, or who don't have a clear picture of how their current choices will affect that. For example, the designers of potentially-manipulative technologies may be able to set defaults that make a huge difference in how humanity's values evolve.
  • You could also try give highly wise people more influence over what actually happens, whether by acquiring resources, earning others' trust, or whatever.
Learning from idealized humans might address this to some extent, but in many circumstances I think I would trust the real humans who are actually in those circumstances more than the idealized humans who must reason about those circumstances from afar (in their safe, familiar environment).

This objection may work for some forms of idealization, but I don't think it holds up in general. If you think that experiencing X makes your views better, then your idealization can opt to experience X. The whole point of the idealization is that the idealized humans get to have the set of experiences that they believe are best for arriving at correct views, rather than a set of experiences that are constrained by technological feasibility / competitiveness constraints / etc.

(I agree that there can be some senses in which the idealization itself unavoidably "breaks" the idealized human---e.g. Vladimir Slepnev points out that an idealized human might conclude that they are most likely in a simulation, which may change their behavior; Wei Dai points out that they may behave selfishly towards the idealized human rather than towards the unidealized human, if selfishness is part of the values we'd converge to---but I don't think this is one of them.)

Comment by paulfchristiano on New safety research agenda: scalable agent alignment via reward modeling · 2018-12-19T22:32:11.440Z · score: 3 (2 votes) · LW · GW

Finding the action that optimizes a reward function is -complete for general . If the reward function is itself able to use an oracle for , then that's complete for , and so on. The analogy is loose because you aren't really getting the optimal at each step.

Comment by paulfchristiano on Two Neglected Problems in Human-AI Safety · 2018-12-19T22:24:44.648Z · score: 4 (2 votes) · LW · GW
I'm interested to know why I seem to be the first person to point out or at least publicize this seemingly obvious parallel. (Humans can be seen as a form of machine intelligence made up at least in part of a bunch of ML-like modules and "designed" with little foresight. Why wouldn't we have ML-like safety problems?)

Beyond the fact that humans have inputs on which they behave "badly" (from the perspective of our endorsed idealizations), what is the content of the analogy? I don't think there is too much disagreement about that basic claim (though there is disagreement about the importance/urgency of this problem relative to intent alignment); it's something I've discussed and sometimes work on (mostly because it overlaps with my approach to intent alignment). But it seems like the menu of available solutions, and detailed nature of the problem, is quite different than in the case of ML security vulnerabilities. So for my part that's why I haven't emphasized this parallel.

Tangentially relevant restatement of my views: I agree that there exist inputs on which people behave badly, that deliberating "correctly" is hard (and much harder than manipulating values), that there may be technologies/insights/policies that would improve the chance that we deliberate correctly or ameliorate outside pressures that might corrupt our values / distort deliberation, etc. I think we do have a mild quantitative disagreement about the relative (importance)*(marginal tractability) of various problems. I remain supportive of work in this direction and will probably write about it in more detail some point, but don't think there is much ambiguity about what I should work on.

Comment by paulfchristiano on Two Neglected Problems in Human-AI Safety · 2018-12-19T21:22:36.567Z · score: 5 (2 votes) · LW · GW
in some ways corrigibility is actually opposed to safety

We can talk about "corrigible by X" for arbitrary X. I don't think these considerations imply a tension between corrigibility and safety, they just suggest "humans in the real world" may not be the optimal X. You might prefer to use an appropriate idealization of humans / humans in some safe environment / etc.

Benign model-free RL

2018-12-02T04:10:45.205Z · score: 10 (2 votes)
Comment by paulfchristiano on Iterated Distillation and Amplification · 2018-11-30T22:20:42.980Z · score: 2 (1 votes) · LW · GW

The goal of narrow reinforcement learning is to get something-like-human-level behavior using human-level oversight. Optimizing the human value function over short time horizons seems like a fine approach to me.

The difference with broad reinforcement learning is that you aren't trying to evaluate actions you can't understand by looking at the consequences you can observe.

Comment by paulfchristiano on Iterated Distillation and Amplification · 2018-11-30T20:43:35.456Z · score: 2 (1 votes) · LW · GW

Potentially, it depends on the time horizon and on how the rewards are calculated.

The most natural reward for the state transition (s0, s1) is just V(s1) - V(s0) (where V is some intuitive "human value function," i.e. ask a human "how good does state s seem?"). This reward function wouldn't have that problem.
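A minimal sketch of that reward (the value function and states are hypothetical stand-ins for asking a human "how good does state s seem?"):

```python
def human_value(state):
    # Stand-in for a human's intuitive judgment of how good a state seems.
    return {"s0": 0.2, "s1": 0.7, "s2": 0.4}[state]

def transition_reward(s0, s1):
    # Reward for the transition (s0, s1): the change in judged value.
    # Because these rewards telescope, the return over a whole trajectory
    # is just V(final) - V(initial), so an action that defers payoff to a
    # later state isn't penalized -- which is why this reward function
    # avoids the problem discussed above.
    return human_value(s1) - human_value(s0)

print(round(transition_reward("s0", "s1"), 3))  # 0.5
```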

Comment by paulfchristiano on Iterated Distillation and Amplification · 2018-11-30T20:39:44.103Z · score: 3 (2 votes) · LW · GW

Yes, AGZ uses the same network for policy and value function.

Comment by paulfchristiano on Clarifying "AI Alignment" · 2018-11-28T19:55:06.660Z · score: 2 (1 votes) · LW · GW

I definitely don't mean such a narrow sense of "want my values to evolve." Seems worth using some language to clarify that.

In general the three options seem to be:

  • You care about what is "good" in the realist sense.
  • You care about what the user "actually wants" in some idealized sense.
  • You care about what the user "currently wants" in some narrow sense.

It seems to me that the first two are pretty similar. (And if you are uncertain about whether realism is true, and you'd be in the first case if you accepted realism, it seems like you'd probably be in the second case if you rejected realism. Of course that would depend on the nature of your uncertainty about realism, your views could depend in an arbitrary way on whether realism is true or false depending on what versions of realism/non-realism are competing, but I'm assuming something like the most common realist and non-realist views around here.)

To defend my original usage both in this thread and in the OP, which I'm not that attached to, I do think it would be typical to say that someone made a mistake if they were trying to help me get what I wanted, but failed to notice or communicate some crucial consideration that would totally change my views about what I wanted---the usual English usage of these terms involves at least mild idealization.

Comment by paulfchristiano on Clarifying "AI Alignment" · 2018-11-28T02:00:19.704Z · score: 2 (1 votes) · LW · GW

I agree that:

  • If people care less about some aspect of the future, then trying to get influence over that aspect of the future is more attractive (whether by building technology that they accept as a default, or by making an explicit trade, or whatever).
  • A better understanding of how to prevent value drift can still be helpful if people care a little bit, and can be particularly useful to the people who care a lot (and there will be fewer people working to develop such understanding if few people care).

I think that both

  • (a) Trying to have influence over aspects of value change that people don't much care about, and
  • (b) better understanding the important processes driving changes in values

are reasonable things to do to make the future better. (Though some parts of (a) especially are somewhat zero-sum and I think it's worth being thoughtful about that.)

(I don't agree with the sign of the effect described in your comment, but don't think it's an important point / may just be a disagreement about what else we are holding equal so it seems good to drop.)


2018-11-27T21:50:10.517Z · score: 39 (9 votes)
Comment by paulfchristiano on Clarifying "AI Alignment" · 2018-11-27T18:49:46.951Z · score: 2 (1 votes) · LW · GW

Could you give an example of a statement you think could be wrong on the realist perspective, for which there couldn't be a precisely analogous error on the non-realistic perspective?

There is some uninteresting semantic sense in which there are "more ways to be wrong" (since there is a whole extra category of statements that have truth values...) but not a sense that is relevant to the difficulty of building an AI.

I might be using the word "values" in a different way than you. I think I can say something like "I'd like to deliberate in way X" and be wrong. I guess under non-realism I'm "incorrectly stating my preferences" and under realism I could be "correctly stating my preferences but be wrong," but I don't see how to translate that difference into any situation where I build an AI that is adequate on one perspective but inadequate on the other.

Comment by paulfchristiano on Clarifying "AI Alignment" · 2018-11-27T18:43:24.281Z · score: 2 (1 votes) · LW · GW
I don't understand why you think 'MIRI folks won’t like the “beneficial AI” term because it is too broad a tent' given that someone from MIRI gave a very broad definition to "AI alignment". Do you perhaps think that Arbital article was written by a non-MIRI person?

I don't really know what anyone from MIRI thinks about this issue. It was a guess based on (a) the fact that Rob didn't like a number of possible alternative terms to "alignment" because their definitions seemed too broad, (b) the fact that virtually every MIRI usage of "alignment" refers to a much narrower class of problems than "beneficial AI" is usually taken to refer to, (c) the fact that Eliezer generally seems frustrated with people talking about other problems under the heading of "beneficial AI."

(But (c) might be driven by powerful AI vs. nearer-term concerns / all the other empirical errors Eliezer thinks people are making, (b) isn't that indicative, and (a) might be driven by other cultural baggage associated with the term / Rob was speaking off the cuff and not attempting to speak formally for MIRI.)

I'd consider it great if we standardized on "beneficial AI" to mean "AI that has good consequences" and "AI alignment" to refer to the narrower problem of aligning AI's motivation/preferences/goals.

Comment by paulfchristiano on Clarifying "AI Alignment" · 2018-11-27T18:34:57.209Z · score: 2 (1 votes) · LW · GW
“Do what H wants A to do” would be a moderate degree of alignment whereas "Successfully figuring out and satisfying H's true/normative values" would be a much higher degree of alignment (in that sense of alignment).

In what sense is that a more beneficial goal?

  • "Successfully do X" seems to be the same goal as X, isn't it?
  • "Figure out H's true/normative values" is manifestly a subgoal of "satisfy H's true/normative values." Why would we care about that except as a subgoal?
  • So is the difference entirely between "satisfy H's true/normative values" and "do what H wants"? Do you disagree with one of the previous two bullet points? Is the difference that you think "reliably pursues" implies something about "actually achieves"?

If the difference is mostly between "what H wants" and "what H truly/normatively values", then this is just a communication difficulty. For me adding "truly" or "normatively" to "values" is just emphasis and doesn't change the meaning.

I try to make it clear that I'm using "want" to refer to some hard-to-define idealization rather than some narrow concept, but I can see how "want" might not be a good term for this, I'd be fine using "values" or something along those lines if that would be clearer.

(This is why I wrote:

“What H wants” is even more problematic than “trying.” Clarifying what this expression means, and how to operationalize it in a way that could be used to inform an AI’s behavior, is part of the alignment problem. Without additional clarity on this concept, we will not be able to build an AI that tries to do what H wants it to do.


Comment by paulfchristiano on Clarifying "AI Alignment" · 2018-11-27T01:11:05.352Z · score: 2 (1 votes) · LW · GW
But that definition seems quite different from your "A is trying to do what H wants it to do." For example, if H has a wrong understanding of his/her true or normative values and as a result wants A to do something that is actually harmful, then under your definition A would be still be "aligned" but under MIRI's definition it wouldn't be (because it wouldn't be pursuing beneficial goals).

"Do what H wants me to do" seems to me to be an example of a beneficial goal, so I'd say a system which is trying to do what H wants it to do is pursuing a beneficial goal. It may also be pursuing subgoals which turn out to be harmful, if e.g. it's wrong about what H wants or has other mistaken empirical beliefs. I don't think anyone could be advocating the definition "pursues no harmful subgoals," since that basically requires perfect empirical knowledge (it seems just as hard as never taking a harmful action). Does that seem right to you?

I've been assuming that "reliably pursues beneficial goals" is weaker than the definition I proposed, but practically equivalent as a research goal.

I'm basically worried about this scenario: You or someone else writes something like "I'm cautiously optimistic about Paul's work." The reader recalls seeing you say that you work on "value alignment". They match that to what they've read from MIRI about how aligned AI "reliably pursues beneficial goals", and end up thinking that is easier than you'd intend, or think there is more disagreement between alignment researchers about the difficulty of the broader problem than there actually is. If you could consistently say that the goal of your work is "de dicto value alignment" then that removes most of my worry.

I think it's reasonable for me to be more careful about clarifying what any particular line of research agenda does or does not aim to achieve. I think that in most contexts that is going to require more precision than just saying "AI alignment" regardless of how the term was defined, I normally clarify by saying something like "an AI which is at least trying to help us get what we want."

This actually seems best to me on the merits of the terms alone (i.e., putting historical usage aside), and I'd be fine with it if everyone could coordinate to switch to these terms/definitions.

My guess is that MIRI folks won't like the "beneficial AI" term because it is too broad a tent. (Which is also my objection to the proposed definition of "AI alignment," as "overarching research topic of how to develop sufficiently advanced machine intelligences such that running them produces good outcomes in the real world.") My sense is that if that were their position, then you would also be unhappy with their proposed usage of "AI alignment," since you seem to want a broad tent that makes minimal assumptions about what problems will turn out to be important. Does that seem right?

(They might also dislike "beneficial AI" because of random contingent facts about how it's been used in the past, and so might want a different term with the same meaning.)

My own feeling is that using "beneficial AI" to mean "AI that produces good outcomes in the world" is basically just using "beneficial" in accordance with its usual meaning, and this isn't a case where a special technical term is needed (and indeed it's weird to have a technical term whose definition is precisely captured by a single---different---word).

Comment by paulfchristiano on Humans Consulting HCH · 2018-11-27T00:00:04.767Z · score: 5 (2 votes) · LW · GW

Yes, if the queries aren't well-founded then HCH isn't uniquely defined even once you specify H, there is a class of solutions. If there is a bad solution, I think you need to do work to rule it out and wouldn't count on a method magically finding the answer.

Comment by paulfchristiano on Humans Consulting HCH · 2018-11-26T23:59:36.807Z · score: 7 (3 votes) · LW · GW

Depends on the human. I think 10 levels with branching factor 10 and 1 day per step is in the ballpark of "go from no calculus to general relativity," (at least if we strengthen the model by allowing pointers) but it's hard to know and most people aren't so optimistic.
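For scale, a hedged back-of-the-envelope on that tree (pure arithmetic, no claim about what H can actually do with it):

```python
branching, depth = 10, 10

# Total consultations if every node asks the full 10 sub-questions:
# a geometric series, roughly 10^10 subjective human-days of work.
total_nodes = sum(branching ** d for d in range(depth + 1))
print(total_nodes)

# The root's wall-clock latency is still only depth + 1 days if each
# level's sub-questions are answered in parallel.
print(depth + 1)
```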

Comment by paulfchristiano on Clarifying "AI Alignment" · 2018-11-26T22:24:24.428Z · score: 4 (2 votes) · LW · GW

Do you currently have any objections to using "AI alignment" as the broader term (in line with the MIRI/Arbital definition and examples) and "AI motivation" as the narrower term (as suggested by Rohin)?


  • The vast majority of existing usages of "alignment" should then be replaced by "motivation," which is more specific and usually just as accurate. If you are going to split a term into new terms A and B, and you find that the vast majority of existing usage should be A, then I claim that "A" should be the one that keeps the old word.
  • The word "alignment" was chosen (originally by Stuart Russell, I think) precisely because it is such a good name for the problem of aligning AI values with human values, it's a word that correctly evokes what that problem is about. This is also how MIRI originally introduced the term. (I think they introduced it here, where they said "We call a smarter-than-human system that reliably pursues beneficial goals “aligned with human interests” or simply “aligned.”") Everywhere that anyone talks about alignment they use the analogy with "pointing," and even MIRI folks usually talk about alignment as if it was mostly or entirely about pointing your AI in the right direction.
  • In contrast, "alignment" doesn't really make sense as a name for the entire field of problems about making AI good. For the problem of making AI beneficial we already have the even older term "beneficial AI," which really means exactly that. In explaining why MIRI doesn't like that term, Rob said

Some of the main things I want from a term are:

A. It clearly and consistently keeps the focus on system design and engineering, and whatever technical/conceptual groundwork is needed to succeed at such. I want to make it easy for people (if they want to) to just hash out those technical issues, without feeling any pressure to dive into debates about bad actors and inter-group dynamics, or garden-variety machine ethics and moral philosophy, which carry a lot of derail / suck-the-energy-out-of-the-room risk.

[…] ["AI safety" or "beneficial AI"] doesn't work so well for A -- it's commonly used to include things like misuse risk.

  • [continuing last point] The proposed usage of "alignment" doesn't meet this desideratum though, it has exactly the same problem as "beneficial AI," except that it's historically associated with this community. In particular it absolutely includes "garden-variety machine ethics and moral philosophy." Yes, there is all sorts of stuff that MIRI or I wouldn't care about that is relevant to "beneficial" AI, but under the proposed definition of alignment it's also relevant to "aligned" AI. (This statement by Rob also makes me think that you wouldn't in fact be happy with what he at least means by "alignment," since I take it you explicitly mean to include moral philosophy?)
  • People have introduced a lot of terms and change terms frequently. I've changed the language on my blog multiple times at other people's request. This isn't costless, it really does make things more and more confusing.
  • I think "AI motivation" is not a good term for this area of study: it (a) suggests it's about the study of AI motivation rather than engineering AI to be motivated to help humans, (b) is going to be perceived as aggressively anthropomorphizing (even if "alignment" is only slightly better), (c) is generally less optimized (related to the second point above, "alignment" is quite a good term for this area).
  • Probably "alignment" / "value alignment" would be a better split of terms than "alignment" vs. "motivation". "Value alignment" has traditionally been used with the de re reading, but I could clarify that I'm working on de dicto value alignment when more precision is needed (everything I work on is also relevant on the de re reading, so the other interpretation is also accurate and just less precise).

I guess I have an analogous question for you: do you currently have any objections to using "beneficial AI" as the broader term, and "AI alignment" as the narrower term?

Comment by paulfchristiano on Clarifying "AI Alignment" · 2018-11-26T22:00:41.526Z · score: 2 (1 votes) · LW · GW
I don't understand how this is compatible with only 2% loss from value drift/corruption. Do you perhaps think the actual loss is much bigger, but almost certainly we just can't do anything about it, so 2% is how much you expect we can potentially "save" from value drift/corruption? Or are you taking an anti-realist position and saying something like, if someone doesn't care about averting drift/corruption, then however their values drift that doesn't constitute any loss?

10x worse was originally my estimate for cost-effectiveness, not for total value at risk.

People not caring about X prima facie decreases the returns to research on X. But may increase the returns for advocacy (or acquiring resources/influence, or more creative interventions). That bullet point was really about the returns to research.

Humans Consulting HCH

2018-11-25T23:18:55.247Z · score: 19 (3 votes)

Approval-directed bootstrapping

2018-11-25T23:18:47.542Z · score: 19 (4 votes)
Comment by paulfchristiano on Clarifying "AI Alignment" · 2018-11-25T01:01:28.272Z · score: 2 (1 votes) · LW · GW
It seems easier in that the AI / AI designer doesn't have to worry about the user being wrong about how they want their values to evolve.

That may be a connotation of the "preferences about how their values evolve," but doesn't seem like it follows from the anti-realist position.

I have preferences over what actions my robot takes. Yet if you asked me "what action do you want the robot to take?" I could be mistaken. I need not have access to my own preferences (since they can e.g. depend on empirical facts I don't know). My preferences over value evolution can be similar.

Indeed, if moral realists are right, "ultimately converge to the truth" is a perfectly reasonable preference to have about how my preferences evolve. (Though again this may not be captured by the framing "help people's preferences evolve in the way they want them to evolve.") Perhaps the distinction is that there is some kind of idealization even of the way that preferences evolve, and maybe at that point it's easier to just talk about preservation of idealized preferences (though that also has unfortunate implications and at least some minor technical problems).

I guess talking about "drift" has a similar issue though, in that people might misinterpret it as the speaker wanting values to remain static.

I agree that drift is also problematic.

Comment by paulfchristiano on Approval-directed agents: overview · 2018-11-24T20:13:24.062Z · score: 2 (1 votes) · LW · GW

Ideally you want Hugh to be smarter than the process generating actions to take. (That's the idea of iterated amplification.)

Of course even if the generator is dumb and you search far enough you will still find actions on which Hugh performs poorly. The point of "internal" approval (which is not really relevant for prosaic AGI):

  • Allow Hugh to oversee dumber stuff, so that Hugh is more likely to be smarter than the process he is overseeing.
  • Allow Hugh to oversee "smaller" stuff, so that after many iterations Hugh's inputs can be restricted to a small enough space that we believe Hugh can behave reasonably for inputs in that space. (See security amplification.)

This of course will find an action sequence which will brainwash Hugh into approving.

(It will find a sequence only if you search over sequences and approve or disapprove of the whole thing, if you search over individual actions it will just try to find individual actions that lead Hugh to approve.)

Also note that the sequence of actions won't have a bad consequence when executed, just when described to Hugh.
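A toy sketch of the contrast drawn above (all approval numbers are invented; the "manipulative" sequence is a stand-in for a description that fools the overseer as a whole but whose individual steps look bad):

```python
from itertools import product

ACTIONS = ["a", "b"]

def approval(history, action):
    # Hugh's approval of a single action given what he has seen so far.
    # Step by step, "b" always looks worse than "a".
    return 1.0 if action == "a" else 0.1

def sequence_approval(seq):
    # Approval of a whole described sequence. The toy sequence ("b","b","b")
    # stands in for a description that manipulates Hugh into approving.
    if seq == ("b", "b", "b"):
        return 10.0
    return sum(approval(seq[:i], a) for i, a in enumerate(seq))

# Searching over whole sequences finds the manipulative one...
best_seq = max(product(ACTIONS, repeat=3), key=sequence_approval)

# ...while choosing each action by its individual approval never does.
greedy = []
for _ in range(3):
    greedy.append(max(ACTIONS, key=lambda a: approval(tuple(greedy), a)))

print(best_seq, tuple(greedy))
```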

Comment by paulfchristiano on Approval-directed agents: overview · 2018-11-24T20:07:01.497Z · score: 2 (1 votes) · LW · GW

Hugh is some human, Arthur is a cheap AI. For the obvious example today, compare:

  • Get mechanical turkers to label training images. Train an AI to predict the label they would assign. Use that AI to label images.
  • Use mechanical turkers to label images.

The second one is orders of magnitude more expensive and higher latency.
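The pattern in that example can be sketched as follows (a 1-nearest-neighbor stand-in for the trained predictor, so no ML library is assumed; all data is invented):

```python
def human_label(x):
    # Expensive, high-latency oracle: the mechanical turker stand-in.
    return "cat" if x < 0.5 else "dog"

# Phase 1: pay for a small batch of human labels up front.
train_xs = [0.1, 0.3, 0.6, 0.9]
train_ys = [human_label(x) for x in train_xs]

def predict(x):
    # Phase 2: the cheap "Arthur" -- here just nearest neighbor over
    # the labeled batch, standing in for a trained model.
    i = min(range(len(train_xs)), key=lambda j: abs(train_xs[j] - x))
    return train_ys[i]

# Phase 3: label new inputs at near-zero marginal cost and latency.
print([predict(x) for x in [0.2, 0.8]])  # ['cat', 'dog']
```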

Comment by paulfchristiano on Clarifying "AI Alignment" · 2018-11-24T20:01:42.646Z · score: 2 (1 votes) · LW · GW
To me the x-risk of corrupting human values by well-motivated AI is comparable to the x-risk caused by badly-motivated AI (and both higher than 20% conditional on superhuman AI within 20 subjective years), but I'm not sure how to argue this with you.

If you think this risk is very large, presumably there is some positive argument for why it's so large? That seems like the most natural way to run the argument. I agree it's not clear what exactly the norms of argument here are, but the very basic one seems to be sharing the reason for great concern.

In the case of alignment there are a few lines of argument that we can flesh out pretty far. The basic structure is something like: "(a) if we built AI with our current understanding there is a good chance it would not be trying to do what we wanted or have enough overlap to give the future substantial value, (b) if we built sufficiently competent AI, the future would probably be shaped by its intentions, (c) we have a significant risk of not developing sufficiently better understanding prior to having the capability to build sufficiently competent AI, (d) we have a significant risk of building sufficiently competent AI even if we don't have sufficiently good understanding." (Each of those claims obviously requires more argument, etc.)

One version of the case for worrying about value corruption would be:

  • It seems plausible that the values pursued by humans are very sensitive to changes in their environment.
    • It may be that historical variation is itself problematic, and we care mostly about our particular values.
    • Or it may be that values are "hardened" against certain kinds of environment shift that occur in nature, and that they will go to some lower "default" level of robustness under new kinds of shifts.
    • Or it may be that normal variation is OK for decision-theoretic reasons (since we are the beneficiaries of past shifts) but new kinds of variation are not OK.
  • If so, the rate of change in subjective time could be reasonably high---perhaps the change that occurs within one generation could shift value far enough to reduce value by 50% (if that change wasn't endorsed for decision-theoretic reasons / hardened against).
  • It's plausible, perhaps 50%, that AI will accelerate kinds of change that lead to value drift radically more than it accelerates an understanding of how to prevent such drift.
  • A good understanding of how to prevent value drift might be used / be a major driver of how well we prevent such drift. (Or maybe some other foreseeable institutional characteristics could have a big effect on how much drift occurs.)
  • If so, then it matters a lot how well we understand how to prevent such drift at the time when we develop AI. Perhaps there will be several generations worth of subjective time / drift-driving change before we are able to do enough additional labor to obsolete our current understanding (since AI is accelerating change but not the relevant kind of labor).
  • Our current understanding may not be good, and there may be a realistic prospect of having a much better understanding.

This kind of story is kind of conjunctive, so I'd expect to explore a few lines of argument like this, and then try to figure out what are the most important underlying uncertainties (e.g. steps that appear in most arguments of this form, or a more fundamental underlying cause for concern that generates many different arguments).

My most basic concerns with this story are things like:

  • In "well-controlled" situations, with principals who care about this issue, it feels like we already have an OK understanding of how to avert drift (conditioned on solving alignment). It seems like the basic idea is to decouple evolving values from the events in the world that are actually driving competitiveness / interacting with the natural world / realizing people's consumption / etc., which is directly facilitated by alignment. The extreme form of this is having some human in a box somewhere (or maybe in cold storage) who will reflect and grow on their own schedule, and who will ultimately assume control of their resources once reaching maturity. We've talked a little bit about this, and you've pointed out some reasons this kind of scheme isn't totally satisfactory even if it works as intended, but quantitatively the reasons you've pointed to don't seem to be probable enough (per economic doubling, say) to make the cost-benefit analysis work out.
  • In most practical situations, it doesn't seem like "understanding of how to avert drift" is the key bottleneck to averting drift---it seems like the basic problem is that most people just don't care about averting drift at all, or have any inclination to be thoughtful about how their own preferences evolve. That's still something you can intervene on, but it feels like a huge morass where you are competing with many other forces.

In the end I'm doing a pretty rough calculation that depends on a whole bunch of stuff, but those feel like they are maybe the most likely differences in view / places where I have something to say. Overall I still think this problem is relatively important, but that's how I get to the intuitive view that it's maybe ~10x lower impact. I would grant the existence of (plenty of) people for whom it's higher impact though.

As a rule of thumb, "if one x-risk seems X times bigger than another, it should have about X times as many people working on it" is intuitively appealing to me, and suggests we should have at least 2 people working on "value corruption" even if you think that risk is 10x smaller, but I'm not sure if that makes sense to you.

I think that seems roughly right, probably modulated by some O(1) factor reflecting tractability or other factors not captured in the total quantity of risk---maybe I'd expect us to have 2-10x more resources per unit risk devoted to more tractable risks.

In this case I'd be happy with the recommendation of ~10x more people working on motivation than on value drift, that feels like the right ballpark for basically the same reason that motivation feels ~10x more impactful.

I'm just hoping that you won't (intentionally or unintentionally) discourage people from working on "value corruption" so strongly that they don't even consider looking into that problem and forming their own conclusions based on their own intuitions/priors. [...] I don't want people to be excessively discouraged from working on the latter by statements like "motivation contains the most urgent part".

I agree I should be more careful about this.

I do think that motivation contains the most urgent/important part and feel pretty comfortable expressing that view (for the same reasons I'm generally inclined to express my views), but could hedge more when making statements like this.

(I think saying "X is more urgent than Y" is basically compatible with the view "There should be 10 people working on X for each person working on Y," even if one also believes "but actually on the current margin investment in Y might be a better deal." Will edit the post to be a bit softer here though.

ETA: actually I think the language in the post basically reflects what I meant, the broader definition seems worse because it contains tons of stuff that is lower priority. The narrower definition doesn't contain every problem that is high priority, it just contains a single high priority problem, which is better than a really broad basket containing a mix of important and not-that-important stuff. But I will likely write a separate post or two at some point about value drift and other important problems other than motivation.)

Comment by paulfchristiano on Clarifying "AI Alignment" · 2018-11-24T19:25:05.760Z · score: 2 (1 votes) · LW · GW
Assuming you agree that we can't be certain about which metaethical position is correct yet, I think by implicitly adopting a subjectivist/anti-realist framing, you make the problem seem easier than we should expect it to be.

I don't see why the anti-realist version is any easier, my preferences about how my values evolve are complex and can depend on the endpoint of that evolution process and on arbitrarily complex logical facts. I think the analogous non-realistic mathematical framing is fine. If anything the realist versions seem easier to me (and this is related to why mathematics seems so much easier than morality), since you can anchor changing preferences to some underlying ground truth and have more potential prospect for error-correction, but I don't think it's a big difference.

Additionally, this framing also makes the potential consequences of failing to solve the problem sound less serious than it could potentially be. I.e., if there is such a thing as someone's true or normative values, then failing to optimize the universe for those values is really bad, but if they just have preferences about how their values evolve, then even if their values fail to evolve in that way, at least whatever values the universe ends up being optimized for are still their values, so not all is lost.

It doesn't sound that way to me, but I'm happy to avoid framings that might give people the wrong idea.

I think I would prefer to frame the problem as "How can we design/use AI to prevent the corruption of human values, especially corruption caused/exacerbated by the development of AI?"

My main complaint with this framing (and the reason that I don't use it) is that people respond badly to invoking the concept of "corruption" here---it's a fuzzy category that we don't understand, and people seem to interpret it as the speaker wanting values to remain static.

But in terms of the actual meanings rather than their impacts on people, I'd be about as happy with "avoiding corruption of values" as "having our values evolve in a positive way." I think both of them have small shortcomings as framings. My main problem with corruption is that it suggests an unrealistically bright line / downplays our uncertainty about how to think about changing values and what constitutes corruption.

Approval-directed agents: details

2018-11-23T23:26:08.681Z · score: 19 (4 votes)
Comment by paulfchristiano on Clarifying "AI Alignment" · 2018-11-23T20:52:55.023Z · score: 4 (2 votes) · LW · GW
Again my feeling is that MIRI started using alignment in the broader sense first and therefore that definition ought to have priority. If you disagree with this, I could try to do some more historical research to show this. (For example by figuring out when those Arbital articles were written, which I currently don't know how to do.)

I think MIRI's first use of this term was here where they said “We call a smarter-than-human system that reliably pursues beneficial goals `aligned with human interests' or simply `aligned.' ” which is basically the same as my definition. (Perhaps slightly weaker, since "do what the user wants you to do" is just one beneficial goal.) This talk never defines alignment, but the slide introducing the big picture says "Take-home message: We’re afraid it’s going to be technically difficult to point AIs in an intuitively intended direction" which also really suggests it's about trying to point your AI in the right direction.

The actual discussion on that Arbital page strongly suggests that alignment is about pointing an AI in a direction, though I suppose that may merely be an instance of suggestively naming the field "alignment" and then defining it to be "whatever is important" as a way of smuggling in the connotation that pointing your AI in the right direction is the important thing. All of the topics in the "AI alignment" domain (except for mindcrime, which is borderline) fit under the narrower definition; the list of alignment researchers is all people working on the narrower problem.

So I think the way this term is used in practice basically matches this narrower definition.

As I mentioned, I was previously happily using the term "AI control." Rob Bensinger suggested that I stop using that term and instead use AI alignment, proposing a definition of alignment that seemed fine to me.

I don't think the very broad definition is what almost anyone has in mind when they talk about alignment. It doesn't seem to be matching up with reality in any particular way, except insofar as it captures the problems that a certain group of people work on. I don't really see any argument in favor except the historical precedent, which I think is dubious in light of all of the conflicting definitions, the actual usage, and the explicit move to standardize on "alignment" where an alternative definition was proposed.

(In the discussion, the compromise definition suggested was "cope with the fact that the AI is not trying to do what we want it to do, either by aligning incentives or by mitigating the effects of misalignment.")

The "alignment problem for advanced agents" or "AI alignment" is the overarching research topic of how to develop sufficiently advanced machine intelligences such that running them produces good outcomes in the real world.

Is this intended (/ do you understand this) to include things like "make your AI better at predicting the world," since we expect that agents who can make better predictions will achieve better outcomes?

If this isn't included, is that because "sufficiently advanced" includes making good predictions? Or because of the empirical view that ability to predict the world isn't an important input into producing good outcomes? Or something else?

If this definition doesn't distinguish alignment from capabilities, then that seems like a non-starter to me which is neither useful nor captures the typical usage.

If this excludes making better prediction because that's assumed by "sufficiently advanced agent," then I have all sorts of other questions (does "sufficiently advanced" include all particular empirical knowledge relevant to making the world better? does it include some arbitrary category not explicitly carved out in the definition?)

In general, the alternative broader usage of AI alignment is broad enough to capture lots of problems that would exist whether or not we built AI. That's not so different from using the term to capture (say) physics problems that would exist whether or not we built AI; both feel bad to me.

Independently of this issue, it seems like "the kinds of problems you are talking about in this thread" need better descriptions whether or not they are part of alignment (since even if they are part of alignment, they will certainly involve totally different techniques/skills/impact evaluations/outcomes/etc.).

Approval-directed agents: overview

2018-11-22T21:15:28.956Z · score: 22 (4 votes)
Comment by paulfchristiano on Clarifying "AI Alignment" · 2018-11-22T20:26:58.946Z · score: 18 (3 votes) · LW · GW
How to prevent "aligned" AIs from unintentionally corrupting human values? We know that ML systems tend to have problems with adversarial examples and distributional shifts in general. There seems to be no reason not to expect that human value functions have similar problems, which even "aligned" AIs could trigger unless they are somehow designed not to. For example, such AIs could give humans so much power so quickly or put them in such novel situations that their moral development can't keep up, so their value systems no longer give sensible answers. (Sort of the AI assisted version of the classic "power corrupts" problem.) AIs could give us new options that are irresistible to some parts of our motivational systems, like more powerful versions of video game and social media addiction. Even in the course of trying to figure out how the world could be made better for us, they could in effect be searching for adversarial examples on our value functions. Finally, at our own request or in a sincere attempt to help us, they could generate philosophical or moral arguments that are wrong but extremely persuasive.

My position on this (that might be clear from previous discussions):

  • I agree this is a real problem.
  • From a technical perspective, I think this is even further from the alignment problem (than other AI safety problems), so I definitely think it should be studied separately and deserves a separate name. (Though the last bullet point in this comment implicitly gives an argument in the other direction.)
  • I'd normally frame this problem as "society's values will evolve over time, and we have preferences about how they evolve." New technology might change things in ways we don't endorse. Natural pressures like death may lead to changes we don't endorse (though that's a tricky values call). The constraint of remaining economically/militarily competitive could also force our values to evolve in a bad way (alignment is an instance of that problem, and eventually AI+alignment would address the other natural instance by decoupling human values from the competence needed to remain competitive). And of course there is a hard problem in that we don't know how to deliberate/reflect. The "figure out how to deliberate" problem seems like it is relatively easily postponed, since you don't have to solve it until you are doing deliberation, but the "help people avoid errors in deliberation" may be more urgent.
  • The reason I consider alignment more urgent is entirely quantitative and very empirically contingent, I don't think there is any simple argument against. I think there is a >1/3 chance that AI will be solidly superhuman within 20 subjective years, and that in those scenarios alignment destroys maybe 20% of the total value of the future, leading to 0.3%/year of losses from alignment, and right now it looks reasonably tractable. Influencing the trajectory of society's values in other ways seems significantly worse than that to me (maybe 10x less cost-effective?). I think it would be useful to do some back-of-the-envelope calculations for the severity of value drift and the case for working on it.
  • I don't think I'm likely to work on this problem unless I become much more pessimistic about working on alignment (e.g. because the problem is much harder or easier than I currently believe). I feel like I've already poked at it enough that the VOI from more poking is lower than just charging ahead on alignment. But that is a stronger judgment than the last bullet, and I think is largely due to comparative advantage considerations, and I would certainly be supportive of work on this topic (e.g. would be happy to fund, would engage with it, etc.)
  • This is a leading contender for what I would do if alignment seemed unappealing, though I think that broader institutional improvement / capability enhancement / etc. seems more appealing. I'd definitely spend more time thinking about it.
  • I think that important versions of these problems really do exist with or without AI, although I agree that AI will accelerate the point at which they become critical while it's not obvious whether it will accelerate solutions. I don't think this is particularly important but does make me feel even more comfortable with the naming issue---this isn't really a problem about AI at all, it's just one of many issues that is modulated by AI.
  • I think the main way AI is relevant to the cost-effectiveness analysis of shaping-the-evolution-of-values is that it may decrease the amount of work that can be done on these problems between now and when they become serious (if AI is effectively accelerating the timeline for catastrophic value change without accelerating work on making values evolve in a way we'd endorse).
  • To the extent that the value of working on these problems is dominated by that scenario---"AI has a large comparative disadvantage at helping us solve philosophical problems / thinking about long-term trajectory / etc."---then I think that one of the most promising interventions on this problem is improving the relative capability of AI at problems of this form. My current view is that working on factored cognition (and similarly on debate) is a reasonable approach to that. This isn't a super important consideration, but it overall makes me (a) a bit more excited about factored cognition (especially in worlds where the broader iterated amplification program breaks down), (b) a bit less concerned about figuring out whether relative capabilities is more or less important than alignment.
  • I would like to have clearer ways of talking and thinking about these problems, but (a) I think the next step is probably developing a better understanding (or, if someone has a much better understanding, then a development of a better shared understanding), (b) I really want a word other than "alignment," and probably multiple words. I guess the one that feels most urgently-unnamed right now is something like: understanding how values evolve and what forces may push that evolution in a way we don't endorse, including social dynamics, environmental factors, the need to remain competitive, and the dynamics of deliberation and argumentation.
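The cost-effectiveness bullet above is just an expected-value product; a minimal sketch of the arithmetic, using the rough figures stated there (a >1/3 probability, a ~20% value loss, a 20-year horizon — explicitly loose estimates, not precise claims):

```python
# Back-of-the-envelope annualized cost of misalignment,
# using the rough numbers from the bullet above.
p_superhuman = 1 / 3   # P(solidly superhuman AI within 20 subjective years)
value_lost = 0.20      # fraction of the future's value destroyed in that case
horizon_years = 20     # subjective years over which the risk is spread

annual_loss = p_superhuman * value_lost / horizon_years
print(f"{annual_loss:.2%} per year")  # ~0.33%/year, i.e. the "0.3%/year" figure

# The comparison claim: other trajectory-shaping work is maybe 10x less
# cost-effective, which under these numbers would be about 0.03%/year.
print(f"{annual_loss / 10:.3%} per year")
```

This just makes the implicit multiplication explicit; the conclusion is only as good as the three input estimates.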

Prosaic AI alignment

2018-11-20T13:56:39.773Z · score: 36 (9 votes)
Comment by paulfchristiano on Clarifying "AI Alignment" · 2018-11-18T22:28:00.085Z · score: 6 (3 votes) · LW · GW
for me another major con is that this decomposition moves some problems that I think are crucial and urgent out of "AI alignment" and into the "competence" part, with the implicit or explicit implication that they are not as important, for example the problem of obtaining or helping humans to obtain a better understanding of their values and defending their values against manipulation from other AIs.

I think it's bad to use a definitional move to try to implicitly prioritize or deprioritize research. I think I shouldn't have written: "I like it less because it includes many subproblems that I think (a) are much less urgent, (b) are likely to involve totally different techniques than the urgent part of alignment."

That said, I do think it's important that these seem like conceptually different problems and that different people can have different views about their relative importance---I really want to discuss them separately, try to solve them separately, compare their relative values (and separate that from attempts to work on either).

I don't think it's obvious that alignment is higher priority than these problems, or than other aspects of safety. I mostly think it's a useful category to be able to talk about separately. In general I think that it's good to be able to separate conceptually separate categories, and I care about that particularly much in this case because I care particularly much about this problem. But I also grant that the term has inertia behind it and so choosing its definition is a bit loaded and so someone could object on those grounds even if they bought that it was a useful separation.

(I think that "defending their values against manipulation from other AIs" wasn't included under any of the definitions of "alignment" proposed by Rob in our email discussion about possible definitions, so it doesn't seem totally correct to refer to this as "moving" those subproblems, so much as there already being a mess of imprecise definitions, some of which included and some of which excluded those subproblems.)

Comment by paulfchristiano on Clarifying "AI Alignment" · 2018-11-18T22:10:06.424Z · score: 14 (4 votes) · LW · GW
Aside from the practical issue of confusion between different usages of "AI alignment" (I think others like MIRI had been using "AI alignment" in a broader sense before Paul came up with his narrower definition)

I switched to this usage of AI alignment in 2017, after an email thread involving many MIRI people where Rob suggested using "AI alignment" to refer to what Bostrom calls the "second principal-agent problem" (he objected to my use of "control"). I think I misunderstood what Rob intended in that discussion, but my definition is meant to be in line with that---if the agent is trying to do what the principal wants, it seems like you've solved the principal-agent problem. I think the main way this definition is narrower than what was discussed in that email thread is by excluding things like boxing.

In practice, essentially all of MIRI's work seems to fit within this narrower definition, so I'm not too concerned at the moment with this practical issue (I don't know of any work MIRI feels strongly about that doesn't fit in this definition). We had a thread about this after it came up on LW in April, where we kind of decided to stick with something like "either make the AI trying to do the right thing, or somehow cope with the problems introduced by it trying to do the wrong thing" (so including things like boxing), but to mostly not worry too much since in practice basically the same problems fall under both categories.

I should have updated this post before it got rerun as part of the sequence.

An unaligned benchmark

2018-11-17T15:51:03.448Z · score: 27 (6 votes)

Clarifying "AI Alignment"

2018-11-15T14:41:57.599Z · score: 53 (15 votes)

The Steering Problem

2018-11-13T17:14:56.557Z · score: 37 (9 votes)

Preface to the sequence on iterated amplification

2018-11-10T13:24:13.200Z · score: 39 (14 votes)

The easy goal inference problem is still hard

2018-11-03T14:41:55.464Z · score: 38 (9 votes)

Could we send a message to the distant future?

2018-06-09T04:27:00.544Z · score: 40 (14 votes)

When is unaligned AI morally valuable?

2018-05-25T01:57:55.579Z · score: 96 (28 votes)

Open question: are minimal circuits daemon-free?

2018-05-05T22:40:20.509Z · score: 109 (34 votes)

Weird question: could we see distant aliens?

2018-04-20T06:40:18.022Z · score: 85 (25 votes)

Implicit extortion

2018-04-13T16:33:21.503Z · score: 74 (22 votes)

Prize for probable problems

2018-03-08T16:58:11.536Z · score: 135 (37 votes)

Argument, intuition, and recursion

2018-03-05T01:37:36.120Z · score: 99 (29 votes)

Funding for AI alignment research

2018-03-03T21:52:50.715Z · score: 108 (29 votes)

The abruptness of nuclear weapons

2018-02-25T17:40:35.656Z · score: 95 (35 votes)

Arguments about fast takeoff

2018-02-25T04:53:36.083Z · score: 99 (31 votes)

Crowdsourcing moderation without sacrificing quality

2016-12-02T21:47:57.719Z · score: 15 (11 votes)

Optimizing the news feed

2016-12-01T23:23:55.403Z · score: 9 (10 votes)

Recent AI control posts

2016-11-29T18:53:57.656Z · score: 12 (13 votes)

If we can't lie to others, we will lie to ourselves

2016-11-26T22:29:54.990Z · score: 16 (17 votes)

Less costly signaling

2016-11-22T21:11:06.028Z · score: 14 (16 votes)

What is up with carbon dioxide and cognition? An offer

2016-04-23T17:47:43.494Z · score: 37 (30 votes)

My research priorities for AI control

2015-12-06T01:57:12.058Z · score: 17 (18 votes)

Experimental EA funding [crosspost]

2015-03-15T19:48:39.371Z · score: 12 (13 votes)

Recent AI safety work

2014-12-30T18:19:09.211Z · score: 20 (23 votes)

Approval-directed agents

2014-12-12T22:38:37.402Z · score: 9 (14 votes)

Changes to my workflow

2014-08-26T17:29:42.661Z · score: 28 (29 votes)

Seeking paid help for SPARC logistics

2013-07-02T16:38:45.766Z · score: 9 (10 votes)

Estimates vs. head-to-head comparisons

2013-05-04T06:45:37.520Z · score: 12 (19 votes)

Induction; or, the rules and etiquette of reference class tennis

2013-03-03T23:27:56.280Z · score: 6 (11 votes)

Risk-aversion and investment (for altruists)

2013-02-28T03:44:13.645Z · score: 7 (12 votes)

Why might the future be good?

2013-02-27T07:22:09.782Z · score: 6 (13 votes)

Link: blog on effective altruism

2013-02-08T06:18:39.018Z · score: 12 (15 votes)

My workflow

2012-12-09T21:16:42.564Z · score: 43 (43 votes)

Formalizing Value Extrapolation

2012-04-26T00:51:04.077Z · score: 14 (24 votes)

AIXI and Existential Despair

2011-12-08T20:03:34.203Z · score: 16 (23 votes)

The Absolute Self-Selection Assumption

2011-04-11T15:25:56.262Z · score: 28 (34 votes)

The Value of Theoretical Research

2011-02-25T18:06:26.100Z · score: 37 (36 votes)