# Clarifying "AI Alignment"

post by paulfchristiano · 2018-11-15T14:41:57.599Z · LW · GW · 82 comments

## Contents

Analogy
Clarifications
Postscript on terminological history


When I say an AI A is aligned with an operator H, I mean:

A is trying to do what H wants it to do.

The “alignment problem” is the problem of building powerful AI systems that are aligned with their operators.

This is significantly narrower than some other definitions of the alignment problem, so it seems important to clarify what I mean.

In particular, this is the problem of getting your AI to try to do the right thing, not the problem of figuring out which thing is right. An aligned AI would try to figure out which thing is right, and like a human it may or may not succeed.

## Analogy

Consider a human assistant who is trying their hardest to do what H wants.

I’d say this assistant is aligned with H. If we build an AI that has an analogous relationship to H, then I’d say we’ve solved the alignment problem.

“Aligned” doesn’t mean “perfect”:

• They could misunderstand an instruction, or be wrong about what H wants at a particular moment in time.
• They may not know everything about the world, and so fail to recognize that an action has a particular bad side effect.
• They may not know everything about H’s preferences, and so fail to recognize that a particular side effect is bad.
• They may build an unaligned AI (while attempting to build an aligned AI).

I use alignment as a statement about the motives of the assistant, not about their knowledge or ability. Improving their knowledge or ability will make them a better assistant — for example, an assistant who knows everything there is to know about H is less likely to be mistaken about what H wants — but it won’t make them more aligned.

(For very low capabilities it becomes hard to talk about alignment. For example, if the assistant can’t recognize or communicate with H, it may not be meaningful to ask whether they are aligned with H.)

## Clarifications

• The definition is intended de dicto rather than de re. An aligned A is trying to “do what H wants it to do.” Suppose A thinks that H likes apples, and so goes to the store to buy some apples, but H really prefers oranges. I’d call this behavior aligned because A is trying to do what H wants, even though the thing it is trying to do (“buy apples”) turns out not to be what H wants: the de re interpretation is false but the de dicto interpretation is true.
• An aligned AI can make errors, including moral or psychological errors, and fixing those errors isn’t part of my definition of alignment except insofar as it’s part of getting the AI to “try to do what H wants” de dicto. This is a critical difference between my definition and some other common definitions. I think that using a broader definition (or the de re reading) would also be defensible, but I like it less because it includes many subproblems that I think (a) are much less urgent, (b) are likely to involve totally different techniques than the urgent part of alignment.
• An aligned AI would also be trying to do what H wants with respect to clarifying H’s preferences. For example, it should decide whether to ask if H prefers apples or oranges, based on its best guesses about how important the decision is to H, how confident it is in its current guess, how annoying it would be to ask, etc. Of course, it may also make a mistake at the meta level — for example, it may not understand when it is OK to interrupt H, and therefore avoid asking questions that it would have been better to ask.
• This definition of “alignment” is extremely imprecise. I expect it to correspond to some more precise concept that cleaves reality at the joints. But that might not become clear, one way or the other, until we’ve made significant progress.
• One reason the definition is imprecise is that it’s unclear how to apply the concepts of “intention,” “incentive,” or “motive” to an AI system. One naive approach would be to equate the incentives of an ML system with the objective it was optimized for, but this seems to be a mistake. For example, humans are optimized for reproductive fitness, but it is wrong to say that a human is incentivized to maximize reproductive fitness.
• “What H wants” is even more problematic than “trying.” Clarifying what this expression means, and how to operationalize it in a way that could be used to inform an AI’s behavior, is part of the alignment problem. Without additional clarity on this concept, we will not be able to build an AI that tries to do what H wants it to do.
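The bullet above about clarifying H’s preferences names three inputs to the decision of whether to ask: how important the decision is to H, how confident the AI is in its current guess, and how annoying it would be to ask. One way to picture how they trade off is the minimal sketch below. It is only an illustration and not part of the definition: the function name, the numeric scales, and the assumption that asking fully resolves the uncertainty are all hypothetical.

```python
def should_ask(importance: float, confidence: float, cost_of_asking: float) -> bool:
    """Decide whether to ask H before acting (illustrative only).

    importance: how much H loses if the AI acts on a wrong guess (hypothetical units)
    confidence: the AI's probability that its current guess is right, in [0, 1]
    cost_of_asking: how annoying the interruption is, in the same units as importance
    """
    # Expected loss from acting on the current guess without asking.
    expected_loss_if_guessing = importance * (1.0 - confidence)
    # Assumption: asking removes the uncertainty, so its only cost is the annoyance.
    return expected_loss_if_guessing > cost_of_asking


# High stakes and a shaky guess: worth interrupting.
print(should_ask(importance=10.0, confidence=0.6, cost_of_asking=1.0))  # True
# Low stakes and a confident guess: just buy the apples.
print(should_ask(importance=1.0, confidence=0.9, cost_of_asking=1.0))   # False
```

An aligned assistant could still get any of these three estimates wrong, which is the meta-level mistake described in the same bullet.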

## Postscript on terminological history

I originally described this problem as part of “the AI control problem,” following Nick Bostrom’s usage in Superintelligence, and used “the alignment problem” to mean “understanding how to build AI systems that share human preferences/values” (which would include efforts to clarify human preferences/values).

I adopted the new terminology after some people expressed concern with “the control problem.” There is also a slight difference in meaning: the control problem is about coping with the possibility that an AI would have different preferences from its operator. Alignment is a particular approach to that problem, namely avoiding the preference divergence altogether (so excluding techniques like “put the AI in a really secure box so it can’t cause any trouble”). There currently seems to be a tentative consensus in favor of this approach to the control problem.

I don’t have a strong view about whether “alignment” should refer to this problem or to something different. I do think that some term needs to refer to this problem, to separate it from other problems like “understanding what humans want,” “solving philosophy,” etc.

This post was originally published here on 7th April 2018.

The next post in this sequence will be posted on Saturday, and will be "An Unaligned Benchmark" by Paul Christiano.

Tomorrow's AI Alignment Sequences post will be the first in a short new sequence of technical exercises from Scott Garrabrant.

comment by rohinmshah · 2018-11-16T21:20:09.356Z · LW(p) · GW(p)

Ultimately, our goal is to build AI systems that do what we want them to do. One way of decomposing this is first to define the behavior that we want from an AI system, and then to figure out how to obtain that behavior, which we might call the definition-optimization decomposition. Ambitious value learning aims to solve the definition subproblem. I interpret this post as proposing a different decomposition of the overall problem. One subproblem is how to build an AI system that is trying to do what we want, and the second subproblem is how to make the AI competent enough that it actually does what we want. I like this motivation-competence decomposition for a few reasons:

• It isolates the major, urgent difficulty in a single subproblem. If we make an AI system that tries to do what we want, it could certainly make mistakes, but it seems much less likely to cause e.g. human extinction. (Though it is certainly possible, for example by building an unaligned successor AI system, as mentioned in the post.) In contrast, with the definition-optimization decomposition, we need to solve both specification problems with the definition and robustness problems with the optimization.
• Humans seem to solve the motivation subproblem, whereas humans don't seem to solve either the definition or the optimization subproblems. I can definitely imagine a human legitimately trying to help me, whereas I can't really imagine a human knowing how to derive optimal behavior for my goals, nor can I imagine a human that can actually perform the optimal behavior to achieve some arbitrary goal.
• It is easier to apply to systems without much capability, though as the post notes, it probably still does need to have some level of capability. While a digit recognition system is useful, it doesn't seem meaningful to talk about whether it is "trying" to help us.
• Relatedly, the safety guarantees seem to degrade more slowly and smoothly. With definition-optimization, if you get the definition even slightly wrong, Goodhart's Law suggests that you can get very bad outcomes. With motivation-competence, I've already argued that incompetence probably leads to small problems, not big ones, and slightly worse motivation might not make a huge difference because of something analogous to the basin of attraction around corrigibility. This depends a lot on what "slightly worse" means for motivation, but I'm optimistic.
• We've been working with the definition-optimization decomposition for quite some time now by modeling AI systems as expected utility maximizers, and we've found a lot of negative results and not very many positive ones.
• The motivation-competence decomposition accommodates interaction between the AI system and humans, which definition-optimization does not allow (or at least, it makes it awkward to include such interaction).

The cons are:

• It is imprecise and informal, whereas we can use the formalism of expected utility maximizers for the definition-optimization decomposition.
• There hasn't been much work done in this paradigm, so it is not obvious that there is progress to make.
• I suspect many researchers would argue that any sufficiently intelligent system will be well-modeled as an expected utility maximizer and will have goals and preferences it is optimizing for, and as a result we need to deal with the problems of expected utility maximizers anyway. Personally, I do not find this argument compelling, and hope to write about why in the near future. ETA: Written up in the chapter on Goals vs Utility Functions in the Value Learning sequence, particularly in Coherence arguments do not imply goal-directed behavior.

comment by habryka (habryka4) · 2018-11-16T22:40:12.853Z · LW(p) · GW(p)

This is a great comment, and maybe it should even be its own post. It clarified a bunch of things for me, and I think was the best concise argument for "we should try to build something that doesn't look like an expected utility maximizer" that I've read so far.

comment by rohinmshah · 2018-11-17T00:57:39.096Z · LW(p) · GW(p)

Thanks! The hope is to write something a bit more comprehensive that expands on many of these points, which would be its own post (or sequence).

comment by Wei_Dai · 2018-11-17T02:37:18.455Z · LW(p) · GW(p)

I agree with habryka that this is a really good explanation. I also agree with most of your pros and cons, but for me another major con is that this decomposition moves some problems that I think are crucial and urgent out of "AI alignment" and into the "competence" part, with the implicit or explicit implication that they are not as important, for example the problem of obtaining or helping humans to obtain a better understanding of their values and defending their values against manipulation from other AIs.

In other words, the motivation-competence decomposition seems potentially very useful to me as a way to break down a larger problem into smaller parts so it can be solved more easily, but I don't agree that the urgent/not-urgent divide lines up neatly with the motivation/competence divide.

Aside from the practical issue of confusion between different usages of "AI alignment" (I think others like MIRI had been using "AI alignment" in a broader sense before Paul came up with his narrower definition), even using "AI alignment" in a context where it's clear that I'm using Paul's definition gives me the feeling that I'm implicitly agreeing to his understanding of how various subproblems should be prioritized.

comment by paulfchristiano · 2018-11-18T22:10:06.424Z · LW(p) · GW(p)

> Aside from the practical issue of confusion between different usages of "AI alignment" (I think others like MIRI had been using "AI alignment" in a broader sense before Paul came up with his narrower definition)

I switched to this usage of AI alignment in 2017, after an email thread involving many MIRI people where Rob suggested using "AI alignment" to refer to what Bostrom calls the "second principal-agent problem" (he objected to my use of "control"). I think I misunderstood what Rob intended in that discussion, but my definition is meant to be in line with that: if the agent is trying to do what the principal wants, it seems like you've solved the principal-agent problem. I think the main way this definition is narrower than what was discussed in that email thread is by excluding things like boxing.

In practice, essentially all of MIRI's work seems to fit within this narrower definition, so I'm not too concerned at the moment with this practical issue (I don't know of any work MIRI feels strongly about that doesn't fit in this definition). We had a thread about this after it came up on LW in April, where we kind of decided to stick with something like "either make the AI try to do the right thing, or somehow cope with the problems introduced by it trying to do the wrong thing" (so including things like boxing), but to mostly not worry too much, since in practice basically the same problems fall under both definitions.

I should have updated this post before it got rerun as part of the sequence.

comment by Wei_Dai · 2018-11-23T04:16:55.629Z · LW(p) · GW(p)

Note that Arbital defines "AI alignment" as:

The "alignment problem for advanced agents" or "AI alignment" is the overarching research topic of how to develop sufficiently advanced machine intelligences such that running them produces good outcomes in the real world.

and "total alignment" as:

An advanced agent can be said to be "totally aligned" when it can assess the exact value of well-described outcomes and hence the exact subjective value of actions, policies, and plans; where value has its overridden meaning of a metasyntactic variable standing in for "whatever we really do or really should value in the world or want from an Artificial Intelligence" (this is the same as "normative" if the speaker believes in normativity).

I think this clearly includes the kinds of problems I'm talking about in this thread. Do you agree? Also supporting my view is the history of "Friendliness" being a term that included the problem of better understanding the user's values (as in CEV) and then MIRI giving up that term in favor of "alignment" as an apparently exact synonym. See this MIRI post which talks about the "full alignment problem for fully autonomous AGI systems" and links to Arbital.

> In practice, essentially all of MIRI’s work seems to fit within this narrower definition, so I’m not too concerned at the moment with this practical issue

I think you may have misunderstood what I meant by "practical issue". My point was that if you say something like "I think AI alignment is the most urgent problem to work on" the listener could easily misinterpret you as meaning "alignment" in the MIRI/Arbital sense. Or if I say "AI alignment is the most urgent problem to work on" in the MIRI/Arbital sense of alignment, the listener could easily misinterpret me as meaning "alignment" in your sense.

Again my feeling is that MIRI started using alignment in the broader sense first and therefore that definition ought to have priority. If you disagree with this, I could try to do some more historical research to show this. (For example by figuring out when those Arbital articles were written, which I currently don't know how to do.)

comment by paulfchristiano · 2018-11-23T20:52:55.023Z · LW(p) · GW(p)

> Again my feeling is that MIRI started using alignment in the broader sense first and therefore that definition ought to have priority. If you disagree with this, I could try to do some more historical research to show this. (For example by figuring out when those Arbital articles were written, which I currently don't know how to do.)

I think MIRI's first use of this term was here where they said “We call a smarter-than-human system that reliably pursues beneficial goals ‘aligned with human interests’ or simply ‘aligned.’” which is basically the same as my definition. (Perhaps slightly weaker, since "do what the user wants you to do" is just one beneficial goal.) This talk never defines alignment, but the slide introducing the big picture says "Take-home message: We’re afraid it’s going to be technically difficult to point AIs in an intuitively intended direction" which also really suggests it's about trying to point your AI in the right direction.

The actual discussion on that Arbital page strongly suggests that alignment is about pointing an AI in a direction, though I suppose that may merely be an instance of suggestively naming the field "alignment" and then defining it to be "whatever is important" as a way of smuggling in the connotation that pointing your AI in the right direction is the important thing. All of the topics in the "AI alignment" domain (except for mindcrime, which is borderline) fit under the narrower definition; the listed alignment researchers are all people working on the narrower problem.

So I think the way this term is used in practice basically matches this narrower definition.

As I mentioned, I was previously happily using the term "AI control." Rob Bensinger suggested that I stop using that term and instead use AI alignment, proposing a definition of alignment that seemed fine to me.

I don't think the very broad definition is what almost anyone has in mind when they talk about alignment. It doesn't seem to be matching up with reality in any particular way, except insofar as it's capturing the problems that a certain group of people work on. I don't really see any argument in favor except the historical precedent, which I think is dubious in light of all of the conflicting definitions, the actual usage, and the explicit move to standardize on "alignment" where an alternative definition was proposed.

(In the discussion, the compromise definition suggested was "cope with the fact that the AI is not trying to do what we want it to do, either by aligning incentives or by mitigating the effects of misalignment.")

> The "alignment problem for advanced agents" or "AI alignment" is the overarching research topic of how to develop sufficiently advanced machine intelligences such that running them produces good outcomes in the real world.

Is this intended (/ do you understand this) to include things like "make your AI better at predicting the world," since we expect that agents who can make better predictions will achieve better outcomes?

If this isn't included, is that because "sufficiently advanced" includes making good predictions? Or because of the empirical view that ability to predict the world isn't an important input into producing good outcomes? Or something else?

If this definition doesn't distinguish alignment from capabilities, then that seems like a non-starter to me which is neither useful nor captures the typical usage.

If this excludes making better predictions because that's assumed by "sufficiently advanced agent," then I have all sorts of other questions (does "sufficiently advanced" include all particular empirical knowledge relevant to making the world better? does it include some arbitrary category not explicitly carved out in the definition?)

In general, the alternative broader usage of AI alignment is broad enough to capture lots of problems that would exist whether or not we built AI. That's not so different from using the term to capture (say) physics problems that would exist whether or not we built AI, both feel bad to me.

Independently of this issue, it seems like "the kinds of problems you are talking about in this thread" need better descriptions whether or not they are part of alignment (since even if they are part of alignment, they will certainly involve totally different techniques/skills/impact evaluations/outcomes/etc.).

comment by Wei_Dai · 2018-11-23T22:38:14.285Z · LW(p) · GW(p)

> The actual discussion on that Arbital page strongly suggests that alignment is about pointing an AI in a direction

But the page includes:

"AI alignment theory" is meant as an overarching term to cover the whole research field associated with this problem, including, e.g., the much-debated attempt to estimate how rapidly an AI might gain in capability once it goes over various particular thresholds.

which seems to be outside of just "pointing an AI in a direction".

> Is this intended (/ do you understand this) to include things like “make your AI better at predicting the world,” since we expect that agents who can make better predictions will achieve better outcomes?

I think so, at least for certain kinds of predictions that seem especially important (i.e., may lead to x-risk if done badly), see this Arbital page which is under AI Alignment:

Vingean reflection is reasoning about cognitive systems, especially cognitive systems very similar to yourself (including your actual self), under the constraint that you can't predict the exact future outputs. We need to make predictions about the consequence of operating an agent in an environment via reasoning on some more abstract level, somehow.

> If this definition doesn’t distinguish alignment from capabilities, then that seems like a non-starter to me which is neither useful nor captures the typical usage.

It seems to me that Rohin's proposal of distinguishing between "motivation" and "capabilities" is a good one, and then we can keep using "alignment" for the set of broader problems that are in line with the MIRI/Arbital definition and examples.

> In general, the alternative broader usage of AI alignment is broad enough to capture lots of problems that would exist whether or not we built AI. That’s not so different from using the term to capture (say) physics problems that would exist whether or not we built AI, both feel bad to me.

It seems fine to me to include 1) problems that are greatly exacerbated by AI and 2) problems that aren't caused by AI but may be best solved/ameliorated by some element of AI design, since these are problems that AI researchers have a responsibility over and/or can potentially contribute to. If there's a problem that isn't exacerbated by AI and does not seem likely to have a solution within AI design then I'd not include that.

> Independently of this issue, it seems like “the kinds of problems you are talking about in this thread” need better descriptions whether or not they are part of alignment (since even if they are part of alignment, they will certainly involve totally different techniques/skills/impact evaluations/outcomes/etc.).

Sure, agreed.

comment by paulfchristiano · 2018-11-18T22:28:00.085Z · LW(p) · GW(p)

> for me another major con is that this decomposition moves some problems that I think are crucial and urgent out of "AI alignment" and into the "competence" part, with the implicit or explicit implication that they are not as important, for example the problem of obtaining or helping humans to obtain a better understanding of their values and defending their values against manipulation from other AIs.

I think it's bad to use a definitional move to try to implicitly prioritize or deprioritize research. I think I shouldn't have written: "I like it less because it includes many subproblems that I think (a) are much less urgent, (b) are likely to involve totally different techniques than the urgent part of alignment."

That said, I do think it's important that these seem like conceptually different problems and that different people can have different views about their relative importance---I really want to discuss them separately, try to solve them separately, compare their relative values (and separate that from attempts to work on either).

I don't think it's obvious that alignment is higher priority than these problems, or than other aspects of safety. I mostly think it's a useful category to be able to talk about separately. In general I think that it's good to be able to separate conceptually separate categories, and I care about that particularly much in this case because I care particularly much about this problem. But I also grant that the term has inertia behind it and so choosing its definition is a bit loaded and so someone could object on those grounds even if they bought that it was a useful separation.

(I think that "defending their values against manipulation from other AIs" wasn't included under any of the definitions of "alignment" proposed by Rob in our email discussion about possible definitions, so it doesn't seem totally correct to refer to this as "moving" those subproblems, so much as there already existing a mess of imprecise definitions some of which included and some of which excluded those subproblems.)

comment by rohinmshah · 2018-11-17T18:37:22.586Z · LW(p) · GW(p)

Yeah, that seems right. I would probably defend the claim that motivation contains the most urgent part in the same way that Paul has done in the past -- it seems likely to be easy to get a well motivated AI system to realize that it should help us understand our values, and that it should not do irreversible high-impact actions until then. I'm less optimistic about defending values against manipulation, because you probably need to be very competent for that, and you can't take your time to become more competent, but that seems like a further-away problem to me and so less urgent.

(I don't think I have much to add over the discussions you and Paul have had in the past, but I'm happy to clarify my opinion if it seems useful to you -- perhaps my way of stating things will click where Paul's way didn't, idk. Or I might have different opinions and not realize it.)

I would support treating this simply as a decomposition, without also packing in the implication that motivation/competence corresponds to urgent/not-urgent, though I suspect it is quite hard to do that now.

comment by Wei_Dai · 2018-11-18T12:20:46.592Z · LW(p) · GW(p)

> I’m happy to clarify my opinion if it seems useful to you—perhaps my way of stating things will click where Paul’s way didn’t

I would highly welcome that. BTW if you see me argue with Paul in the future (or in the past) and I seem to be not getting something, please feel free to jump in and explain it a different way. I often find it easier to understand one of Paul's ideas from someone else's explanation.

> it seems likely to be easy to get a well motivated AI system to realize that it should help us understand our values

Yes, that seems easy, but actually helping seems much harder.

> and that it should not do irreversible high-impact actions until then

How do you determine what is "high-impact" before you have a utility function? Even "reversible" is relative to a utility function, right? It doesn't mean that you literally can reverse all the consequences of an action, but rather that you can reverse the impact of that action on your utility?

It seems to me that "avoid irreversible high-impact actions" would only work if one had a small amount of uncertainty over one's utility function, in which case you could just avoid actions that are considered "irreversible high-impact" by any of the utility functions that you have significant probability mass on. But if you had a large amount of uncertainty, or just have very little idea what your utility function looks like, that doesn't work because almost any action could be "irreversible high-impact". For example if I were a negative utilitarian I perhaps ought to spend all my resources trying to stop technological progress leading to space colonization, so anything that I do besides that would be "irreversible high-impact" unless I could go back in time and change my resource allocation.
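The rule described above (avoid any action that some plausible utility function counts as "irreversible high-impact") can be sketched as a simple filter. This is only a toy illustration under stated assumptions: the action names, the probability threshold `min_mass`, and the idea of representing each candidate utility function by a yes/no judge are all hypothetical and come from neither the post nor the comments.

```python
from typing import Callable, List, Tuple

# Hypothetical representation: each candidate utility function is reduced to a
# judge that says whether it considers an action "irreversible high-impact".
Judge = Callable[[str], bool]


def permitted_actions(
    actions: List[str],
    candidates: List[Tuple[float, Judge]],
    min_mass: float = 0.05,
) -> List[str]:
    """Keep only actions that no significantly probable candidate flags."""
    significant = [judge for prob, judge in candidates if prob >= min_mass]
    return [a for a in actions if not any(judge(a) for judge in significant)]


actions = ["make tea", "plant a tree", "colonize space", "halt technology"]

# Small uncertainty: two similar candidates that flag only a few actions.
narrow = [
    (0.6, lambda a: a in {"colonize space"}),
    (0.4, lambda a: a in {"colonize space", "halt technology"}),
]

# Larger uncertainty: add a candidate (loosely modeled on the negative
# utilitarian example) that flags everything except halting technology.
broad = [
    (0.4, lambda a: a in {"colonize space"}),
    (0.4, lambda a: a in {"colonize space", "halt technology"}),
    (0.2, lambda a: a != "halt technology"),
]

print(permitted_actions(actions, narrow))  # ['make tea', 'plant a tree']
print(permitted_actions(actions, broad))   # []
```

With the narrow set, some actions survive the filter; once a sufficiently different candidate carries real probability mass, nothing does, which is the failure mode the paragraph describes.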

BTW, here is a section from a draft post that I'm working on. Do you think it would be easy to solve or avoid all of these problems? (This post isn't specifically addressing Paul's approach so some of them may be easy to solve under his approach but I don't think all of them are.)

How to prevent "aligned" AIs from unintentionally corrupting human values? We know that ML systems tend to have problems with adversarial examples and distributional shifts in general. There seems to be no reason not to expect that human value functions have similar problems, which even "aligned" AIs could trigger unless they are somehow designed not to. For example, such AIs could give humans so much power so quickly or put them in such novel situations that their moral development can't keep up, so their value systems no longer give sensible answers. (Sort of the AI assisted version of the classic "power corrupts" problem.) AIs could give us new options that are irresistible to some parts of our motivational systems, like more powerful versions of video game and social media addiction. Even in the course of trying to figure out how the world could be made better for us, they could in effect be searching for adversarial examples on our value functions. Finally, at our own request or in a sincere attempt to help us, they could generate philosophical or moral arguments that are wrong but extremely persuasive.

(Some of these issues, like the invention of new addictions and new technologies in general, would happen even without AI, but I think AIs would likely, by default, strongly exacerbate the problem by differentially accelerating such technologies faster than progress in understanding how to avoid or safely handle them.)

> I’m less optimistic about defending values against manipulation, because you probably need to be very competent for that, and you can’t take your time to become more competent, but that seems like a further-away problem to me and so less urgent.

Why is that a further-away problem? Even if it is, we still need people to work on them now, if only to generate persuasive evidence in case they really are very hard problems so we can pursue some other strategy to avoid them like stopping or delaying the development of advanced AI as much as possible.

comment by paulfchristiano · 2018-11-22T20:26:58.946Z · LW(p) · GW(p)

> How to prevent "aligned" AIs from unintentionally corrupting human values? We know that ML systems tend to have problems with adversarial examples and distributional shifts in general. There seems to be no reason not to expect that human value functions have similar problems, which even "aligned" AIs could trigger unless they are somehow designed not to. For example, such AIs could give humans so much power so quickly or put them in such novel situations that their moral development can't keep up, so their value systems no longer give sensible answers. (Sort of the AI assisted version of the classic "power corrupts" problem.) AIs could give us new options that are irresistible to some parts of our motivational systems, like more powerful versions of video game and social media addiction. Even in the course of trying to figure out how the world could be made better for us, they could in effect be searching for adversarial examples on our value functions. Finally, at our own request or in a sincere attempt to help us, they could generate philosophical or moral arguments that are wrong but extremely persuasive.

My position on this (that might be clear from previous discussions):

• I agree this is a real problem.
• From a technical perspective, I think this is even further from the alignment problem (than other AI safety problems), so I definitely think it should be studied separately and deserves a separate name. (Though the last bullet point in this comment implicitly gives an argument in the other direction.)
• I'd normally frame this problem as "society's values will evolve over time, and we have preferences about how they evolve." New technology might change things in ways we don't endorse. Natural pressures like death may lead to changes we don't endorse (though that's a tricky values call). The constraint of remaining economically/militarily competitive could also force our values to evolve in a bad way (alignment is an instance of that problem, and eventually AI+alignment would address the other natural instance by decoupling human values from the competence needed to remain competitive). And of course there is a hard problem in that we don't know how to deliberate/reflect. The "figure out how to deliberate" problem seems like it is relatively easily postponed, since you don't have to solve it until you are doing deliberation, but the "help people avoid errors in deliberation" may be more urgent.
• The reason I consider alignment more urgent is entirely quantitative and very empirically contingent, I don't think there is any simple argument against. I think there is a >1/3 chance that AI will be solidly superhuman within 20 subjective years, and that in those scenarios alignment destroys maybe 20% of the total value of the future, leading to 0.3%/year of losses from alignment, and right now it looks reasonably tractable. Influencing the trajectory of society's values in other ways seems significantly worse than that to me (maybe 10x less cost-effective?). I think it would be useful to do some back-of-the-envelope calculations for the severity of value drift and the case for working on it.
• I don't think I'm likely to work on this problem unless I become much more pessimistic about working on alignment (e.g. because the problem is much harder or easier than I currently believe). I feel like I've already poked at it enough that the value of information (VOI) from more poking is lower than just charging ahead on alignment. But that is a stronger judgment than the last section, and I think is largely due to comparative advantage considerations, and I would certainly be supportive of work on this topic (e.g. would be happy to fund, would engage with it, etc.)
• This is a leading contender for what I would do if alignment seemed unappealing, though I think that broader institutional improvement / capability enhancement / etc. seems more appealing. I'd definitely spend more time thinking about it.
• I think that important versions of these problems really do exist with or without AI, although I agree that AI will accelerate the point at which they become critical while it's not obvious whether it will accelerate solutions. I don't think this is particularly important but does make me feel even more comfortable with the naming issue---this isn't really a problem about AI at all, it's just one of many issues that is modulated by AI.
• I think the main way AI is relevant to the cost-effectiveness analysis of shaping-the-evolution-of-values is that it may decrease the amount of work that can be done on these problems between now and when they become serious (if AI is effectively accelerating the timeline for catastrophic value change without accelerating work on making values evolve in a way we'd endorse).
• To the extent that the value of working on these problems is dominated by that scenario---"AI has a large comparative disadvantage at helping us solve philosophical problems / thinking about long-term trajectory / etc."---then I think that one of the most promising interventions on this problem is improving the relative capability of AI at problems of this form. My current view is that working on factored cognition (and similarly on debate) is a reasonable approach to that. This isn't a super important consideration, but it overall makes me (a) a bit more excited about factored cognition (especially in worlds where the broader iterated amplification program breaks down), (b) a bit less concerned about figuring out whether relative capabilities is more or less important than alignment.
• I would like to have clearer ways of talking and thinking about these problems, but (a) I think the next step is probably developing a better understanding (or, if someone has a much better understanding, then a development of a better shared understanding), (b) I really want a word other than "alignment," and probably multiple words. I guess the one that feels most urgently-unnamed right now is something like: understanding how values evolve and what features may introduce that evolution in a way we don't endorse, including both social dynamics, environmental factors, the need to remain competitive, and the dynamics of deliberation and argumentation.
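
The "0.3%/year" figure in the quantitative bullet above can be reproduced with a short calculation. This is only a sketch of the arithmetic: the three inputs are the rough numbers quoted in that bullet, the function name is made up for illustration, and spreading the loss evenly over the 20-year horizon is an assumption about how the figure was reached.

```python
# Back-of-the-envelope version of the estimate in the bullet above.
# The inputs are the rough figures quoted in the comment, not measured data.

def annualized_expected_loss(p_scenario: float, loss_fraction: float, years: float) -> float:
    """Expected fraction of future value lost per year.

    Assumption: the expected loss is spread evenly over the horizon.
    """
    return p_scenario * loss_fraction / years

p_superhuman = 1 / 3   # ">1/3 chance" of solidly superhuman AI (lower bound used)
loss_if_so = 0.20      # "maybe 20%" of the future's value lost in those scenarios
horizon_years = 20     # "within 20 subjective years"

per_year = annualized_expected_loss(p_superhuman, loss_if_so, horizon_years)
print(f"{per_year:.2%} per year")  # prints "0.33% per year"
```

Because ">1/3" is a lower bound, the result is best read as "at least roughly 0.3% per year" under these inputs.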
comment by Wei_Dai · 2018-11-24T07:36:16.444Z · LW(p) · GW(p)

I’d normally frame this problem as “society’s values will evolve over time, and we have preferences about how they evolve.”

This statement of the problem seems to assume a subjectivist or anti-realist view of metaethics (items 4 or 5 on this list [LW · GW]). Consider the analogous statement, "mathematicians' beliefs about mathematical statements will evolve over time, and they have preferences about how their beliefs evolve". I think a lot of mathematicians would object to that and instead say that they prefer to have true beliefs about mathematics, and their "preferences about how their beliefs evolve" are just their best guesses about how to arrive at true beliefs.

Assuming you agree that we can't be certain about which metaethical position is correct yet, I think by implicitly adopting a subjectivist/anti-realist framing, you make the problem seem easier than we should expect it to be. It implies that instead of the AI (and indirectly the AI designer) potentially having (if a realist or relativist metaethical position is correct) an obligation/opportunity to help the user figure out what their true or normative values are, which may involve solving difficult metaethical and other philosophical questions, the AI can just follow the user's preferences about how their values evolve.

Additionally, this framing also makes the potential consequences of failing to solve the problem sound less serious than it could potentially be. I.e., if there is such a thing as someone's true or normative values, then failing to optimize the universe for those values is really bad, but if they just have preferences about how their values evolve, then even if their values fail to evolve in that way, at least whatever values the universe ends up being optimized for are still their values, so not all is lost.

I think I would prefer to frame the problem as "How can we design/use AI to prevent the corruption of human values, especially corruption caused/exacerbated by the development of AI?" and would consider this an instance of the more general problem "When considering AI safety, it's not safe to assume that the human user/operator/supervisor is a generally safe agent."

Influencing the trajectory of society’s values in other ways seems significantly worse than that to me (maybe 10x less cost-effective?). I think it would be useful to do some back-of-the-envelope calculations for the severity of value drift and the case for working on it.

To me the x-risk of corrupting human values by well-motivated AI is comparable to the x-risk caused by badly-motivated AI (and both higher than 20% conditional on superhuman AI within 20 subjective years), but I'm not sure how to argue this with you. Even if the total risk of "value corruption" is 10x smaller, it seems like the marginal impact of an additional researcher on "value corruption" could be higher given that there are now about 20(?) full-time researchers working mostly on AI motivation but zero on this problem (as far as I know), and then we also have to consider the effect of a marginal researcher on the future growth of each field, and future effects on public opinion and policy makers. Unfortunately, I don't know how to calculate these things even in a back-of-the-envelope way. As a rule of thumb, "if one x-risk seems X times bigger than another, it should have about X times as many people working on it" is intuitively appealing to me, and suggests we should have at least 2 people working on "value corruption" even if you think that risk is 10x smaller, but I'm not sure if that makes sense to you.
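
The rule of thumb above amounts to a one-line proportion. The sketch below only illustrates that arithmetic; the function name is hypothetical, and the inputs are the approximate figures from the paragraph (about 20 researchers on motivation, a risk assumed to be 10x smaller).

```python
# Proportional-allocation rule of thumb from the paragraph above.
# Inputs are the approximate figures quoted there, not survey data.

def proportional_headcount(reference_headcount: float, risk_ratio: float) -> float:
    """People the smaller risk 'should' have if headcount scales linearly with risk.

    risk_ratio: how many times bigger the reference risk is than the other one.
    """
    return reference_headcount / risk_ratio

motivation_researchers = 20  # "about 20(?)" full-time researchers on AI motivation
risk_ratio = 10              # value corruption assumed to be 10x smaller a risk

value_corruption_researchers = proportional_headcount(motivation_researchers, risk_ratio)
print(value_corruption_researchers)  # prints 2.0
```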

I don’t think I’m likely to work on this problem unless I become much more pessimistic about working on alignment

I see no reason to convince you personally to work on "value corruption" since your intuition on the relative severity of the risks is so different from mine, and under either of our views we obviously still need people to work on motivation / alignment-in-your-sense. I'm just hoping that you won't (intentionally or unintentionally) discourage people from working on "value corruption" so strongly that they don't even consider looking into that problem and forming their own conclusions based on their own intuitions/priors.

To the extent that the value of working on these problems is dominated by that scenario---“AI has a large comparative disadvantage at helping us solve philosophical problems / thinking about long-term trajectory / etc.”---then I think that one of the most promising interventions on this problem is improving the relative capability of AI at problems of this form. My current view is that working on factored cognition (and similarly on debate) is a reasonable approach to that. This isn’t a super important consideration, but it overall makes me (a) a bit more excited about factored cognition (especially in worlds where the broader iterated amplification program breaks down), (b) a bit less concerned about figuring out whether relative capabilities is more or less important than alignment.

This seems totally reasonable to me, but 1) others may have other ideas about how to intervene on this problem, and 2) even within factored cognition or debate there are probably research directions that skew towards being more applicable to motivation and research directions that skew towards being more applicable to "value corruption" and I don't want people to be excessively discouraged from working on the latter by statements like "motivation contains the most urgent part".

comment by paulfchristiano · 2018-11-24T20:01:42.646Z · LW(p) · GW(p)

To me the x-risk of corrupting human values by well-motivated AI is comparable to the x-risk caused by badly-motivated AI (and both higher than 20% conditional on superhuman AI within 20 subjective years), but I'm not sure how to argue this with you.

If you think this risk is very large, presumably there is some positive argument for why it's so large? That seems like the most natural way to run the argument. I agree it's not clear what exactly the norms of argument here are, but the very basic one seems to be sharing the reason for great concern.

In the case of alignment there are a few lines of argument that we can flesh out pretty far. The basic structure is something like: "(a) if we built AI with our current understanding there is a good chance it would not be trying to do what we wanted or have enough overlap to give the future substantial value, (b) if we built sufficiently competent AI, the future would probably be shaped by its intentions, (c) we have a significant risk of not developing sufficiently better understanding prior to having the capability to build sufficiently competent AI, (d) we have a significant risk of building sufficiently competent AI even if we don't have sufficiently good understanding." (Each of those claims obviously requires more argument, etc.)

One version of the case for worrying about value corruption would be:

• It seems plausible that the values pursued by humans are very sensitive to changes in their environment.
• It may be that historical variation is itself problematic, and we care mostly about our particular values.
• Or it may be that values are "hardened" against certain kinds of environment shift that occur in nature, and that they will go to some lower "default" level of robustness under new kinds of shifts.
• Or it may be that normal variation is OK for decision-theoretic reasons (since we are the beneficiaries of past shifts) but new kinds of variation are not OK.
• If so, the rate of change in subjective time could be reasonably high---perhaps the change that occurs within one generation could shift value far enough to reduce value by 50% (if that change wasn't endorsed for decision-theoretic reasons / hardened against).
• It's plausible, perhaps 50%, that AI will accelerate kinds of change that lead to value drift radically more than it accelerates an understanding of how to prevent such drift.
• A good understanding of how to prevent value drift might be used / be a major driver of how well we prevent such drift. (Or maybe some other foreseeable institutional characteristics could have a big effect on how much drift occurs.)
• If so, then it matters a lot how well we understand how to prevent such drift at the time when we develop AI. Perhaps there will be several generations worth of subjective time / drift-driving change before we are able to do enough additional labor to obsolete our current understanding (since AI is accelerating change but not the relevant kind of labor).
• Our current understanding may not be good, and there may be a realistic prospect of having a much better understanding.

This kind of story is kind of conjunctive, so I'd expect to explore a few lines of argument like this, and then try to figure out what are the most important underlying uncertainties (e.g. steps that appear in most arguments of this form, or a more fundamental underlying cause for concern that generates many different arguments).

My most basic concerns with this story are things like:

• In "well-controlled" situations, with principals who care about this issue, it feels like we already have an OK understanding of how to avert drift (conditioned on solving alignment). It seems like the basic idea is to decouple evolving values from the events in the world that are actually driving competitiveness / interacting with the natural world / realizing people's consumption / etc., which is directly facilitated by alignment. The extreme form of this is having some human in a box somewhere (or maybe in cold storage) who will reflect and grow on their own schedule, and who will ultimately assume control of their resources once reaching maturity. We've talked a little bit about this, and you've pointed out some reasons this kind of scheme isn't totally satisfactory even if it works as intended, but quantitatively the reasons you've pointed to don't seem to be probable enough (per economic doubling, say) to make the cost-benefit analysis work out.
• In most practical situations, it doesn't seem like "understanding of how to avert drift" is the key bottleneck to averting drift---it seems like the basic problem is that most people just don't care about averting drift at all, or have any inclination to be thoughtful about how their own preferences evolve. That's still something you can intervene on, but it feels like a huge morass where you are competing with many other forces.

In the end I'm doing a pretty rough calculation that depends on a whole bunch of stuff, but those feel like they are maybe the most likely differences in view / places where I have something to say. Overall I still think this problem is relatively important, but that's how I get to the intuitive view that it's maybe ~10x lower impact. I would grant the existence of (plenty of) people for whom it's higher impact though.

As a rule of thumb, "if one x-risk seems X times bigger than another, it should have about X times as many people working on it" is intuitively appealing to me, and suggests we should have at least 2 people working on "value corruption" even if you think that risk is 10x smaller, but I'm not sure if that makes sense to you.

I think that seems roughly right, probably modulated by some O(1) factor reflecting tractability or other factors not captured in the total quantity of risk---maybe I'd expect us to have 2-10x more resources per unit risk devoted to more tractable risks.

In this case I'd be happy with the recommendation of ~10x more people working on motivation than on value drift; that feels like the right ballpark for basically the same reason that motivation feels ~10x more impactful.

I'm just hoping that you won't (intentionally or unintentionally) discourage people from working on "value corruption" so strongly that they don't even consider looking into that problem and forming their own conclusions based on their own intuitions/priors. [...] I don't want people to be excessively discouraged from working on the latter by statements like "motivation contains the most urgent part".

I do think that motivation contains the most urgent/important part and feel pretty comfortable expressing that view (for the same reasons I'm generally inclined to express my views), but could hedge more when making statements like this.

(I think saying "X is more urgent than Y" is basically compatible with the view "There should be 10 people working on X for each person working on Y," even if one also believes "but actually on the current margin investment in Y might be a better deal." Will edit the post to be a bit softer here though.

ETA: actually I think the language in the post basically reflects what I meant, the broader definition seems worse because it contains tons of stuff that is lower priority. The narrower definition doesn't contain every problem that is high priority, it just contains a single high priority problem, which is better than a really broad basket containing a mix of important and not-that-important stuff. But I will likely write a separate post or two at some point about value drift and other important problems other than motivation.)

comment by Wei_Dai · 2018-11-24T21:58:42.450Z · LW(p) · GW(p)

If you think this risk is very large, presumably there is some positive argument for why it’s so large?

Yeah, I didn't literally mean that I don't have any arguments, but rather that we've discussed it in the past and it seems like we didn't get close to resolving our disagreement. I tend to think that Aumann Agreement doesn't apply to humans, and it's fine to disagree on these kinds of things. Even if agreement ought to be possible in principle (which again I don't think is necessarily true for humans), if you think that even from your perspective the value drift/corruption problem is currently overly neglected, then we can come back and revisit this at another time (e.g., when you think there are too many people working on this problem, which might never actually happen).

it seems like the basic problem is that most people just don’t care about averting drift at all, or have any inclination to be thoughtful about how their own preferences evolve

I don't understand how this is compatible with only 2% loss from value drift/corruption. Do you perhaps think the actual loss is much bigger, but almost certainly we just can't do anything about it, so 2% is how much you expect we can potentially "save" from value drift/corruption? Or are you taking an anti-realist position and saying something like, if someone doesn't care about averting drift/corruption, then however their values drift that doesn't constitute any loss?

The narrower definition doesn’t contain every problem that is high priority, it just contains a single high priority problem, which is better than a really broad basket containing a mix of important and not-that-important stuff.

I don't understand "better" in what sense. Whatever it is, why wouldn't it be even better to have two terms, one of which is broadly defined so as to include all the problems that might be urgent but also includes lower priority problems and problems whose priority we're not sure about, and another one that is defined to be a specific urgent problem. Do you currently have any objections to using "AI alignment" as the broader term (in line with the MIRI/Arbital definition and examples) and "AI motivation" as the narrower term (as suggested by Rohin)?

comment by paulfchristiano · 2018-11-26T22:24:24.428Z · LW(p) · GW(p)

Do you currently have any objections to using "AI alignment" as the broader term (in line with the MIRI/Arbital definition and examples) and "AI motivation" as the narrower term (as suggested by Rohin)?

Yes:

• The vast majority of existing usages of "alignment" should then be replaced by "motivation," which is more specific and usually just as accurate. If you are going to split a term into new terms A and B, and you find that the vast majority of existing usage should be A, then I claim that "A" should be the one that keeps the old word.
• The word "alignment" was chosen (originally by Stuart Russell I think) precisely because it is such a good name for the problem of aligning AI values with human values; it's a word that correctly evokes what that problem is about. This is also how MIRI originally introduced the term. (I think they introduced it here, where they said "We call a smarter-than-human system that reliably pursues beneficial goals “aligned with human interests” or simply “aligned.”") Everywhere that anyone talks about alignment they use the analogy with "pointing," and even MIRI folks usually talk about alignment as if it were mostly or entirely about pointing your AI in the right direction.
• In contrast, "alignment" doesn't really make sense as a name for the entire field of problems about making AI good. For the problem of making AI beneficial we already have the even older term "beneficial AI," which really means exactly that. In explaining why MIRI doesn't like that term, Rob said

Some of the main things I want from a term are:

A. It clearly and consistently keeps the focus on system design and engineering, and whatever technical/conceptual groundwork is needed to succeed at such. I want to make it easy for people (if they want to) to just hash out those technical issues, without feeling any pressure to dive into debates about bad actors and inter-group dynamics, or garden-variety machine ethics and moral philosophy, which carry a lot of derail / suck-the-energy-out-of-the-room risk.

[…] ["AI safety" or "beneficial AI"] doesn't work so well for A -- it's commonly used to include things like misuse risk.

• [continuing last point] The proposed usage of "alignment" doesn't meet this desideratum, though: it has exactly the same problem as "beneficial AI," except that it's historically associated with this community. In particular it absolutely includes "garden-variety machine ethics and moral philosophy." Yes, there is all sorts of stuff that MIRI or I wouldn't care about that is relevant to "beneficial" AI, but under the proposed definition of alignment it's also relevant to "aligned" AI. (This statement by Rob also makes me think that you wouldn't in fact be happy with what he at least means by "alignment," since I take it you explicitly mean to include moral philosophy?)
• People have introduced a lot of terms and change terms frequently. I've changed the language on my blog multiple times at other people's request. This isn't costless, it really does make things more and more confusing.
• I think "AI motivation" is not a good term for this area of study: it (a) suggests it's about the study of AI motivation rather than engineering AI to be motivated to help humans, (b) is going to be perceived as aggressively anthropomorphizing (even if "alignment" is only slightly better), (c) is generally less optimized (related to the second point above, "alignment" is quite a good term for this area).
• Probably "alignment" / "value alignment" would be a better split of terms than "alignment" vs. "motivation". "Value alignment" has traditionally been used with the de re reading, but I could clarify that I'm working on de dicto value alignment when more precision is needed (everything I work on is also relevant on the de re reading, so the other interpretation is also accurate and just less precise).

I guess I have an analogous question for you: do you currently have any objections to using "beneficial AI" as the broader term, and "AI alignment" as the narrower term?

comment by Wei_Dai · 2018-11-27T00:21:04.351Z · LW(p) · GW(p)

This is also how MIRI originally introduced the term. (I think they introduced it here, where they said “We call a smarter-than-human system that reliably pursues beneficial goals “aligned with human interests” or simply “aligned.””)

But that definition seems quite different from your "A is trying to do what H wants it to do." For example, if H has a wrong understanding of his/her true or normative values and as a result wants A to do something that is actually harmful, then under your definition A would still be "aligned" but under MIRI's definition it wouldn't be (because it wouldn't be pursuing beneficial goals).

This statement by Rob also makes me think that you wouldn’t in fact be happy with what he at least means by “alignment,” since I take it you explicitly mean to include moral philosophy?

I think that's right. When I say MIRI/Arbital definition of "alignment" I'm referring to what they've posted publicly, and I believe it does include moral philosophy. Rob's statement that you quoted seems to be a private one (I don't recall seeing it before and can't find it through Google search) but I can certainly see how it muddies the waters from your perspective.

Probably “alignment” / “value alignment” would be a better split of terms than “alignment” vs. “motivation”. “Value alignment” has traditionally been used with the de re reading, but I could clarify that I’m working on de dicto value alignment when more precision is needed

This seems fine to me, if you could give the benefit of the doubt as to when more precision is needed. I'm basically worried about this scenario: You or someone else writes something like "I'm cautiously optimistic about Paul's work." The reader recalls seeing you say that you work on "value alignment". They match that to what they've read from MIRI about how aligned AI "reliably pursues beneficial goals", and end up thinking that is easier than you'd intend, or think there is more disagreement between alignment researchers about the difficulty of the broader problem than there actually is. If you could consistently say that the goal of your work is "de dicto value alignment" then that removes most of my worry.

I guess I have an analogous question for you: do you currently have any objections to using “beneficial AI” as the broader term, and “AI alignment” as the narrower term?

This actually seems best to me on the merits of the terms alone (i.e., putting historical usage aside), and I'd be fine with it if everyone could coordinate to switch to these terms/definitions.

comment by paulfchristiano · 2018-11-27T01:11:05.352Z · LW(p) · GW(p)

But that definition seems quite different from your "A is trying to do what H wants it to do." For example, if H has a wrong understanding of his/her true or normative values and as a result wants A to do something that is actually harmful, then under your definition A would still be "aligned" but under MIRI's definition it wouldn't be (because it wouldn't be pursuing beneficial goals).

"Do what H wants me to do" seems to me to be an example of a beneficial goal, so I'd say a system which is trying to do what H wants it to do is pursuing a beneficial goal. It may also be pursuing subgoals which turn out to be harmful, if e.g. it's wrong about what H wants or has other mistaken empirical beliefs. I don't think anyone could be advocating the definition "pursues no harmful subgoals," since that basically requires perfect empirical knowledge (it seems just as hard as never taking a harmful action). Does that seem right to you?

I've been assuming that "reliably pursues beneficial goals" is weaker than the definition I proposed, but practically equivalent as a research goal.

I'm basically worried about this scenario: You or someone else writes something like "I'm cautiously optimistic about Paul's work." The reader recalls seeing you say that you work on "value alignment". They match that to what they've read from MIRI about how aligned AI "reliably pursues beneficial goals", and end up thinking that is easier than you'd intend, or think there is more disagreement between alignment researchers about the difficulty of the broader problem than there actually is. If you could consistently say that the goal of your work is "de dicto value alignment" then that removes most of my worry.

I think it's reasonable for me to be more careful about clarifying what any particular research agenda does or does not aim to achieve. I think that in most contexts that is going to require more precision than just saying "AI alignment" regardless of how the term was defined; I normally clarify by saying something like "an AI which is at least trying to help us get what we want."

This actually seems best to me on the merits of the terms alone (i.e., putting historical usage aside), and I'd be fine with it if everyone could coordinate to switch to these terms/definitions.

My guess is that MIRI folks won't like the "beneficial AI" term because it is too broad a tent. (Which is also my objection to the proposed definition of "AI alignment," as "overarching research topic of how to develop sufficiently advanced machine intelligences such that running them produces good outcomes in the real world.") My sense is that if that were their position, then you would also be unhappy with their proposed usage of "AI alignment," since you seem to want a broad tent that makes minimal assumptions about what problems will turn out to be important. Does that seem right?

(They might also dislike "beneficial AI" because of random contingent facts about how it's been used in the past, and so might want a different term with the same meaning.)

My own feeling is that using "beneficial AI" to mean "AI that produces good outcomes in the world" is basically just using "beneficial" in accordance with its usual meaning, and this isn't a case where a special technical term is needed (and indeed it's weird to have a technical term whose definition is precisely captured by a single---different---word).

comment by Wei_Dai · 2018-11-27T02:54:13.978Z · LW(p) · GW(p)

“Do what H wants me to do” seems to me to be an example of a beneficial goal, so I’d say a system which is trying to do what H wants it to do is pursuing a beneficial goal. It may also be pursuing subgoals which turn out to be harmful, if e.g. it’s wrong about what H wants or has other mistaken empirical beliefs. I don’t think anyone could be advocating the definition “pursues no harmful subgoals,” since that basically requires perfect empirical knowledge (it seems just as hard as never taking a harmful action). Does that seem right to you?

I guess both "reliable" and "beneficial" are matters of degree so "aligned" in the sense of "reliably pursues beneficial goals" is also a matter of degree. “Do what H wants A to do” would be a moderate degree of alignment whereas "Successfully figuring out and satisfying H's true/normative values" would be a much higher degree of alignment (in that sense of alignment). Meanwhile in your sense of alignment they are at best equally aligned and the latter might actually be less aligned if H has a wrong idea of metaethics or what his true/normative values are and as a result trying to figure out and satisfy those values is not something that H wants A to do.

I think that in most contexts that is going to require more precision than just saying “AI alignment” regardless of how the term was defined, I normally clarify by saying something like “an AI which is at least trying to help us get what we want.”

That seems good too.

My guess is that MIRI folks won’t like the “beneficial AI” term because it is too broad a tent. (Which is also my objection to the proposed definition of “AI alignment,” as “overarching research topic of how to develop sufficiently advanced machine intelligences such that running them produces good outcomes in the real world.”) My sense is that if that were their position, then you would also be unhappy with their proposed usage of “AI alignment,” since you seem to want a broad tent that makes minimal assumptions about what problems will turn out to be important. Does that seem right?

This paragraph greatly confuses me. My understanding is that someone from MIRI (probably Eliezer) wrote the Arbital article defining “AI alignment” as “overarching research topic of how to develop sufficiently advanced machine intelligences such that running them produces good outcomes in the real world", which satisfies my desire to have a broad tent term that makes minimal assumptions about what problems will turn out to be important. I'm fine with calling this "beneficial AI" instead of "AI alignment" if everyone can coordinate on this (but I don't know how MIRI people feel about this). I don't understand why you think 'MIRI folks won’t like the “beneficial AI” term because it is too broad a tent' given that someone from MIRI gave a very broad definition to "AI alignment". Do you perhaps think that Arbital article was written by a non-MIRI person?

comment by paulfchristiano · 2018-11-27T18:34:57.209Z · LW(p) · GW(p)
“Do what H wants A to do” would be a moderate degree of alignment whereas "Successfully figuring out and satisfying H's true/normative values" would be a much higher degree of alignment (in that sense of alignment).

In what sense is that a more beneficial goal?

• "Successfully do X" seems to be the same goal as X, isn't it?
• "Figure out H's true/normative values" is manifestly a subgoal of "satisfy H's true/normative values." Why would we care about that except as a subgoal?
• So is the difference entirely between "satisfy H's true/normative values" and "do what H wants"? Do you disagree with one of the previous two bullet points? Is the difference that you think "reliably pursues" implies something about "actually achieves"?

If the difference is mostly between "what H wants" and "what H truly/normatively values", then this is just a communication difficulty. For me adding "truly" or "normatively" to "values" is just emphasis and doesn't change the meaning.

I try to make it clear that I'm using "want" to refer to some hard-to-define idealization rather than some narrow concept, but I can see how "want" might not be a good term for this, I'd be fine using "values" or something along those lines if that would be clearer.

(This is why I wrote:

“What H wants” is even more problematic than “trying.” Clarifying what this expression means, and how to operationalize it in a way that could be used to inform an AI’s behavior, is part of the alignment problem. Without additional clarity on this concept, we will not be able to build an AI that tries to do what H wants it to do.

)

comment by Wei_Dai · 2018-11-27T19:33:39.668Z · LW(p) · GW(p)

If the difference is mostly between “what H wants” and “what H truly/normatively values”, then this is just a communication difficulty. For me adding “truly” or “normatively” to “values” is just emphasis and doesn’t change the meaning.

Ah, yes that is a big part of what I thought was the difference. (Actually I may have understood at some point that you meant "want" in an idealized sense but then forgot and didn't re-read the post to pick up that understanding again.)

ETA: I guess another thing that contributed to this confusion is your talk of values evolving over time, and of preferences about how they evolve, which seems to suggest that by "values" you mean something like "current understanding of values" or "interim values" rather than "true/normative values" since it doesn't seem to make sense to want one's true/normative values to change over time.

I try to make it clear that I’m using “want” to refer to some hard-to-define idealization rather than some narrow concept, but I can see how “want” might not be a good term for this, I’d be fine using “values” or something along those lines if that would be clearer.

I don't think "values" is good either. Both "want" and "values" are commonly used words that typically (in everyday usage) mean something like "someone's current understanding of what they want" or what I called "interim values". I don't see how you can expect people not to be frequently confused if you use either of them to mean "true/normative values". Like the situation with de re / de dicto alignment, I suggest it's not worth trying to economize on the adjectives here.

Another difference between your definition of alignment and "reliably pursues beneficial goals" is that the latter has "reliably" in it which suggests more of a de re reading. To use your example "Suppose A thinks that H likes apples, and so goes to the store to buy some apples, but H really prefers oranges." I think most people would call an A that correctly understands H's preferences (and gets oranges) more reliably pursuing beneficial goals.

Given this, perhaps the easiest way to reduce confusions moving forward is to just use some adjectives to distinguish your use of the words "want", "values", or "alignment" from other people's.

comment by green_leaf · 2019-06-05T14:32:10.278Z · LW(p) · GW(p)
If the difference is mostly between "what H wants" and "what H truly/normatively values", then this is just a communication difficulty. For me adding "truly" or "normatively" to "values" is just emphasis and doesn't change the meaning.

So "wants" means a want more general than an object-level desire (like wanting to buy oranges), and it already takes into account the possibility of H changing his mind about what he wants if H discovers that his wants contradict his normative values?

If that's right, how is this generalization defined? (E.g., the CEV was "what H wants in the limit of infinite intelligence, reasoning time and complete information".)

comment by paulfchristiano · 2018-11-27T18:43:24.281Z · LW(p) · GW(p)
I don't understand why you think 'MIRI folks won’t like the “beneficial AI” term because it is too broad a tent' given that someone from MIRI gave a very broad definition to "AI alignment". Do you perhaps think that Arbital article was written by a non-MIRI person?

I don't really know what anyone from MIRI thinks about this issue. It was a guess based on (a) the fact that Rob didn't like a number of possible alternative terms to "alignment" because they seemed to be too broad a definition, (b) the fact that virtually every MIRI usage of "alignment" refers to a much narrower class of problems than "beneficial AI" is usually taken to refer to, (c) the fact that Eliezer generally seems frustrated with people talking about other problems under the heading of "beneficial AI."

(But (c) might be driven by powerful AI vs. nearer-term concerns / all the other empirical errors Eliezer thinks people are making, (b) isn't that indicative, and (a) might be driven by other cultural baggage associated with the term / Rob was speaking off the cuff and not attempting to speak formally for MIRI.)

I'd consider it great if we standardized on "beneficial AI" to mean "AI that has good consequences" and "AI alignment" to refer to the narrower problem of aligning AI's motivation/preferences/goals.

comment by paulfchristiano · 2018-11-26T22:00:41.526Z · LW(p) · GW(p)
I don't understand how this is compatible with only 2% loss from value drift/corruption. Do you perhaps think the actual loss is much bigger, but almost certainly we just can't do anything about it, so 2% is how much you expect we can potentially "save" from value drift/corruption? Or are you taking an anti-realist position and saying something like, if someone doesn't care about averting drift/corruption, then however their values drift that doesn't constitute any loss?

10x worse was originally my estimate for cost-effectiveness, not for total value at risk.

People not caring about X prima facie decreases the returns to research on X. But may increase the returns for advocacy (or acquiring resources/influence, or more creative interventions). That bullet point was really about the returns to research.

comment by Wei_Dai · 2018-11-27T00:46:26.276Z · LW(p) · GW(p)

People not caring about X prima facie decreases the returns to research on X. But may increase the returns for advocacy (or acquiring resources/influence, or more creative interventions). That bullet point was really about the returns to research.

It's not obvious that applies here. If people don't care strongly about how their values evolve over time, that seemingly gives AIs / AI designers an opening to have greater influence over how people's values evolve over time, and implies a larger (or at least not obviously smaller) return on research into how to do this properly. Or if people care a bit about protecting their values from manipulation from other AIs but not a lot, it seems really important/valuable to reduce the cost of such protection as much as possible.

As for advocacy, it seems a lot easier (at least for someone in my position) to convince a relatively small number of AI designers to build AIs that want to help their users evolve their values in a positive way (or figuring out what their true or normative values are, or protecting their values against manipulation), than to convince all the potential users to want that themselves.

comment by paulfchristiano · 2018-11-28T02:00:19.704Z · LW(p) · GW(p)

I agree that:

• If people care less about some aspect of the future, then trying to get influence over that aspect of the future is more attractive (whether by building technology that they accept as a default, or by making an explicit trade, or whatever).
• A better understanding of how to prevent value drift can still be helpful if people care a little bit, and can be particularly useful to the people who care a lot (and there will be fewer people working to develop such understanding if few people care).

I think that both

• (a) Trying to have influence over aspects of value change that people don't much care about, and
• (b) better understanding the important processes driving changes in values

are reasonable things to do to make the future better. (Though some parts of (a) especially are somewhat zero-sum and I think it's worth being thoughtful about that.)

(I don't agree with the sign of the effect described in your comment, but don't think it's an important point / may just be a disagreement about what else we are holding equal so it seems good to drop.)

comment by Vladimir_Nesov · 2018-11-28T04:08:42.344Z · LW(p) · GW(p)

Trying to have influence over aspects of value change that people don't much care about ... [is] reasonable ... to do to make the future better

This could refer to value change in AI controllers, like Hugh in HCH, or alternatively to value change in people living in the AI-managed world. I believe the latter could be good, but the former seems very questionable (here "value" refers to true/normative/idealized preference). So it's hard for the same people to share the two roles. How do you ensure that value change remains good in the original sense without a reference to preference in the original sense, that hasn't experienced any value change, a reference that remains in control? And for this discussion, it seems like the values of AI controllers (or AI+controllers) are what's relevant.

It's agent tiling for AI+controller agents, any value change in the whole seems to be a mistake. It might be OK to change values of subagents, but the whole shouldn't show any value drift, only instrumentally useful tradeoffs that sacrifice less important aspects of what's done for more important aspects, but still from the point of view of unchanged original values (to the extent that they are defined at all).

comment by paulfchristiano · 2018-11-24T19:25:05.760Z · LW(p) · GW(p)
Assuming you agree that we can't be certain about which metaethical position is correct yet, I think by implicitly adopting a subjectivist/anti-realist framing, you make the problem seem easier than we should expect it to be.

I don't see why the anti-realist version is any easier: my preferences about how my values evolve are complex and can depend on the endpoint of that evolution process and on arbitrarily complex logical facts. I think the analogous non-realist mathematical framing is fine. If anything the realist versions seem easier to me (and this is related to why mathematics seems so much easier than morality), since you can anchor changing preferences to some underlying ground truth and have more potential prospect for error-correction, but I don't think it's a big difference.

Additionally, this framing also makes the potential consequences of failing to solve the problem sound less serious than it could potentially be. I.e., if there is such a thing as someone's true or normative values, then failing to optimize the universe for those values is really bad, but if they just have preferences about how their values evolve, then even if their values fail to evolve in that way, at least whatever values the universe ends up being optimized for are still their values, so not all is lost.

It doesn't sound that way to me, but I'm happy to avoid framings that might give people the wrong idea.

I think I would prefer to frame the problem as "How can we design/use AI to prevent the corruption of human values, especially corruption caused/exacerbated by the development of AI?"

My main complaint with this framing (and the reason that I don't use it) is that people respond badly to invoking the concept of "corruption" here---it's a fuzzy category that we don't understand, and people seem to interpret it as the speaker wanting values to remain static.

But in terms of the actual meanings rather than their impacts on people, I'd be about as happy with "avoiding corruption of values" as "having our values evolve in a positive way." I think both of them have small shortcomings as framings. My main problem with corruption is that it suggests an unrealistically bright line / downplays our uncertainty about how to think about changing values and what constitutes corruption.

comment by Wei_Dai · 2018-11-24T23:07:51.306Z · LW(p) · GW(p)

I don’t see why the anti-realist version is any easier

It seems easier in that the AI / AI designer doesn't have to worry about the user being wrong about how they want their values to evolve. But you're right that the realist version might be easier in other ways, so perhaps what I should say instead is that the problem definitely seems harder if we also include the subproblem of figuring out what the right metaethics is in the first place, and (by implicitly assuming a subset of all plausible metaethical positions) the statement of the problem that you proposed also does not convey a proper amount of uncertainty in its difficulty.

My main complaint with this framing (and the reason that I don’t use it) is that people respond badly to invoking the concept of “corruption” here—it’s a fuzzy category that we don’t understand, and people seem to interpret it as the speaker wanting values to remain static.

That's a good point that I hadn't thought of. (I guess talking about "drift" has a similar issue though, in that people might misinterpret it as the speaker wanting values to remain static.) If you or anyone else have a suggestion about how to phrase the problem so as to both avoid this issue and address my concerns about not assuming a particular metaethical position, I'd highly welcome that.

comment by paulfchristiano · 2018-11-25T01:01:28.272Z · LW(p) · GW(p)
It seems easier in that the AI / AI designer doesn't have to worry about the user being wrong about how they want their values to evolve.

That may be a connotation of the "preferences about how their values evolve," but doesn't seem like it follows from the anti-realist position.

I have preferences over what actions my robot takes. Yet if you asked me "what action do you want the robot to take?" I could be mistaken. I need not have access to my own preferences (since they can e.g. depend on empirical facts I don't know). My preferences over value evolution can be similar.

Indeed, if moral realists are right, "ultimately converge to the truth" is a perfectly reasonable preference to have about how my preferences evolve. (Though again this may not be captured by the framing "help people's preferences evolve in the way they want them to evolve.") Perhaps the distinction is that there is some kind of idealization even of the way that preferences evolve, and maybe at that point it's easier to just talk about preservation of idealized preferences (though that also has unfortunate implications and at least some minor technical problems).

I guess talking about "drift" has a similar issue though, in that people might misinterpret it as the speaker wanting values to remain static.

I agree that drift is also problematic.

comment by Wei_Dai · 2018-11-26T19:50:44.542Z · LW(p) · GW(p)

Would you agree with this way of stating it: There are more ways for someone to be wrong about their values under realism than under anti-realism. Under realism someone could be wrong even if they correctly state their preferences about how they want their values to evolve, because those preferences could themselves be wrong. So assuming an anti-realist position makes the problem sound easier because it implies there are fewer ways for the user to be wrong for the AI / AI designer to worry about.

comment by paulfchristiano · 2018-11-27T18:49:46.951Z · LW(p) · GW(p)

Could you give an example of a statement you think could be wrong on the realist perspective, for which there couldn't be a precisely analogous error on the non-realist perspective?

There is some uninteresting semantic sense in which there are "more ways to be wrong" (since there is a whole extra category of statements that have truth values...) but not a sense that is relevant to the difficulty of building an AI.

I might be using the word "values" in a different way than you are. I think I can say something like "I'd like to deliberate in way X" and be wrong. I guess under non-realism I'm "incorrectly stating my preferences" and under realism I could be "correctly stating my preferences but be wrong," but I don't see how to translate that difference into any situation where I build an AI that is adequate on one perspective but inadequate on the other.

comment by Wei_Dai · 2018-11-28T12:05:14.422Z · LW(p) · GW(p)

Suppose the user says "I want to try to figure out my true/normative values by doing X. Please help me do that." If moral anti-realism is true, then the AI can only check if the user really wants to do X (e.g., by looking into the user's brain and checking if X is encoded as a preference somewhere). But if moral realism is true, the AI could also use its own understanding of metaethics and metaphilosophy to predict if doing X would reliably lead to the user's true/normative values, and warn the user or refuse to help or take some other action if the answer is no. Or if one can't be certain about metaethics yet, and it looks like X might prematurely lock the user into the wrong values, the AI could warn the user about that.

comment by paulfchristiano · 2018-11-28T19:55:06.660Z · LW(p) · GW(p)

I definitely don't mean such a narrow sense of "want my values to evolve." Seems worth using some language to clarify that.

In general the three options seem to be:

• You care about what is "good" in the realist sense.
• You care about what the user "actually wants" in some idealized sense.
• You care about what the user "currently wants" in some narrow sense.

It seems to me that the first two are pretty similar. (And if you are uncertain about whether realism is true, and you'd be in the first case if you accepted realism, it seems like you'd probably be in the second case if you rejected realism. Of course that would depend on the nature of your uncertainty about realism; your views could depend in an arbitrary way on whether realism is true or false, depending on what versions of realism/non-realism are competing, but I'm assuming something like the most common realist and non-realist views around here.)

To defend my original usage both in this thread and in the OP, which I'm not that attached to, I do think it would be typical to say that someone made a mistake if they were trying to help me get what I wanted, but failed to notice or communicate some crucial consideration that would totally change my views about what I wanted---the usual English usage of these terms involves at least mild idealization.

comment by rohinmshah · 2018-11-18T19:53:42.271Z · LW(p) · GW(p)
Yes, that seems easy, but actually helping seems much harder.

Longer form of my opinion:

Metaphilosophy is hard, and we need to solve it eventually. This might happen by default, i.e. if we simply build a well-motivated AI without thinking about metaphilosophy and without running any social interventions designed to get the AI's operators to think about metaphilosophy, humanity might still realize that metaphilosophy needs to be solved, and then go ahead and solve it. I'm quite unsure right now whether or not it will happen by default.

However, in the world where the AI's operators don't agree that we need to solve metaphilosophy, I am very pessimistic about the AI realizing that it should help us with metaphilosophy and doing so. The one way I could imagine it happening is by programming in the right utility function (not even learning it, since if you learn it then you probably learn that metaphilosophy doesn't need to be solved), which seems hopelessly doomed. It seems really hard to make an AI system where you can predict in advance that it will help us solve metaphilosophy regardless of the operator's wishes.

In the world where the AI's operators do agree that we need to solve metaphilosophy, I think we're in a much better position. A background assumption I have is that humans motivated to solve metaphilosophy will be able to do so given enough time -- I share Paul's intuition that humans who no longer have to worry about food, water, shelter, disease, etc. could deliberate for a long time and make progress. In that case, a well-motivated AI would be fine -- it would stay deferential, perhaps learn more things in order to be more competent, and do the things we ask it to do, which might include helping us in our deliberation by bringing up arguments we hadn't considered yet. (And note a well-motivated AI should only bring up arguments it believes are true, or likely to be true.)

I've laid out two extreme ways the world could be, and of course there's a spectrum between them. But thinking about the extremes makes me think of this not as a part of AI alignment, but as a social coordination problem, that is, we need to have humanity (especially the AI's operators) agree that metaphilosophy is hard and needs to be solved. I'd support interventions that make this more likely, e.g. more public writing that talks about what we do after AGI, or about the possibility of a Great Deliberation before using the cosmic endowment, etc. If we succeed at that and at building a well-motivated AI system, I think that would be sufficient.

How do you determine what is "high-impact" before you have a utility function? Even "reversible" is relative to a utility function, right? It doesn't mean that you literally can reverse all the consequences of an action, but rather that you can reverse the impact of that action on your utility?

I mean something more like "don't do things that a human wouldn't do, that seem crazy from a human perspective". I'm not suggesting that the AI has a perfect understanding of what "irreversible" and "high-impact" mean. But it should be able to predict what things a human would find crazy for which it should probably get the human's approval before doing the thing. (As an analogy, most employees have a sense of what it is okay for them to take initiative on, vs. what they should get their manager's approval for.)

For example if I were a negative utilitarian I perhaps ought to spend all my resources trying to stop technological progress leading to space colonization, so anything that I do besides that would be "irreversible high-impact" unless I could go back in time and change my resource allocation.

Yeah, I more mean something like "continuation of the status quo" rather than "irreversible high-impact", as TurnTrout talks about below.

Do you think it would be easy to solve or avoid all of these problems?

I am not sure. I think it is relatively easy to look back at how we have responded to similar events in the past and notice that something is amiss -- for example, it seems relatively easy for an AGI to figure out that power corrupts and that humanity has not liked it when that happened, or that many humans don't like it when you take advantage of their motivational systems, and so to at least not be confident in the actions you mention. On the other hand, there may be similar types of events in the future that we can't back out by looking at the past. I don't know how to deal with these sorts of unknown unknowns.

I think sufficiently narrow AI systems have essentially no hope of solving or avoiding these problems in general, regardless of safety techniques we develop, and so in the short term to avoid these problems you want to intervene on the humans who are deploying AI systems.

Why is that a further-away problem? Even if it is, we still need people to work on them now, if only to generate persuasive evidence in case they really are very hard problems so we can pursue some other strategy to avoid them like stopping or delaying the development of advanced AI as much as possible.

Yeah, looking back I don't like that reason, I think I had an intuition that it wasn't an urgent problem and wanted to jot a quick sentence to that effect and the sentence came out wrong.

One reason it might not be urgent is because we need to aim for competitiveness anyway -- our AI systems need to be competitive so that economic incentives don't cause us to use unaligned variants.

We can also aim to have the world mostly run by aligned AI systems rather than unaligned ones, which would mean that there isn't much opportunity for us to be manipulated. You might have the intuition that even one unaligned AI could successfully manipulate everyone's values, and so we would still need the aligned AI systems to be able to defend against that. I'm not sure where I stand on that -- it seems possible to me that this is just very hard to do, especially when there are aligned superintelligent systems that would by default put a stop to it if they find out about it.

But really I'm just confused on this topic and would need to think more about it.

comment by Wei_Dai · 2018-11-19T08:09:48.544Z · LW(p) · GW(p)

we need to have humanity (especially the AI’s operators) agree that metaphilosophy is hard and needs to be solved

I'm not sure I understand your proposal here. What are they agreeing to exactly? Stopping technological development at a certain level until metaphilosophy is solved?

But it should be able to predict what things a human would find crazy for which it should probably get the human’s approval before doing the thing

Think of the human as a really badly designed AI with a convoluted architecture that nobody understands, spaghetti code, full of security holes, has no idea what its terminal values are and is really confused even about its "interim" values, has all kinds of potential safety problems like not being robust to distributional shifts, and is only "safe" in the sense of having passed certain tests for a very narrow distribution of inputs.

Clearly it's not safe for a much more powerful outer AI to query the human about arbitrary actions that it's considering, right? Instead, if the human is to contribute anything at all to safety in this situation, the outer AI has to figure out how to generate a bunch of smaller queries that the human can safely handle, from which it would then infer what the human would say if it could safely consider the actual choice under consideration. If the AI is bad at this "competence" problem it could send unsafe queries to the human and corrupt the human, and/or infer the wrong thing about what the human would approve of.

Is it clearer now why this doesn't seem like an easy problem to me?

for example, it seems relatively easy for an AGI to figure out that power corrupts and that humanity has not liked it when that happened

I'm not sure what you think the AGI would figure out, and what it would do in response to that. Are you suggesting something like, based on historical data, it would learn a classifier to predict what kind of new technologies or choices would change human values in a way that we would not like, and restrict those technologies/choices from us? It seems far from easy to do this in a robust way. I mean this classifier would be facing lots of unpredictable distributional shifts... I guess you made a similar point when you said "On the other hand, there may be similar types of events in the future that we can’t back out by looking at the past."

ETA: Do you expect that different AIs would do different things in this regard depending on how cautious their operators are? Like some AIs would learn from their operators to be really cautious, and restrict technologies/choices that they aren't sure won't corrupt humans, but other operators and their AIs won't be so cautious so a bunch of humans will be corrupted as a result, but that's a lower priority problem because you think most AI operators will be really cautious so the percentage of value lost in the universe isn't very high? (This is my current understanding of Paul's position, and I wonder if you have a different position or a different way of putting it that would convince me more.) What about the problem that the corrupted humans/AIs could produce a lot of negative utility even if they are small in numbers? What about the problem of the cautious AIs being at a competitive disadvantage against other AIs who are less cautious about what they are willing to do?

I think sufficiently narrow AI systems have essentially no hope of solving or avoiding these problems in general, regardless of safety techniques we develop, and so in the short term to avoid these problems you want to intervene on the humans who are deploying AI systems.

This seems right.

We can also aim to have the world mostly run by aligned AI systems rather than unaligned ones, which would mean that there isn’t much opportunity for us to be manipulated.

Manipulation doesn't have to come just from unaligned AIs, it could also come from AIs that are aligned to other people. For example, if an AI is aligned to Alice, and Alice sees something to be gained by manipulating Bob, the AI being aligned won't stop Alice from using it to manipulate Bob.

ETA: I forgot to mention that I don't understand this part, can you please explain more:

One reason it might not be urgent is because we need to aim for competitiveness anyway—our AI systems need to be competitive so that economic incentives don’t cause us to use unaligned variants.

comment by rohinmshah · 2018-11-19T18:24:06.198Z · LW(p) · GW(p)
I'm not sure I understand your proposal here. What are they agreeing to exactly? Stopping technological development at a certain level until metaphilosophy is solved?

I don't know, I want to outsource that decision to humans + AI at the time where it is relevant. Perhaps it involves stopping technological development. Perhaps it means continuing technological development, but not doing any space colonization. My point is simply that if humans agree that metaphilosophy needs to be solved, and the AI is trying to help humans, then metaphilosophy will probably be solved, even if I don't know how exactly it will happen.

Is it clearer now why this doesn't seem like an easy problem to me?

Yes. It seems to me like you're considering the case where a human has to be able to give the correct answer to any question of the form "is this action a good thing to do?" I'm claiming that we could instead grow the set of things the AI does gradually, to give time for humans to figure out what it is they want. So I was imagining that humans would answer the AI's questions in a frame where they have a lot of risk aversion, so anything that seemed particularly impactful would require a lot of deliberation before being approved.

I'm not sure what you think the AGI would figure out, and what it would do in response to that. Are you suggesting something like, based on historical data, it would learn a classifier to predict what kind of new technologies or choices would change human values in a way that we would not like, and restrict those technologies/choices from us?

I was thinking more of the case where a single human amassed a lot of power. Humans haven't seemed to solve the problem of predicting how new technologies/choices would change human values, so that seems like quite a hard problem to solve (but perhaps AI could do it). I meant more that conditional on the AI knowing how some new technology or choice would affect us, it seems not too hard to figure out whether we would view it as a good thing.

Do you expect that different AIs would do different things in this regard depending on how cautious their operators are?

Yes.

that's a lower priority problem because you think most AI operators will be really cautious so the percentage of value lost in the universe isn't very high?

Kind of? I'd amend that slightly to say that to the extent that I think it is a problem (I'm not sure), I want to solve it in some way that is not technical research. (Possibilities: convince everyone to be cautious, obtain a decisive strategic advantage and enforce that everyone is cautious.)

What about the problem that the corrupted humans/AIs could produce a lot of negative utility even if they are small in numbers?

Same as above.

Manipulation doesn't have to come just from unaligned AIs, it could also come from AIs that are aligned to other people. For example, if an AI is aligned to Alice, and Alice sees something to be gained by manipulating Bob, the AI being aligned won't stop Alice from using it to manipulate Bob.

Same as above. All of these problems that you're talking about would also apply to technology that could make a human smarter. It seems like it would be easiest to address on that level, rather than trying to build an AI system that can deal with these problems even though the operator would not want them to correct for the problem.

What about the problem of the cautious AIs being at a competitive disadvantage against other AIs who are less cautious about what they are willing to do?

This seems like an empirical fact that makes the problems listed above harder to solve.

I forgot to mention that I don't understand this part, can you please explain more:
One reason it might not be urgent is because we need to aim for competitiveness anyway—our AI systems need to be competitive so that economic incentives don’t cause us to use unaligned variants.

So I broadly agree with Paul's reasons for aiming for competitiveness. Given competitiveness, you might hope that we would automatically get defense against value manipulation by other AIs, since our aligned AI will defend us from value manipulation by similarly-capable unaligned AIs (or aligned AIs that other people have). Of course, defense might be a lot harder than offense, and you probably do think that, in which case this doesn't really help us. (As I said, I haven't really thought about this before.)

Overall view: I don't think that the problems you've mentioned are obviously going to be solved as a part of AI alignment. I think that solving them will require mostly interventions on humans, not on the development of AI. I am weakly optimistic that humans will actually be able to coordinate and solve these problems as a result. If I were substantially more pessimistic, I would put more effort into strategy and governance issues. (Not sure I would change what I'm doing given my comparative advantage at technical research, but it would at least change what I advise other people do.)

Meta-view on our disagreement: I suspect that you have been talking about the problem of "making the future go well" while I've been talking about the problem of "getting AIs to do what we want" (which do seem like different problems to me). Most of the problems you've been talking about don't even make it into the bucket of "getting AIs to do what we want" the way I think about it, so some of the claims (like "the urgent part is in the motivation subproblem") are not meant to quantify over the problems you're identifying. I think we do disagree on how important the problems you identify are, but not as much as you would think, since I'm quite uncertain about this area of problem-space.

comment by Wei_Dai · 2018-11-19T22:12:27.292Z · LW(p) · GW(p)

I am weakly optimistic that humans will actually be able to coordinate and solve these problems as a result.

Why isn't that also an argument against the urgency of solving AI motivation? I.e., we don't need to urgently solve AI motivation because humans will be able to coordinate to stop or delay AI development long enough to solve AI motivation at leisure?

It seems to me that coordination is really hard. Yes we have to push on that, but we also have to push on potential technical solutions because most likely coordination will fail, and there is enough uncertainty about the difficulty of technical solutions that I think we urgently need more people to investigate the problems to see how hard they really are.

Aside from that, I think it's also really important to better predict/understand just how difficult solving those problems is (both socially and technically) because that understanding is highly relevant to strategic decisions we have to make today. For example, if those problems are very difficult to solve so that in expectation we end up losing most of the potential value of the universe even if we solve AI motivation, then that greatly reduces the value of working on motivation relative to something like producing evidence of the difficulty of those problems in order to convince policymakers to try to coordinate on stopping/delaying AI progress, or trying to create a singleton AI. That's why I was asking you for details of what you think the social solutions would look like.

so some of the claims (like “the urgent part is in the motivation subproblem”) are not meant to quantify over the problems you’re identifying

I see, in that case I would appreciate disclaimers or clearer ways of stating that, so that people who might want to work on these problems are not discouraged from doing so more strongly than you intend.

Ok, I appreciate that.

comment by rohinmshah · 2018-11-20T19:31:09.621Z · LW(p) · GW(p)

Why isn't that also an argument against the urgency of solving AI motivation? I.e., we don't need to urgently solve AI motivation because humans will be able to coordinate to stop or delay AI development long enough to solve AI motivation at leisure?

Two reasons come to mind:

• Stopping or delaying AI development feels more like trying to interfere with an already-running process, whereas there are no existing norms on what we use AI for that we would have to fight against, and debates on those norms are already beginning. For new things, I expect the public to be particularly risk-averse.
• Relatedly, it is a lot easier to make norms/laws/regulations now that bind our future selves. On an individual level, it seems easier to delay your chance of going to Mars if you know you're going to get a hovercar soon. On a societal scale, it seems easier to delay space colonization if we're going to have lives of leisure due to automation, or to delay full automation if we're soon going to get 4 hour workdays. Looking at the things governments and corporations say, it seems like they would be likely to do things like this. I think it makes a lot of sense to try and direct these efforts at the right target.

I want to emphasize though that my method here was having an intuition and querying for reasons behind the intuition. I would be a little surprised if someone could convince me my intuition is wrong in ~half an hour of conversation. I would not be surprised if someone could convince me that my reasons are wrong in ~half an hour of conversation.

It seems to me that coordination is really hard. Yes we have to push on that, but we also have to push on potential technical solutions because most likely coordination will fail, and there is enough uncertainty about the difficulty of technical solutions that I think we urgently need more people to investigate the problems to see how hard they really are.

I think it would help me if you suggested some ways that technical solutions could help with these problems. For example, with coordinating to prevent/delay corrupting technologies, the fundamental problem to me seems to be that with any technical solution, the thing that the AI does will be against the operator's wishes-upon-reflection. (If your technical solution is in line with the operator's wishes-upon-reflection, then I think you could also solve the problem by solving motivation.) This seems both hard to design (where does the AI get the information about what to do, if not from the operator's wishes-upon-reflection?) as well as hard to implement (why would the operator use a system that's going to do something they don't want?).

You might argue that there are things that the operator would want if they could get them (e.g. global coordination), but they can't achieve them now, and so we need a technical solution for that. However, it seems like we are in the same position as a well-motivated AI w.r.t. that operator. For example, if we try to cede control to FairBots that rationally cooperate with each other, a well-motivated AI could also do that.

Aside from that, I think it's also really important to better predict/understand just how difficult solving those problems is (both socially and technically) because that understanding is highly relevant to strategic decisions we have to make today. For example, if those problems are very difficult to solve so that in expectation we end up losing most of the potential value of the universe even if we solve AI motivation, then that greatly reduces the value of working on motivation relative to something like producing evidence of the difficulty of those problems in order to convince policymakers to try to coordinate on stopping/delaying AI progress, or trying to create a singleton AI. That's why I was asking you for details of what you think the social solutions would look like.

Agreed. I view a lot of strategy research (eg. from FHI and OpenAI) as figuring this out from the social side, and some of my optimism is based on conversations with those researchers. On the technical side, I feel quite stuck (for the reasons above), though I haven't tried hard enough to say that it's too difficult to do.

I see, in that case I would appreciate disclaimers or clearer ways of stating that, so that people who might want to work on these problems are not discouraged from doing so more strongly than you intend.

I'll keep that in mind. When I wrote the original comment, I wasn't even thinking about problems like the ones you mention, because I categorize them as "strategy" by default, and I was trying to talk about the technical problem.

comment by Wei_Dai · 2018-11-21T10:14:41.255Z · LW(p) · GW(p)

Stopping or delaying AI development feels more like trying to interfere with an already-running process, whereas there are no existing norms on what we use AI for that we would have to fight against, and debates on those norms are already beginning. For new things, I expect the public to be particularly risk-averse.

Do you think that at the time when AI development wasn't an already-running process, and AI was still a new thing that the public could be expected to be risk-averse about (when would you say that was?), the argument "working on alignment isn't urgent because humans can probably coordinate to stop AI development" would have been a good one?

Relatedly, it is a lot easier to make norms/laws/regulations now that bind our future selves.

Same question here. Back when "don't develop AI" was still binding on our future selves, should we have expected that we would coordinate to stop AI development, and is it just bad luck that we haven't succeeded in doing that?

Looking at the things governments and corporations say, it seems like they would be likely to do things like this.

Can you be more specific? What global agreement do you think would be reached, that is both realistic and would solve the kinds of problems that I'm worried about (e.g., unintentional corruption of humans by "aligned" AIs who give humans too much power or options that they can't handle, and deliberate manipulation of humans by unaligned AIs or AIs aligned to other users)?

I think it would help me if you suggested some ways that technical solutions could help with these problems.

For example, create an AI that can help the user with philosophical questions at least as much as technical questions. (This could be done for example by figuring out how to better use Iterated Amplification to answer philosophical questions, or how to do imitation learning of human philosophers, or how to apply inverse reinforcement learning to philosophical reasoning.) Then the user could ask questions like "Am I likely to be corrupted by access to this technology? What can I do to prevent that while still taking advantage of it?" Or "Is this just an extremely persuasive attempt at manipulation or an actually good moral argument?"

As another example, solve metaethics and build that into the AI so that the AI can figure out or learn the actual terminal values of the user, which would make it easier to protect the user from manipulation and self-corruption. And even if the human user is corrupted, the AI still has the correct utility function, and when it has made enough technological progress it can uncorrupt the human.

I view a lot of strategy research (eg. from FHI and OpenAI) as figuring this out from the social side, and some of my optimism is based on conversations with those researchers.

Can you point me to any relevant results that have been written down, or explain what you learned from those conversations?

On the technical side, I feel quite stuck (for the reasons above), though I haven’t tried hard enough to say that it’s too difficult to do.

To address this and the question (from the parallel thread) of whether you should personally work on this, I think we need people to either solve the technical problems or at least to collectively try hard enough to convincingly say that it's too difficult to do. (Otherwise who is going to convince policymakers to adopt the very costly social solutions? Who is going to convince people to start/join a social movement to influence policymakers to consider those costly social solutions? The fact that those things tend to take a lot of time seems like sufficient reason for urgency on the technical side, even if you expect the social solutions to be feasible.) Who are these people going to be, especially the first ones to join the field and help grow it? Probably existing AI alignment researchers, right? (I can probably make stronger arguments in this direction but I don't want to be too "pushy" so I'll stop here.)

comment by Wei_Dai · 2018-11-20T15:41:35.824Z · LW(p) · GW(p)

I forgot to follow up on this important part of our discussion:

All of these problems that you're talking about would also apply to technology that could make a human smarter.

It seems to me that a technology that could make a human smarter is much more likely (compared to AI) to accelerate all forms of intellectual progress (e.g., technological progress and philosophical/moral progress) about equally, and therefore would have a less significant effect on the kinds of problems that I'm talking about (which are largely caused by technological progress outpacing philosophical/moral progress). I could make some arguments about this, but I'm curious if this doesn't seem obvious to you.

Assuming the above, and assuming that one has moral uncertainty that gives some weight to the concept of moral responsibility, it seems to me that an additional argument for AI researchers to work on these problems is that it's a moral responsibility of AI researchers/companies to try to solve problems that they create, for example via technological solutions, or by coordinating amongst themselves, or by convincing policymakers to coordinate, or by funding others to work on these problems, etc., and they are currently neglecting to do this (especially with regard to the particular problems that I'm pointing out).

comment by rohinmshah · 2018-11-20T19:43:51.613Z · LW(p) · GW(p)

It seems to me that a technology that could make a human smarter is much more likely (compared to AI) to accelerate all forms of intellectual progress (e.g., technological progress and philosophical/moral progress) about equally, and therefore would have a less significant effect on the kinds of problems that I'm talking about (which are largely caused by technological progress outpacing philosophical/moral progress).

Yes, I agree with this. The reason I mentioned that was to make the point that the problems are a function of progress in general and aren't specific to AI -- they are just exacerbated by AI. I think this is a weak reason to expect that solutions are likely to come from outside of AI.

Assuming the above, and assuming that one has moral uncertainty that gives some weight to the concept of moral responsibility, it seems to me that an additional argument for AI researchers to work on these problems is that it's a moral responsibility of AI researchers/companies to try to solve problems that they create, for example via technological solutions, or by coordinating amongst themselves, or by convincing policymakers to coordinate, or by funding others to work on these problems, etc., and they are currently neglecting to do this.

This seems true. Just to make sure I'm not misunderstanding, this was meant to be an observation, and not meant to argue that I personally should prioritize this, right?

comment by Wei_Dai · 2018-11-21T10:15:02.923Z · LW(p) · GW(p)

The reason I mentioned that was to make the point that the problems are a function of progress in general and aren’t specific to AI—they are just exacerbated by AI. I think this is a weak reason to expect that solutions are likely to come from outside of AI.

This doesn't make much sense to me. Why is this any kind of reason to expect that solutions are likely to come from outside of AI? Can you give me an analogy where this kind of reasoning more obviously makes sense?

Just to make sure I’m not misunderstanding, this was meant to be an observation, and not meant to argue that I personally should prioritize this, right?

Right, this argument wasn't targeted to you, but I think there are other reasons for you to personally prioritize this. See my comment in the parallel thread.

comment by TurnTrout · 2018-11-18T17:23:32.293Z · LW(p) · GW(p)

It seems to me that "avoid irreversible high-impact actions" would only work if one had a small amount of uncertainty over one's utility function, in which case you could just avoid actions that are considered "irreversible high-impact" by any of the utility functions that you have significant probability mass on. But if you had a large amount of uncertainty, or just have very little idea what your utility function looks like, that doesn't work because almost any action could be "irreversible high-impact".

From the AUP perspective [AF · GW], this only seems true in a way analogous to the statement that "any hypothesis can have arbitrarily long description length". It’s possible to make practically no assumptions about what the true utility function is and still recover a sensible notion of "low impact". That is, penalizing shifts in attainable utility for even random or simple functions still yields the desired behavior; I have experimental results to this effect which aren’t yet published. This suggests that the notion of impact captured by AUP isn’t dependent on realizability of the true utility, and hence the broader thing Rohin is pointing at should be doable.

While it’s true that some complex value loss is likely to occur when not considering an appropriate distribution over extremely complicated utility functions, it seems by-and-large negligible. This is because such loss occurs either as a continuation of the status quo or as a consequence of something objectively mild, which seems to correlate strongly with being mild by reasonable human values.
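To make "penalizing shifts in attainable utility" a little more concrete, here is a minimal toy sketch. It is not the AUP implementation or the unpublished experiments mentioned above. The names `aup_penalty` and `choose_action`, the three actions, and all of the numbers are hypothetical, and the auxiliary "attainable utility" tables are hand-written stand-ins for the random or simple functions the comment describes.

```python
from typing import Dict, Sequence

# Hypothetical toy data: how much of each auxiliary utility the agent could
# still attain after taking each action in some fixed state.
AUX_ATTAINABLE: Sequence[Dict[str, float]] = [
    {"noop": 1.0, "tidy": 0.9, "smash": 0.1},
    {"noop": 0.5, "tidy": 0.6, "smash": 0.0},
]

# Hypothetical primary reward for each action.
REWARD: Dict[str, float] = {"noop": 0.0, "tidy": 1.0, "smash": 1.5}


def aup_penalty(aux: Sequence[Dict[str, float]], action: str, noop: str = "noop") -> float:
    """Average absolute shift in attainable utility relative to doing nothing."""
    return sum(abs(q[action] - q[noop]) for q in aux) / len(aux)


def choose_action(reward: Dict[str, float],
                  aux: Sequence[Dict[str, float]],
                  penalty_weight: float) -> str:
    """Pick the action maximizing primary reward minus the weighted penalty."""
    return max(reward, key=lambda a: reward[a] - penalty_weight * aup_penalty(aux, a))


unpenalized_choice = choose_action(REWARD, AUX_ATTAINABLE, penalty_weight=0.0)
penalized_choice = choose_action(REWARD, AUX_ATTAINABLE, penalty_weight=2.0)
```

In this toy setup the unpenalized agent picks the drastic action, while the penalized agent picks the mild one, even though none of the auxiliary tables is meant to encode the "true" utility function.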

comment by Wei_Dai · 2019-08-22T16:50:00.450Z · LW(p) · GW(p)

Another con of the motivation-competence decomposition: unlike definition-optimization, it doesn't actually seem to be a clean decomposition of the larger task, such that we can solve each subtask independently and then combine the solutions.

For example one way we could solve the motivation problem is by building a perfect human imitation (of someone who really wants to help H do what H wants), but then we seem to be stuck on the "competence" front, and there's no clear way to plug this solution of "motivation" into a better generic solution to "competence" to get a more competent intent-aligned agent. Instead it seems like we have to solve the competence problem that is particular to the specific solution to motivation, or solve motivation and competence together as one large problem.

In contrast, the problem of specifying an aligned utility function and the problem of building safe EU maximizers seem to be naturally independent problems, such that once we have a specification of an aligned utility function (or a method of specifying aligned utility functions), we can just plug that into more and more powerful and robust EU maximizers.
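As a rough illustration of what "plugging in" means here, the following toy sketch treats the utility function as a parameter of a generic expected-utility maximizer. Everything in it (the names `expected_utility` and `eu_maximizer`, the two-action world model, the two utility functions) is a hypothetical example, not a proposal for how a real system would be built.

```python
from typing import Callable, Dict, Iterable

Utility = Callable[[str], float]   # outcome -> utility (the "definition" part)
Model = Dict[str, Dict[str, float]]  # action -> {outcome: probability}


def expected_utility(action: str, model: Model, utility: Utility) -> float:
    """Expected utility of an action under a fixed world model."""
    return sum(p * utility(outcome) for outcome, p in model[action].items())


def eu_maximizer(actions: Iterable[str], model: Model, utility: Utility) -> str:
    """The "optimization" part: it never inspects how `utility` was obtained."""
    return max(actions, key=lambda a: expected_utility(a, model, utility))


# Hypothetical world model shared by both runs below.
MODEL: Model = {
    "safe": {"ok": 1.0},
    "risky": {"great": 0.5, "disaster": 0.5},
}


def cautious(outcome: str) -> float:
    return {"ok": 1.0, "great": 2.0, "disaster": -10.0}[outcome]


def reckless(outcome: str) -> float:
    return {"ok": 1.0, "great": 3.0, "disaster": 0.0}[outcome]


cautious_choice = eu_maximizer(MODEL, MODEL, cautious)
reckless_choice = eu_maximizer(MODEL, MODEL, reckless)
```

The same `eu_maximizer` is reused unchanged with either utility function, which is the sense in which the two subproblems are independent in this decomposition; the human-imitation example has no analogous slot to swap.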

Furthermore I think this lack of clean decomposition shows up at the conceptual level too, not just the pragmatic level. For example, suppose we tried to increase the competence of the human imitation by combining it with a superintelligent Oracle, and it turns out the human imitation isn't very careful and in most timelines destroys the world by asking unsafe questions that cause the Oracle to perform malign optimizations. Is this a failure of motivation or a failure of competence, or both? It seems arguable or hard to say. In contrast, in a system that is built using the definition-optimization decomposition, it seems like it would be easy to trace any safety failures to either the "definition" solution or the "optimization" solution.

comment by rohinmshah · 2019-08-22T18:24:32.827Z · LW(p) · GW(p)

I overall agree that this is a con. Certainly there are AI systems that are weak enough that you can't talk coherently about their "motivation". Probably all deep-learning-based systems fall into this category.

I also agree that (at least for now, and probably in the future as well) you can't formally specify the "type signature" of motivation such that you could separately solve the competence problem without knowing the details of the solution to the motivation problem.

My hope here would be to solve the motivation problem and leave the competence problem for later, since by my view that solves most of the problem (I'm aware that you disagree with this).

I don't agree that it's not clean at the conceptual level. It's perhaps less clean than the definition-optimization decomposition, but not much less.

For example, suppose we tried to increase the competence of the human imitation by combining it with a superintelligent Oracle, and it turns out the human imitation isn't very careful and in most timelines destroys the world by asking unsafe questions that cause the Oracle to perform malign optimizations. Is this a failure of motivation or a failure of competence, or both?

This seems pretty clearly like a failure of competence to me, since the human imitation would (presumably) say that they don't want the world to be destroyed, and they (presumably) did not predict that that was what would happen when they queried the oracle.

comment by Wei_Dai · 2019-08-22T18:51:14.986Z · LW(p) · GW(p)

This seems pretty clearly like a failure of competence to me, since the human imitation would (presumably) say that they don’t want the world to be destroyed, and they (presumably) did not predict that that was what would happen when they queried the oracle.

It also seems like a failure of motivation though, because as soon as the Oracle started to do malign optimization, the system as a whole is no longer trying to do what H wants.

Or is the idea that as long as the top-level or initial optimizer is trying (or tried) to do what H wants, then all subsequent failures of motivation don't count, so we're excluding problems like inner alignment from motivation / intent alignment?

I'm unsure what your answer would be, and what Paul's answer would be, and whether they would be the same, which at least suggests that the concepts haven't been cleanly decomposed yet.

ETA: Or to put it another way, suppose AI safety researchers determined ahead of time what kinds of questions won't cause the Oracle to perform malign optimizations. Would that not count as part of the solution to motivation / intent alignment of this system (i.e., combination of human imitation and Oracle)? It seems really counterintuitive if the answer is "no".

comment by rohinmshah · 2019-08-22T21:33:55.207Z · LW(p) · GW(p)

Oh, I see, you're talking about the system as a whole, whereas I was thinking of the human imitation specifically. That seems like a multiagent system and I wouldn't apply single-agent reasoning to it, so I agree motivation-competence is not the right way to think about it (but if you insisted on it, I'd say it fails motivation, mostly because the system doesn't really have a single "motivation").

It doesn't seem like the definition-optimization decomposition helps either? I don't know whether I'd call that a failure of definition or optimization.

Or to put it another way, suppose AI safety researchers determined ahead of time what kinds of questions won't cause the Oracle to perform malign optimizations. Would that not count as part of the solution to motivation / intent alignment of this system (i.e., combination of human imitation and Oracle)?

I would say the human imitation was intent aligned, and this helped improve the competence of the human imitation. I mostly wouldn't apply this framework to the system (and I also wouldn't apply definition-optimization to the system).

comment by Wei_Dai · 2019-08-22T23:54:13.770Z · LW(p) · GW(p)

That seems like a multiagent system and I wouldn’t apply single-agent reasoning to it, so I agree motivation-competence is not the right way to think about it

This was an unexpected answer. Isn't HCH also such a multiagent system? (It seems very similar to what I described: a human with access to a superhuman Oracle, although HCH wasn't what I initially had in mind.) IDA should converge to HCH in the limit of infinite compute and training data, so this would seem to imply that the motivation-competence framework doesn't apply to IDA either. I'm pretty sure Paul would give a different answer, if we ask him about "intent alignment".

It doesn’t seem like the definition-optimization decomposition helps either? I don’t know whether I’d call that a failure of definition or optimization.

It seems more obvious that multiagent systems just fall outside of the definition-optimization framework, which seems to be a point in its favor as far as conceptual clarity is concerned.

comment by paulfchristiano · 2019-08-23T02:28:45.605Z · LW(p) · GW(p)

Yes, I'd say that to the extent that "trying to do X" is a useful concept, it applies to systems with lots of agents just as well as it applies to one agent.

Even a very theoretically simple system like AIXI doesn't seem to be "trying" to do just one thing, in the sense that it can e.g. exert considerable optimization power at things other than reward, even in cases where the system seems to "know" that its actions won't lead to reward.

You could say that AIXI is "optimizing" the right thing and just messing up when it suffers inner alignment failures, but I'm not convinced that this division is actually doing much useful work. I think it's meaningful to say "defining what we want is useful," but beyond that it doesn't seem like a workable way to actually analyze the hard parts of alignment or divide up the problem.

(For example, I think we can likely get OK definitions of what we value, along the lines of A Formalization of Indirect Normativity, but I've mostly stopped working along these lines because it no longer seems directly useful.)

It seems more obvious that multiagent systems just fall outside of the definition-optimization framework, which seems to be a point in its favor as far as conceptual clarity is concerned.

I agree.

Of course, it also seems quite likely that AIs of the kind that will probably be built ("by default") also fall outside of the definition-optimization framework. So adopting this framework as a way to analyze potential aligned AIs seems to amount to narrowing the space considerably.

comment by Wei_Dai · 2019-08-23T18:50:32.926Z · LW(p) · GW(p)

Yes, I’d say that to the extent that “trying to do X” is a useful concept, it applies to systems with lots of agents just as well as it applies to one agent.

So how do you see it applying in my example? Would you say that the system in my example is both trying to do what H wants it to do, and also trying to do something that H doesn't want? Is it intent aligned period, or intent aligned at some points in time and not at others, or simultaneously intent aligned and not aligned, or something else? (I feel like we've had a similar discussion before and either it didn't get resolved or I didn't understand your position. I didn't see a direct attempt to answer this in the comment I'm replying to, and it's fine if you don't want to go down this road again but I want to convey my continued confusion.)

You could say that AIXI is “optimizing” the right thing and just messing up when it suffers inner alignment failures, but I’m not convinced that this division is actually doing much useful work. I think it’s meaningful to say “defining what we want is useful,” but beyond that it doesn’t seem like a workable way to actually analyze the hard parts of alignment or divide up the problem.

I don't understand how this is connected to what I was saying. (In general I often find it significantly harder to understand your comments compared to say Rohin's. Not necessarily saying you should do something differently, as you might already be making a difficult tradeoff between how much time to spend here and elsewhere, but just offering feedback in case you didn't realize.)

Of course, it also seems quite likely that AIs of the kind that will probably be built (“by default”) also fall outside of the definition-optimization framework. So adopting this framework as a way to analyze potential aligned AIs seems to amount to narrowing the space considerably.

This makes sense.

comment by paulfchristiano · 2019-08-23T19:25:57.611Z · LW(p) · GW(p)

Would you say that the system in my example is both trying to do what H wants it to do, and also trying to do something that H doesn't want? Is it intent aligned period, or intent aligned at some points in time and not at others, or simultaneously intent aligned and not aligned, or something else?

The oracle is not aligned when asked questions that cause it to do malign optimization.

The human+oracle system is not aligned in situations where the human would pose such questions.

For a coherent system (e.g. a multiagent system which has converged to a Pareto efficient compromise), it makes sense to talk about the one thing that it is trying to do.

For an incoherent system this abstraction may not make sense, and a system may be trying to do lots of things. I try to use benign when talking about possibly-incoherent systems, or things that don't even resemble optimizers.

The definition in this post is a bit sloppy here, but I'm usually imagining that we are building roughly-coherent AI systems (and that if they are incoherent, some parts are malign). If you wanted to be a bit more careful with the definition, and want to admit vagueness in "what H wants it to do" (such that there can be several different preferences that are "what H wants") we could say something like:

A is aligned with H if everything it is trying to do is "what H wants."

That's not great either though (and I think the original post is more at an appropriate level of attempted-precision).

comment by Wei_Dai · 2019-08-23T22:47:55.205Z · LW(p) · GW(p)

(In the following I will also use "aligned" to mean "intent aligned".)

The human+oracle system is not aligned in situations where the human would pose such questions.

Ok, sounds like "intent aligned at some points in time and not at others" was the closest guess. To confirm, would you endorse "the system was aligned when the human imitation was still trying to figure out what questions to ask the oracle (since the system was still only trying to do what H wants), and then due to its own incompetence became not aligned when the oracle started working on the unsafe question"?

Given that intent alignment in this sense seems to be a property of a system+situation instead of the system itself, how would you define when the "intent alignment problem" has been solved for an AI, or when would you call an AI (such as IDA) itself "intent aligned"? (When we can reasonably expect to keep it out of situations where its alignment fails, for some reasonable amount of time, perhaps?) Or is it the case that whenever you use "intent alignment" you always have some specific situation or set of situations in mind?

comment by rohinmshah · 2019-08-23T21:48:18.301Z · LW(p) · GW(p)

Fwiw having read this exchange, I think I approximately agree with Paul. Going back to the original response to my comment:

Isn't HCH also such a multiagent system?

Yes, I shouldn't have made a categorical statement about multiagent systems. What I should have said was that the particular multiagent system you proposed did not have a single thing it is "trying to do", i.e. I wouldn't say it has a single "motivation". This allows you to say "the system is not intent-aligned", even though you can't say "the system is trying to do X".

Another way of saying this is that it is an incoherent system and so the motivation abstraction / motivation-competence decomposition doesn't make sense, but HCH is one of the few multiagent systems that is coherent. (Idk if I believe that claim, but it seems plausible.) This seems to map on to the statement:

For an incoherent system this abstraction may not make sense, and a system may be trying to do lots of things.

Also, I want to note strong agreement with this:

Of course, it also seems quite likely that AIs of the kind that will probably be built ("by default") also fall outside of the definition-optimization framework. So adopting this framework as a way to analyze potential aligned AIs seems to amount to narrowing the space considerably.

comment by Wei_Dai · 2019-08-23T22:48:12.644Z · LW(p) · GW(p)

Another way of saying this is that it is an incoherent system and so the motivation abstraction / motivation-competence decomposition doesn’t make sense, but HCH is one of the few multiagent systems that is coherent.

HCH can be incoherent. I think one example that came up in an earlier discussion was the top node in HCH trying to help the user by asking (due to incompetence / insufficient understanding of corrigibility) "What is a good approximation of the user's utility function?" followed by "What action would maximize EU according to this utility function?"

ETA: If this isn't clearly incoherent, imagine that due to further incompetence, lower nodes work on subgoals in ways that conflict with each other.

comment by Vanessa Kosoy (vanessa-kosoy) · 2020-01-18T18:46:37.210Z · LW(p) · GW(p)

In this essay Paul Christiano proposes a definition of "AI alignment" which is narrower than other definitions that are often employed. Specifically, Paul suggests defining alignment in terms of the motivation of the agent (which should be helping the user), rather than what the agent actually does. That is, as long as the agent "means well", it is aligned, even if errors in its assumptions about the user's preferences or about the world at large lead it to actions that are bad for the user.

Rohin Shah's comment [AF(p) · GW(p)] on the essay (which I believe is endorsed by Paul) reframes it as a particular way to decompose the AI safety problem. An often used decomposition is "definition-optimization": first we define what it means for an AI to be safe, then we understand how to implement a safe AI. In contrast, Paul's definition of alignment decomposes the AI safety problem as "motivation-competence": first we learn how to design AIs with good motivations, then we learn how to make them competent. Both Paul and Rohin argue that the "motivation" is the urgent part of the problem, the part on which technical AI safety research should focus.

In contrast, I will argue that the "motivation-competence" decomposition is not as useful as Paul and Rohin believe, and the "definition-optimization" decomposition is more useful.

The thesis behind the "motivation-competence" decomposition implicitly assumes a linear, one-dimensional scale of competence. Agents with good motivations and subhuman competence might make silly mistakes but are not catastrophically dangerous (since they are subhuman). Agents with good motivations and superhuman competence will only make mistakes that are "forgivable" in the sense that our own mistakes would be as bad or worse. Ergo (the thesis concludes), good motivations are sufficient to solve AI safety.

However, in reality competence is multi-dimensional. AI systems can have subhuman skills in some domains and superhuman skills in other domains, as AI history has shown time and time again. This opens the possibility of agents that make "well intentioned" mistakes that take the form of sophisticated plans that are catastrophic for the user. Moreover, there might be limits to the agent's knowledge about certain questions (such as the user's preferences) that are inherent in the agent's epistemology (more on this below). Given such limits, the agent's competence becomes systematically lopsided. Furthermore, the elimination of such limits is a large part of the "definition" part in the "definition-optimization" framing that the thesis rejects.

As a consequence of the multi-dimensional nature of competence, the difference between "well intentioned mistake" and "malicious sabotage" is much less clear than naively assumed, and I'm not convinced there is a natural way to remove the ambiguity. For example, consider a superhuman AI Alpha subject to an acausal attack. In this scenario, some agent Beta in the "multiverse" (= prior) convinces Alpha that Alpha exists in a simulation controlled by Beta. The simulation is set up to look like the real Earth for a while, making it a plausible hypothesis. Then, a "treacherous turn" moment arrives in which the simulation diverges from Earth, in a way calculated to make Alpha take irreversible actions that are beneficial for Beta and disastrous for the user.

In the above scenario, is Alpha "motivation-aligned"? We could argue it is not, because it is running the malicious agent Beta. But we could also argue it is motivation-aligned: it just makes the innocent mistake of falling for Beta's trick. Perhaps it is possible to clarify the concept of "motivation" such that in this case, Alpha's motivations are considered bad. But, such a concept would depend in complicated ways on the agent's internals. I think that this is a difficult and unnatural approach, compared to "definition-optimization" where the focus is not on the internals but on what the agent actually does (more on this later).

The possibility of acausal attacks is a symptom of the fact that environments with irreversible transitions are usually not learnable (this is the problem of traps in reinforcement learning, which I discussed for example here [AF · GW] and here [AF(p) · GW(p)]), i.e. it is impossible to guarantee convergence to optimal expected utility without further assumptions. When we add preference learning to the mix, the problem gets worse because now even if there are no irreversible transitions, it is not clear the agent will converge to optimal utility. Indeed, depending on the value learning protocol, there might be uncertainties about the user's preferences that the agent can never resolve (this is an example of what I meant by "inherent limits" before). For example, this happens in CIRL (even if the user is perfectly rational, this happens because the user and the AI have different action sets).

These difficulties with the "motivation-competence" framing are much more natural to handle in the "definition-optimization" framing. Moreover, the latter already produced viable directions for mathematical formalization, and the former has not (AFAIK). Specifically, the mathematical criteria of alignment I proposed are the "dynamic subjective regret bound" [AF(p) · GW(p)] and the "dangerousness bound" [AF(p) · GW(p)]. The former is a criterion which simultaneously guarantees motivation-alignment and competence (as evidence that this criterion can be satisfied, I have the Dialogic Reinforcement Learning [AF(p) · GW(p)] proposal). The latter is a criterion that doesn't guarantee competence in general, but guarantees specifically avoiding catastrophic mistakes. This makes it closer to motivation-alignment compared to subjective regret, but different in important ways: it refers to the actual things that the agent does, and the ways in which these things might have catastrophic consequences.

In summary, I am skeptical that "motivation" and "competence" can be cleanly separated in a way that is useful for AI safety, whereas "definition" and "optimization" can be so separated: for example the dynamic subjective regret bound is a "definition" whereas dialogic RL and putative more concrete implementations thereof are "optimizations". My specific proposals might have fatal flaws that weren't discovered yet, but I believe that the general principle of "definition-optimization" is sound, while "motivation-competence" is not.

comment by rohinmshah · 2020-01-19T00:29:04.522Z · LW(p) · GW(p)
This opens the possibility of agents that make "well intentioned" mistakes that take the form of sophisticated plans that are catastrophic for the user.

Agreed that this is in theory possible, but it would be quite surprising, especially if we are specifically aiming to train systems that behave corrigibly.

In the above scenario, is Alpha "motivation-aligned"?

If Alpha can predict that the user would say not to do the irreversible action, then at the very least it isn't corrigible, and it would be rather hard to argue that it is intent aligned.

But, such a concept would depend in complicated ways on the agent's internals.

That, or it could depend on the agent's counterfactual behavior in other situations. I agree it can't be just the action chosen in the particular state.

Moreover, the latter already produced viable directions for mathematical formalization, and the former has not (AFAIK).

I guess you wouldn't count universality. Overall I agree. I'm relatively pessimistic about mathematical formalization. (Probably not worth debating this point; feels like people have talked about it at length in Realism about rationality [LW(p) · GW(p)] without making much progress.)

it refers to the actual things that the agent does, and the ways in which these things might have catastrophic consequences.

I do want to note that all of these require you to make assumptions of the form, "if there are traps, either the user or the agent already knows about them" and so on, in order to avoid no-free-lunch theorems.

comment by Vanessa Kosoy (vanessa-kosoy) · 2020-01-19T14:18:51.710Z · LW(p) · GW(p)

This opens the possibility of agents that make "well intentioned" mistakes that take the form of sophisticated plans that are catastrophic for the user.

Agreed that this is in theory possible, but it would be quite surprising, especially if we are specifically aiming to train systems that behave corrigibly.

The acausal attack is an example of how it can happen for systematic reasons. As for the other part, that seems like conceding that intent-alignment is insufficient and you need "corrigibility" as another condition (also it is not so clear to me what this condition means).

If Alpha can predict that the user would say not to do the irreversible action, then at the very least it isn't corrigible, and it would be rather hard to argue that it is intent aligned.

It is possible that Alpha cannot predict it, because in Beta-simulation-world the user would confirm the irreversible action. It is also possible that the user would confirm the irreversible action in the real world because the user is being manipulated, and whatever defenses we put in place against manipulation are thrown off by the simulation hypothesis.

Now, I do believe that if you set up the prior correctly then it won't happen, thanks to a mechanism like: Alpha knows that in case of dangerous uncertainty it is safe to fall back on some "neutral" course of action plus query the user (in specific, safe, ways). But this exactly shows that intent-alignment is not enough and you need further assumptions.

Moreover, the latter already produced viable directions for mathematical formalization, and the former has not (AFAIK).

I guess you wouldn't count universality. Overall I agree.

Besides the fact that ascription universality is not formalized, why is it equivalent to intent-alignment? Maybe I'm missing something.

I'm relatively pessimistic about mathematical formalization.

I am curious whether you can specify, as concretely as possible, what type of mathematical result you would have to see in order to significantly update away from this opinion.

I do want to note that all of these require you to make assumptions of the form, "if there are traps, either the user or the agent already knows about them" and so on, in order to avoid no-free-lunch theorems.

No, I make no such assumption. A bound on subjective regret ensures that running the AI is a nearly-optimal strategy from the user's subjective perspective. It is neither needed nor possible to prove that the AI can never enter a trap. For example, the AI is immune to acausal attacks to the extent that the user believes that the AI is not inside Beta's simulation. On the other hand, if the user believes that the simulation hypothesis needs to be taken into account, then the scenario amounts to legitimate acausal bargaining (which has its own complications to do with decision/game theory, but that's mostly a separate concern).
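
As a rough illustration of the point about the user's beliefs, here is a one-shot toy sketch in Python. It is not the dynamic subjective regret bound itself (which concerns a learning protocol over time); it only shows regret being measured against the user's own prior. The hypotheses, actions, utilities, and probabilities are all made up for the example.

```python
# Toy, one-shot illustration of "subjective regret": the gap, measured under
# the user's own prior, between the best available action and the chosen one.
# All hypotheses, actions, and numbers below are hypothetical.

def expected_utility(action, prior, utility):
    """Expected utility of `action` under the user's prior over hypotheses."""
    return sum(p * utility[(action, h)] for h, p in prior.items())


def subjective_regret(action, actions, prior, utility):
    """Best achievable expected utility (under `prior`) minus that of `action`."""
    best = max(expected_utility(a, prior, utility) for a in actions)
    return best - expected_utility(action, prior, utility)


actions = ["ask_user", "irreversible_action"]
utility = {
    ("ask_user", "real_earth"): 0.9,
    ("ask_user", "beta_simulation"): 0.9,
    ("irreversible_action", "real_earth"): 0.0,
    ("irreversible_action", "beta_simulation"): 1.0,
}

# A user who thinks the simulation hypothesis is very unlikely...
skeptical_user = {"real_earth": 0.99, "beta_simulation": 0.01}
# ...and a user who takes it seriously.
believing_user = {"real_earth": 0.01, "beta_simulation": 0.99}

for name, prior in [("skeptical", skeptical_user), ("believing", believing_user)]:
    for a in actions:
        regret = subjective_regret(a, actions, prior, utility)
        print(f"{name:9s} {a:20s} regret={regret:.2f}")
```

Under the skeptical prior the irreversible action has large subjective regret, so a low-regret agent avoids it; under the believing prior the same action has zero regret, which corresponds to the "legitimate acausal bargaining" case.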

comment by rohinmshah · 2020-01-19T21:53:40.846Z · LW(p) · GW(p)
A bound on subjective regret ensures that running the AI is a nearly-optimal strategy from the user's subjective perspective.

Sorry, that's right. Fwiw, I do think subjective regret bounds are significantly better than the thing I meant by definition-optimization.

It is possible that Alpha cannot predict it, because in Beta-simulation-world the user would confirm the irreversible action. It is also possible that the user would confirm the irreversible action in the real world because the user is being manipulated, and whatever defenses we put in place against manipulation are thrown off by the simulation hypothesis.

Why doesn't this also apply to subjective regret bounds?

My guess at your answer is that Alpha wouldn't take the irreversible action as long as the user believes that Alpha is not in Beta-simulation-world. I would amend that to say that Alpha has to know that [the user doesn't believe that Alpha is in Beta-simulation-world]. But if Alpha knows that, then surely Alpha can predict that the user would not confirm the irreversible action?

It seems like for subjective regret bounds, avoiding this scenario depends on your prior already "knowing" that the user thinks that Alpha is not in Beta-simulation-world (perhaps by excluding Beta-simulations). If that's true, you could do the same thing with intent alignment / corrigibility.

Besides the fact that ascription universality is not formalized, why is it equivalent to intent-alignment? Maybe I'm missing something.

It isn't equivalent to intent alignment; but it is meant to be used as part of an argument for safety, though I guess it could be used in definition-optimization too, so never mind.

I am curious whether you can specify, as concretely as possible, what type of mathematical result you would have to see in order to significantly update away from this opinion.

That is hard to say. I would want to have the reaction "oh, if I built that system, I expect it to be safe and competitive". Most existing mathematical results do not seem to be competitive, as they get their guarantees by doing something that involves a search over the entire hypothesis space.

I could also imagine being pretty interested in a mathematical definition of safety that I thought actually captured "safety" without "passing the buck". I think subjective regret bounds and CIRL both make some progress on this, but somewhat "pass the buck" by requiring a well-specified hypothesis space for rewards / beliefs / observation models.

Tbc, I also don't think intent alignment will lead to a mathematical formalization I'm happy with -- it "passes the buck" to the problem of defining what "trying" is, or what "corrigibility" is.

comment by Vanessa Kosoy (vanessa-kosoy) · 2020-01-21T19:11:49.103Z · LW(p) · GW(p)

It is possible that Alpha cannot predict it, because in Beta-simulation-world the user would confirm the irreversible action. It is also possible that the user would confirm the irreversible action in the real world because the user is being manipulated, and whatever defenses we put in place against manipulation are thrown off by the simulation hypothesis.

Why doesn't this also apply to subjective regret bounds?

In order to get a subjective regret bound you need to consider an appropriate prior. The way I expect it to work is, the prior guarantees that some actions are safe in the short-term: for example, doing nothing to the environment and asking only sufficiently quantilized queries from the user (see this [AF · GW] for one toy model of how "safe in the short-term" can be formalized). Therefore, Beta cannot attack with a hypothesis that will force Alpha to act without consulting the user, since that hypothesis would fall outside the prior.
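
For readers unfamiliar with the term, "quantilized" presumably refers to quantilization: rather than picking the single highest-scoring option, sample from a trusted base distribution and choose at random among the top q-fraction. The Python sketch below is a generic quantilizer under that assumption, not the mechanism used in Dialogic RL; the function name, the scoring function, and the Gaussian "queries" are hypothetical stand-ins.

```python
import random


def quantilize(candidates, score, q, rng):
    """Pick uniformly at random from the top q-fraction of `candidates` by `score`.

    `candidates` is assumed to be a sample from some trusted base distribution.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    ranked = sorted(candidates, key=score, reverse=True)
    keep = max(1, int(len(ranked) * q))  # always keep at least one option
    return rng.choice(ranked[:keep])


rng = random.Random(0)
# Stand-in for queries drawn from a base distribution (hypothetical).
queries = [rng.gauss(0.0, 1.0) for _ in range(1000)]
# Stand-in score: here simply the value itself.
chosen = quantilize(queries, score=lambda x: x, q=0.1, rng=rng)
print(chosen)
```

Smaller q optimizes harder; q = 1 reduces to sampling from the base distribution, which is the sense in which a "sufficiently quantilized" choice stays close to behavior already assumed to be safe.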

Now, you can say "with the right prior intent-alignment also works". To which I answer, sure, but first it means that intent-alignment is insufficient in itself, and second the assumptions about the prior are doing all the work. Indeed, we can imagine that the ontology on which the prior is defined includes a "true reward" symbol s.t., by definition, the semantics is whatever the user truly wants. An agent that maximizes expected true reward then can be said to be intent-aligned. If it's doing something bad from the user's perspective, then it is just an "innocent" mistake. But, unless we bake some specific assumptions about the true reward into the prior, such an agent can be anything at all.

Most existing mathematical results do not seem to be competitive, as they get their guarantees by doing something that involves a search over the entire hypothesis space.

This is related to what I call the distinction between "weak" and "strong feasibility". Weak feasibility means algorithms that are polynomial time in the number of states and actions, or the number of hypotheses. Strong feasibility is supposed to be something like, polynomial time in the description length of the hypothesis.

It is true that currently we only have strong feasibility results for relatively simple hypothesis spaces (such as, support vector machines). But, this seems to me just a symptom of advances in heuristics outpacing the theory. I don't see any reason of principle that significantly limits the strong feasibility results we can expect. Indeed, we already have some advances in providing a theoretical basis for deep learning.

However, I specifically don't want to work on strong feasibility results, since there is a significant chance they would lead to breakthroughs in capability. Instead, I prefer studying safety on the weak feasibility level until we understand everything important on this level, and only then trying to extend it to strong feasibility. This creates somewhat of a conundrum where apparently the one thing that can convince you (and other people?) is the thing I don't think should be done soon.

I could also imagine being pretty interested in a mathematical definition of safety that I thought actually captured "safety" without "passing the buck". I think subjective regret bounds and CIRL both make some progress on this, but somewhat "pass the buck" by requiring a well-specified hypothesis space for rewards / beliefs / observation models.

Can you explain what you mean here? I agree that just saying "subjective regret bound" is not enough, we need to understand all the assumptions the prior should satisfy, reflecting considerations such as, what kind of queries can or cannot manipulate the user. Hence the use of quantilization and debate in Dialogic RL, for example.

comment by rohinmshah · 2020-01-21T21:48:44.854Z · LW(p) · GW(p)
To which I answer, sure, but first it means that intent-alignment is insufficient in itself, and second the assumptions about the prior are doing all the work.

I completely agree with this, but isn't this also true of subjective regret bounds / definition-optimization? Like, when you write (emphasis mine)

Therefore, Beta cannot attack with a hypothesis that will force Alpha to act without consulting the user, since that hypothesis would fall outside the prior.

Isn't the assumption about the prior "doing all the work"?

Maybe your point is that there are failure modes that aren't covered by intent alignment, in which case I agree, but also it seems like the OP very explicitly said this in many places. Just picking one sentence (emphasis mine):

An aligned AI would try to figure out which thing is right, and like a human it may or may not succeed.

I don't see any reason of principle that significantly limits the strong feasibility results we can expect.

And meanwhile I think very messy real world domains almost always limit strong feasibility results. To the extent that you want your algorithms to do vision or NLP, I think strong feasibility results will have to talk about the environment; it seems quite infeasible to do this with the real world.

That said, most of this belief comes from the fact that empirically it seems like theory often breaks down when it hits the real world. The abstract argument is an attempt to explain it; but I wouldn't have much faith in the abstract argument by itself (which is trying to quantify over all possible ways of getting a strong feasibility result).

However, I specifically don't want to work on strong feasibility results, since there is a significant chance they would lead to breakthroughs in capability.

Idk, you could have a nondisclosure-by-default policy if you were worried about this. Maybe this can't work for you though. (As an aside, I hope this is what MIRI is doing, but they probably aren't.)

Can you explain what you mean here?

Basically what you said right after:

I agree that just saying "subjective regret bound" is not enough, we need to understand all the assumptions the prior should satisfy, reflecting considerations such as, what kind of queries can or cannot manipulate the user.

comment by Vanessa Kosoy (vanessa-kosoy) · 2020-01-24T16:26:55.663Z · LW(p) · GW(p)

...first it means that intent-alignment is insufficient in itself, and second the assumptions about the prior are doing all the work.

I completely agree with this, but isn't this also true of subjective regret bounds / definition-optimization?

The idea is, we will solve the alignment problem by (i) formulating a suitable learning protocol (ii) formalizing a set of assumptions about reality and (iii) proving that under these assumptions, this learning protocol has a reasonable subjective regret bound. So, the role of the subjective regret bound is making sure that what we came up with in i+ii is sufficient, and also guiding the search there. The subjective regret bound does not tell us whether particular assumptions are realistic: for this we need to use common sense and knowledge outside of theoretical computer science (such as: physics, cognitive science, experimental ML research, evolutionary biology...)

Maybe your point is that there are failure modes that aren't covered by intent alignment, in which case I agree, but also it seems like the OP very explicitly said this in many places.

I disagree with the OP that (emphasis mine):

I think that using a broader definition (or the de re reading) would also be defensible, but I like it less because it includes many subproblems that I think (a) are much less urgent, (b) are likely to involve totally different techniques than the urgent part of alignment.

I think that intent alignment is too ill-defined, and to the extent it is well-defined it is a very weak condition, that is not sufficient to address the urgent core of the problem.

And meanwhile I think very messy real world domains almost always limit strong feasibility results. To the extent that you want your algorithms to do vision or NLP, I think strong feasibility results will have to talk about the environment; it seems quite infeasible to do this with the real world.

I don't think strong feasibility results will have to talk about the environment, or rather, they will have to talk about it on a very high level of abstraction. For example, imagine that we prove that stochastic gradient descent on a neural network with particular architecture efficiently agnostically learns any function in some space, such that as the number of neurons grows, this space efficiently approximates any function satisfying some kind of simple and natural "smoothness" condition (an example motivated by already known results). This is a strong feasibility result. We can then debate whether using such a smooth approximation is sufficient for superhuman performance, but establishing this requires different tools, like I said above.

The way I imagine it, AGI theory should ultimately arrive at some class of priors that are on the one hand rich enough to deserve to be called "general" (or, practically speaking, rich enough to produce superhuman agents) and on the other hand narrow enough to allow for efficient algorithms. For example the Solomonoff prior is too rich, whereas a prior that (say) describes everything in terms of an MDP with a small number of states is too narrow. Finding the golden path in between is one of the big open problems.

That said, most of this belief comes from the fact that empirically it seems like theory often breaks down when it hits the real world.

Does it? I am not sure why you have this impression. Certainly there are phenomena in the real world that we don't yet have enough theory to understand, and certainly a given theory will fail in domains where its assumptions are not justified (where "fail" and "justified" can be a matter of degree). And yet, theory obviously played and plays a central role in science, so I don't understand whence the fatalism.

However, I specifically don't want to work on strong feasibility results, since there is a significant chance they would lead to breakthroughs in capability.

Idk, you could have a nondisclosure-by-default policy if you were worried about this. Maybe this can't work for you though.

That seems like it would be an extremely not cost-effective way of making progress. I would invest a lot of time and effort into something that would only be disclosed to the select few, for the sole purpose of convincing them of something (assuming they are even interested to understand it). I imagine that solving AI risk will require collaboration among many people, including sharing ideas and building on other people's ideas, and that's not realistic without publishing. Certainly I am not going to write a Friendly AI on my home laptop :)

comment by rohinmshah · 2020-01-24T19:03:40.298Z · LW(p) · GW(p)
I think that intent alignment is too ill-defined, and to the extent it is well-defined it is a very weak condition, that is not sufficient to address the urgent core of the problem.

Okay, so there seem to be two disagreements:

• How bad is it that intent alignment is ill-defined?
• Is work on intent alignment urgent?

The first one seems primarily about our disagreements on the utility of theory, which I'll get to later.

For the second one, I don't know what your argument is that the non-intent-alignment work is urgent. I agree that the simulation example you give is an example of how flawed epistemology can systematically lead to x-risk. I don't see the argument that it is very likely (maybe the first few AGIs don't think about simulations; maybe it's impossible to construct such a convincing hypothesis). I especially don't see the argument that it is more likely than the failure mode in which a goal-directed AGI is optimizing for something different from what humans want.

(You might respond that intent alignment brings risk down from say 10% to 3%, whereas your agenda brings risk down from 10% to 1%. My response would be that once we have successfully figured out intent alignment to bring risk from 10% to 3%, we can then focus on building a good prior to bring the risk down from 3% to 1%. All numbers here are very made up.)

For example, imagine that we prove that stochastic gradient descent on a neural network with particular architecture efficiently agnostically learns any function in some space, such that as the number of neurons grows, this space efficiently approximates any function satisfying some kind of simple and natural "smoothness" condition (an example motivated by already known results). This is a strong feasibility result.

My guess is that any such result will either require samples exponential in the dimensionality of the input space (prohibitively expensive) or the simple and natural condition won't hold for the vast majority of cases that neural networks have been applied to today.

I don't find smoothness conditions in particular very compelling, because many important functions are not smooth (e.g. most things involving an if condition).
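
To illustrate with one common reading of "smooth" (Lipschitz continuity, which is an assumption here since the condition is left unspecified), the Python sketch below shows that a function containing a single `if` has unbounded difference quotients near its threshold, while a smooth function like sine does not. The function names are hypothetical.

```python
import math


def step(x, threshold=0.0):
    """A function with an `if` condition: discontinuous at `threshold`."""
    return 1.0 if x >= threshold else 0.0


def difference_quotient(f, a, b):
    """|f(b) - f(a)| / |b - a|; bounded by K for every pair iff f is K-Lipschitz."""
    return abs(f(b) - f(a)) / abs(b - a)


for eps in (1e-1, 1e-3, 1e-6):
    print(
        f"eps={eps:g}  step: {difference_quotient(step, -eps, eps):g}"
        f"  sin: {difference_quotient(math.sin, -eps, eps):g}"
    )
```

The quotient for `step` grows like 1/(2·eps) as the two points approach the threshold, so no finite Lipschitz constant works, whereas for `math.sin` it stays at or below 1.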

I am not sure why you have this impression.

Consider this example:

You are a bridge designer. You make the assumption that forces on the bridge will never exceed some value K (necessary because you can't be robust against unbounded forces). You prove your design will never collapse given this assumption. Your bridge collapses anyway because of resonance.

The broader point is that when the environment has lots of complicated interaction effects, and you must make assumptions, it is very hard to find assumptions that actually hold.
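
The bridge example can be made quantitative with the textbook driven, damped harmonic oscillator. The Python sketch below uses the standard steady-state amplitude formula; modelling a bridge as a single oscillator, and every parameter value, is an assumption made for illustration only.

```python
import math


def steady_state_amplitude(force_amplitude, mass, natural_freq, damping_ratio, drive_freq):
    """Steady-state amplitude of x'' + 2*z*w0*x' + w0**2 * x = (F/m) * cos(w*t)."""
    w0, w, z = natural_freq, drive_freq, damping_ratio
    return (force_amplitude / mass) / math.sqrt(
        (w0**2 - w**2) ** 2 + (2 * z * w0 * w) ** 2
    )


K = 1.0                          # the assumed bound on the force amplitude
mass, w0, zeta = 1.0, 2.0, 0.01  # hypothetical, lightly damped structure

# Deflection under a constant force of size K (what a static analysis sees).
static_deflection = K / (mass * w0**2)

slow_drive = steady_state_amplitude(K, mass, w0, zeta, drive_freq=0.1 * w0)
at_resonance = steady_state_amplitude(K, mass, w0, zeta, drive_freq=w0)

print("static deflection:    ", static_deflection)
print("slow periodic forcing:", slow_drive)
print("forcing at resonance: ", at_resonance)
print("amplification factor: ", at_resonance / static_deflection)
```

The force never exceeds K in any of these cases, yet at resonance the response is 1/(2·zeta) times the static deflection (about 50x here), so the assumption "forces never exceed K" holds while the conclusion drawn from it fails.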

And yet, theory obviously played and plays a central role in science, so I don't understand whence the fatalism.

The areas of science in which theory is most central (e.g. physics) don't require assumptions about some complicated stuff; they simply aim to describe observations. It's really the assumptions that make me pessimistic, which is why it would be a significant update if I saw:

a mathematical definition of safety that I thought actually captured "safety" without "passing the buck"

It would similarly update me if you had a piece of code that (perhaps with arbitrary amounts of compute) could take in an AI system and output "safe" or "unsafe", and I would trust that output. (I'd expect that a mathematical definition could be turned into such a piece of code if it doesn't "pass the buck".)

You might respond that intent alignment requires assumptions too, which I agree with, but the theory-based approach requires you to limit your assumptions to things that can be written down in math (e.g. this function is K-Lipschitz) whereas a non-theory-based approach can use "handwavy" assumptions (e.g. a human thinking for a day is safe), which drastically opens up the space of options and makes it more likely that you can find an assumption that is actually mostly true.

That seems like it would be an extremely not cost-effective way of making progress.

Yeah, I broadly agree; I mostly don't understand MIRI's position and thought you might share it, but it seems you don't. I agree that overall it's a tough problem. My personal position would be to do it publicly anyway; it seems way better to have an approach to AI that we understand than the current approach, even if it shortens timelines. (Consider the unilateralist curse; but also consider that other people do agree with me, if not the people at MIRI / LessWrong.)

comment by Vanessa Kosoy (vanessa-kosoy) · 2020-01-25T19:05:05.990Z · LW(p) · GW(p)

For the second one, I don't know what your argument is that the non-intent-alignment work is urgent. I agree that the simulation example you give is an example of how flawed epistemology can systematically lead to x-risk. I don't see the argument that it is very likely.

First, even working on unlikely risks can be urgent, if the risk is great and the time needed to solve it might be long enough compared to the timeline until the risk. Second, I think this example shows that it is far from straightforward to even informally define what intent-alignment is. Hence, I am skeptical about the usefulness of intent-alignment.

For a more "mundane" example, take IRL. Is IRL intent aligned? What if its assumptions about human behavior are inadequate and it ends up inferring an entirely wrong reward function? Is it still intent-aligned since it is trying to do what the user wants, it is just wrong about what the user wants? Where is the line between "being wrong about what the user wants" and optimizing something completely unrelated to what the user wants?

It seems like intent-alignment depends on our interpretation of what the algorithm does, rather than only on the algorithm itself. But actual safety is not a matter of interpretation, at least not in this sense.

For example, imagine that we prove that stochastic gradient descent on a neural network with particular architecture efficiently agnostically learns any function in some space, such that as the number of neurons grows, this space efficiently approximates any function satisfying some kind of simple and natural "smoothness" condition (an example motivated by already known results). This is a strong feasibility result.

My guess is that any such result will either require samples exponential in the dimensionality of the input space (prohibitively expensive) or the simple and natural condition won't hold for the vast majority of cases that neural networks have been applied to today.

I don't know why you think so, but at least this is a good crux since it seems entirely falsifiable. In any case, exponential sample complexity definitely doesn't count as "strong feasibility".

I don't find smoothness conditions in particular very compelling, because many important functions are not smooth (e.g. most things involving an if condition).

Smoothness is just an example, it is not necessarily the final answer. But also, in classification problems smoothness usually translates to a margin requirement (the classes have to be separated with sufficient distance). So, in some sense smoothness allows for "if conditions" as long as you're not too sensitive to the threshold.
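One way to make the smoothness-to-margin link concrete, as a sketch under assumptions not stated in the thread: suppose the classifier is the sign of a K-Lipschitz score function f, and that every labeled point is scored with confidence at least γ > 0 (f and γ are illustrative symbols introduced here). Then for a point x of the positive class and a point y of the negative class:

$$
f(x) \ge \gamma,\quad f(y) \le -\gamma
\;\Longrightarrow\;
2\gamma \;\le\; f(x) - f(y) \;\le\; K\,\lVert x - y \rVert
\;\Longrightarrow\;
\lVert x - y \rVert \;\ge\; \frac{2\gamma}{K}.
$$

So under these assumptions a hard threshold is compatible with smoothness as long as points of opposite classes stay at least 2γ/K apart.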

You are a bridge designer. You make the assumption that forces on the bridge will never exceed some value K (necessary because you can't be robust against unbounded forces). You prove your design will never collapse given this assumption. Your bridge collapses anyway because of resonance.

I don't understand this example. If the bridge can never collapse as long as the outside forces don't exceed K, then resonance is covered as well (as long as it is produced by forces below K). Maybe you meant that the outside forces are also assumed to be stationary.

The broader point is that when the environment has lots of complicated interaction effects, and you must make assumptions, it is very hard to find assumptions that actually hold.

Nevertheless most engineering projects make heavy use of theory. I don't understand why you think that AGI must be different?

The issue of assumptions in strong feasibility is equivalent to the question of, whether powerful agents require highly informed priors. If you need complex assumptions then effectively you have a highly informed prior, whereas if your prior is uninformed then it corresponds to simple assumptions. I think that Hanson (for example) believes that it is indeed necessary to have a highly informed prior, which is why powerful AI algorithms will be complex (since they have to encode this prior) and progress in AI will be slow (since the prior needs to be manually constructed brick by brick). I find this scenario unlikely (for example because humans successfully solve tasks far outside the ancestral environment, so they can't be relying on genetically built-in priors that much), but not ruled out.

However, I assumed that your position is not Hansonian: correct me if I'm wrong, but I assumed that you believed deep learning or something similar is likely to lead to AGI relatively soon. Even if not, you were skeptical about strong feasibility results even for deep learning, regardless of hypothetical future AI technology. But, it doesn't look like deep learning relies on highly informed priors. What we have is, relatively simple algorithms that can, with relatively small (or even no) adaptations solve problems in completely different domains (image processing, audio processing, NLP, playing many very different games, protein folding...) So, how is it possible that all of these domains have some highly complex property that they share, and that is somehow encoded in the deep learning algorithm?

It's really the assumptions that make me pessimistic, which is why it would be a significant update if I saw a mathematical definition of safety that I thought actually captured "safety" without "passing the buck"

I'm curious whether proving a weakly feasible subjective regret bound under assumptions that you agree are otherwise realistic qualifies or not?

...but the theory-based approach requires you to limit your assumptions to things that can be written down in math (e.g. this function is K-Lipschitz) whereas a non-theory-based approach can use "handwavy" assumptions (e.g. a human thinking for a day is safe), which drastically opens up the space of options and makes it more likely that you can find an assumption that is actually mostly true.

I can quite easily imagine how "human thinking for a day is safe" can be a mathematical assumption. In general, which assumptions are formalizable depends on the ontology of your mathematical model (that is, which real-world concepts correspond to the "atomic" ingredients of your model). The choice of ontology is part of drawing the line between what you want your mathematical theory to prove and what you want to bring in as outside assumptions. Like I said before, this line definitely has to be drawn somewhere, but it doesn't at all follow that the entire approach is useless.

comment by rohinmshah · 2020-01-26T21:26:05.695Z · LW(p) · GW(p)

First, even working on unlikely risks can be urgent, if the risk is great and the time needed to solve it might be long enough compared to the timeline until the risk.

Okay. What's the argument that the risk is great (I assume this means "very bad" and not "very likely" since by hypothesis it is unlikely), or that we need a lot of time to solve it?

Second, I think this example shows that it is far from straightforward to even informally define what intent-alignment is.

I agree with this; I don't think this is one of our cruxes. (I do think that in most cases, if we have all the information about the situation, it will be fairly clear whether something is intent aligned or not, but certainly there are situations in which it's ambiguous. I think corrigibility is better-informally-defined, though still there will be ambiguous situations.)

Is IRL intent aligned?

Depends on the details, but the way you describe it, no, it isn't. (Though I can see the fuzziness here.) I think it is especially clear that it is not corrigible.

It seems like intent-alignment depends on our interpretation of what the algorithm does, rather than only on the algorithm itself. But actual safety is not a matter of interpretation, at least not in this sense.

Yup, I agree (with the caveat that it doesn't have to be a human's interpretation). Nonetheless, an interpretation of what the algorithm does can give you a lot of evidence about whether or not something is actually safe.

If the bridge can never collapse as long as the outside forces don't exceed K, then resonance is covered as well (as long as it is produced by forces below K).

I meant that K was set considering wind forces, cars, etc. and was set too low to account for resonance, because you didn't think about resonance beforehand.

(I guess resonance doesn't involve large forces, it involves coordinated forces. The point is just that it seems very plausible that someone might design a theoretical model of the environment in which the bridge is safe, but that model neglects to include resonance because the designer didn't think of it.)

Nevertheless most engineering projects make heavy use of theory.

I'm not denying that? I'm not arguing against theory in general; I'm arguing against theoretical safety guarantees. I think in practice our confidence in safety often comes from empirical tests.

I'm curious whether proving a weakly feasible subjective regret bound under assumptions that you agree are otherwise realistic qualifies or not?

Probably? Honestly, I don't think you even need to prove the subjective regret bound; if you wrote down assumptions that I agree are realistic and capture safety (such that you could write code that determines whether or not an AI system is safe) that alone would qualify. It would be fine if it sometimes said things are unsafe when they are safe, as long as it isn't too conservative; a weak feasibility result would help show that it isn't too conservative.

I can quite easily imagine how "human thinking for a day is safe" can be a mathematical assumption.

Agreed, but if you want to eventually talk about neural nets so that you are talking about the AI system you are actually building, you need to use the neural net ontology, and then "human thinking for a day" is not something you can express.

comment by Vanessa Kosoy (vanessa-kosoy) · 2020-02-01T16:02:29.492Z · LW(p) · GW(p)

Okay. What's the argument that the risk is great (I assume this means "very bad" and not "very likely" since by hypothesis it is unlikely), or that we need a lot of time to solve it?

The reasons the risk is great are standard arguments, so I am a little confused why you ask about this. The setup effectively allows a superintelligent malicious agent (Beta) access to our universe, which can result in extreme optimization of our universe towards inhuman values and tremendous loss of value-according-to-humans. The reason we need a lot of time to solve it is simply that (i) it doesn't seem to be an instance of some standard problem type which we have standard tools to solve and (ii) some people have been thinking about these questions for a while now and have not come up with an easy solution.

It seems like intent-alignment depends on our interpretation of what the algorithm does, rather than only on the algorithm itself. But actual safety is not a matter of interpretation, at least not in this sense.

Yup, I agree (with the caveat that it doesn't have to be a human's interpretation). Nonetheless, an interpretation of what the algorithm does can give you a lot of evidence about whether or not something is actually safe.

Then, I don't understand why you believe that work on anything other than intent-alignment is much less urgent?

The point is just that it seems very plausible that someone might design a theoretical model of the environment in which the bridge is safe, but that model neglects to include resonance because the designer didn't think of it.

"Resonance" is not something you need to explicitly include in your model, it is just a consequence of the equations of motion for an oscillator. This is actually an important lesson about why we need theory: to construct a useful theoretical model you don't need to know all possible failure modes, you only need a reasonable set of assumptions.

I think in practice our confidence in safety often comes from empirical tests.

I think that in practice our confidence in safety comes from a combination of theory and empirical tests. And, the higher the stakes and the more unusual the endeavor, the more theory you need. If you're doing something low stakes or something very similar to things that have been tried many times before, you can rely on trial and error. But if you're sending a spaceship to Mars (or making a superintelligent AI), trial and error is too expensive. Yes, you will test the modules on Earth in conditions as similar to the real environment as you can (respectively, you will do experiments with narrow AI). But ultimately, you need theoretical knowledge to know what can be safely inferred from these experiments. Without theory you cannot extrapolate.

I can quite easily imagine how "human thinking for a day is safe" can be a mathematical assumption.

Agreed, but if you want to eventually talk about neural nets so that you are talking about the AI system you are actually building, you need to use the neural net ontology, and then "human thinking for a day" is not something you can express.

I disagree. For example, suppose that we have a theorem saying that an ANN with particular architecture and learning algorithm can learn any function inside some space F with given accuracy. And, suppose that "human thinking for a day" is represented by a mathematical function that we assume to be inside F and that we assume to be "safe" in some formal sense (for example, it computes an action that doesn't lose much long-term value). Then, your model can prove that imitation learning applied to human thinking for a day is safe. Of course, this example is trivial (modulo the theorem about ANNs), but for more complex settings we can get results that are non-trivial.

comment by rohinmshah · 2020-02-02T08:55:52.162Z · LW(p) · GW(p)

Sorry, I meant: what are the reasons that the risk is greater than the risk from a failure of intent alignment? The question was meant to be compared to the counterfactual of work on intent alignment, since the underlying disagreement is about comparing work on intent alignment to other AI safety work. Similarly for the question about why it might take a long time to solve.

Then, I don't understand why you believe that work on anything other than intent-alignment is much less urgent?

I'm claiming that intent alignment captures a large proportion of possible failure modes, that seem particularly amenable to a solution.

Imagine that a fair coin was going to be flipped 21 times, and you need to say whether there were more heads than tails. By default you see nothing, but you could try to build two machines:

1. Machine A is easy to build but not very robust; it reports the outcome of each coin flip but has a 1% chance of error for each coin flip.

2. Machine B is hard to build but very robust; it reports the outcome of each coin flip perfectly. However, you only have a 50% chance of building it by the time you need it.

In this situation, machine A is a much better plan.

(The example is meant to illustrate the phenomenon by which you might want to choose a riskier but easier-to-create option; it's not meant to properly model intent alignment vs. other stuff on other axes.)
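The arithmetic behind the coin-flip example can be checked with a short sketch. It rests on two assumptions that the example leaves implicit: with Machine A you answer with the majority of its reported flips, and if Machine B is not ready in time you fall back to a blind 50/50 guess. The function names are illustrative.

```python
from math import comb

N_FLIPS = 21
ERROR_RATE = 0.01   # Machine A's per-flip chance of misreporting
P_BUILD_B = 0.5     # chance that Machine B is ready in time


def binom_pmf(n: int, k: int, p: float) -> float:
    """P(Binomial(n, p) = k)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def p_correct_machine_a(n: int = N_FLIPS, eps: float = ERROR_RATE) -> float:
    """Chance that the majority of A's reports matches the true majority."""
    total = 0.0
    for heads in range(n + 1):                      # true number of heads
        p_heads = binom_pmf(n, heads, 0.5)
        for kept in range(heads + 1):               # heads reported as heads
            for flipped in range(n - heads + 1):    # tails misreported as heads
                reported = kept + flipped
                if (reported > n // 2) == (heads > n // 2):
                    total += (
                        p_heads
                        * binom_pmf(heads, kept, 1 - eps)
                        * binom_pmf(n - heads, flipped, eps)
                    )
    return total


def p_correct_machine_b(p_build: float = P_BUILD_B) -> float:
    """Perfect answer if B gets built, otherwise a blind 50/50 guess."""
    return p_build * 1.0 + (1 - p_build) * 0.5


p_a = p_correct_machine_a()
p_b = p_correct_machine_b()
print(f"Plan A: {p_a:.3f}  Plan B: {p_b:.3f}")
```

Under these assumptions plan B is right 75% of the time, while plan A only fails when misreports flip a close majority, which happens rarely.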

This is actually an important lesson about why we need theory: to construct a useful theoretical model you don't need to know all possible failure modes, you only need a reasonable set of assumptions.

I certainly agree with that. My motivation in choosing this example is that empirically we should not be able to prove that bridges are safe w.r.t resonance, because in fact they are not safe and do fall when resonance occurs. (Maybe today bridge-building technology has advanced such that we are able to do such proofs, I don't know, but at least in the past that would not have been the case.)

In this case, we either fail to prove anything, or we make unrealistic assumptions that do not hold in reality and get a proof of safety. Similarly, I think in many cases involving properties about a complex real environment, your two options are 1. don't prove things or 2. prove things with unrealistic assumptions that don't hold.

But if you're sending a spaceship to Mars (or making a superintelligent AI), trial and error is too expensive. [...] Without theory you cannot extrapolate.

I am not suggesting that we throw away all logic and make random edits to lines of code and try them out until we find a safe AI. I am simply saying that our things-that-allow-us-to-extrapolate need not be expressed in math with theorems. I don't build mathematical theories of how to write code, and usually don't prove my code correct; nonetheless I seem to extrapolate quite well to new coding problems.

It also sounds like you're making a normative claim for proofs; I'm more interested in the empirical claim [AF(p) · GW(p)]. (But I might be misreading you here.)

I disagree. For example, [...]

Certainly you can come up with bridging assumptions to bridge between levels of abstraction (in this case the assumption that "human thinking for a day" is within F). I would expect that I would find some bridging assumption implausible in these settings.

comment by Vanessa Kosoy (vanessa-kosoy) · 2020-02-07T14:30:49.775Z · LW(p) · GW(p)

I'm claiming that intent alignment captures a large proportion of possible failure modes, that seem particularly amenable to a solution.

Imagine that a fair coin was going to be flipped 21 times, and you need to say whether there were more heads than tails. By default you see nothing, but you could try to build two machines:

1. Machine A is easy to build but not very robust; it reports the outcome of each coin flip but has a 1% chance of error for each coin flip.
2. Machine B is hard to build but very robust; it reports the outcome of each coin flip perfectly. However, you only have a 50% chance of building it by the time you need it.

In this situation, machine A is a much better plan.

I am struggling to understand how this works in practice. For example, consider dialogic [AF(p) · GW(p)] RL [AF(p) · GW(p)]. It is a scheme intended to solve AI alignment in the strong sense. The intent-alignment thesis seems to say that I should be able to find some proper subset of the features in the scheme which is sufficient for alignment in practice. I can approximately list the set of features as:

2. Natural language annotation
3. Quantilization of questions
4. Debate over annotations
5. Dealing with no user answer
6. Dealing with inconsistent user answers
7. Dealing with changing user beliefs
8. Dealing with changing user preferences
9. Self-reference in user beliefs
10. Quantilization of computations (to combat non-Cartesian daemons, this is not in the original proposal)
11. Reverse questions
12. Translation of counterfactuals from user frame to AI frame

EDIT: 14. Confidence threshold for risky actions

Which of these features are necessary for intent-alignment and which are only necessary for strong alignment? I can't tell.

I certainly agree with that. My motivation in choosing this example is that empirically we should not be able to prove that bridges are safe w.r.t resonance, because in fact they are not safe and do fall when resonance occurs.

I am not an expert but I expect that bridges are constructed so that they don't enter high-amplitude resonance in the relevant range of frequencies (which is an example of using assumptions in our models that need independent validation). We want bridges that don't fall, don't we?

I don't build mathematical theories of how to write code, and usually don't prove my code correct

On the other hand, I use mathematical models to write code for applications all the time, with some success I daresay. I guess that different experience produces different intuitions.

It also sounds like you're making a normative claim for proofs; I'm more interested in the empirical claim.

I am making both claims to some degree. I can imagine a universe in which the empirical claim is true, and I consider it plausible (but far from certain) that we live in such a universe. But, even just understanding whether we live in such a universe requires building a mathematical theory.

comment by rohinmshah · 2020-02-07T18:22:20.104Z · LW(p) · GW(p)

Which of these features are necessary for intent-alignment and which are only necessary for strong alignment?

As far as I can tell, 2, 3, 4, and 10 are proposed implementations, not features. (E.g. the feature corresponding to 3 is "doesn't manipulate the user" or something like that.) I'm not sure what 9, 11 and 13 are about. For the others, I'd say they're all features that an intent-aligned AI should have; just not in literally all possible situations. But the implementation you want is something that aims for intent alignment; then because the AI is intent aligned it should have features 1, 5, 6, 7, 8. Maybe feature 12 is one I think is not covered by intent alignment, but is important to have.

I am not an expert but I expect that bridges are constructed so that they don't enter high-amplitude resonance in the relevant range of frequencies (which is an example of using assumptions in our models that need independent validation).

This is probably true now that we know about resonance (because bridges have fallen down due to resonance); I was asking you to take the perspective where you haven't yet seen a bridge fall down from resonance, and so you don't think about it.

On the other hand, I use mathematical models to write code for applications all the time, with some success I daresay. I guess that different experience produces different intuitions.

Maybe I'm falling prey to the typical mind fallacy, but I really doubt that you use mathematical models to write code in the way that I mean, and I suspect you instead misunderstood what I meant.

Like, if I asked you to write code to check if an element is present in an array, do you prove theorems? I certainly expect that you have an intuitive model of how your programming language of choice works, and that model informs the code that you write, but it seems wrong to me to describe what I do, what all of my students do, and what I expect you do as using a "mathematical theory of how to write code".

But, even just understanding whether we live in such a universe requires building a mathematical theory.

I'm curious what you think doesn't require building a mathematical theory? It seems to me that predicting whether or not we are doomed if we don't have a proof of safety is the sort of thing the AI safety community has done a lot of without a mathematical theory. (Like, that's how I interpret the rocket alignment and security mindset posts.)

comment by Vanessa Kosoy (vanessa-kosoy) · 2020-02-08T13:47:45.795Z · LW(p) · GW(p)

As far as I can tell, 2, 3, 4, and 10 are proposed implementations, not features. (E.g. the feature corresponding to 3 is "doesn't manipulate the user" or something like that.) I'm not sure what 9, 11 and 13 are about. For the others, I'd say they're all features that an intent-aligned AI should have; just not in literally all possible situations. But the implementation you want is something that aims for intent alignment; then because the AI is intent aligned it should have features 1, 5, 6, 7, 8. Maybe feature 12 is one I think is not covered by intent alignment, but is important to have.

Hmm. I appreciate the effort, but I don't understand this answer. Maybe discussing this point further is not productive in this format.

I am not an expert but I expect that bridges are constructed so that they don't enter high-amplitude resonance in the relevant range of frequencies (which is an example of using assumptions in our models that need independent validation).

This is probably true now that we know about resonance (because bridges have fallen down due to resonance); I was asking you to take the perspective where you haven't yet seen a bridge fall down from resonance, and so you don't think about it.

Yes, and in that perspective, the mathematical model can tell me about resonance. It's actually incredibly easy: resonance appears already in simple harmonic oscillators. Moreover, even if I did not explicitly understand resonance, if I proved that the bridge is stable under certain assumptions about the magnitudes and spacetime spectrum of external forces, that automatically guarantees that resonance will not crash the bridge (as long as the assumptions are realistic). Obviously people have not been so cautious over history, but that doesn't mean we should be careless about AGI as well.
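A minimal sketch of how resonance falls out of the oscillator equations: for a damped, driven oscillator m·x'' + c·x' + k·x = F0·cos(ωt), the standard steady-state amplitude is F0 / sqrt((k − mω²)² + (cω)²). The parameter values below are illustrative assumptions, not data about any bridge.

```python
import math


def steady_state_amplitude(omega: float, f0: float = 1.0, m: float = 1.0,
                           k: float = 1.0, c: float = 0.01) -> float:
    """Steady-state amplitude of m*x'' + c*x' + k*x = f0*cos(omega*t)."""
    return f0 / math.sqrt((k - m * omega**2) ** 2 + (c * omega) ** 2)


natural = math.sqrt(1.0 / 1.0)              # sqrt(k/m) for the default parameters
static = steady_state_amplitude(0.0)        # force of size f0 applied slowly
resonant = steady_state_amplitude(natural)  # same force size, driven at sqrt(k/m)
print(f"static: {static:.2f}, at resonance: {resonant:.2f}")
```

The force magnitude is identical in both calls; only the driving frequency changes. That is why a bound on force magnitude alone does not bound the response, while an assumption that also constrains the frequency content of the forces does.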

I understand the argument that sometimes creating and analyzing a realistic mathematical model is difficult. I agree that under time pressure it might be better to compromise on a combination of unrealistic mathematical models, empirical data and informal reasoning. But I don't understand why we should give up so soon. We can work towards realistic mathematical models and prepare fallbacks, and even if we don't arrive at a realistic mathematical model it is likely that the effort will produce valuable insights.

Maybe I'm falling prey to the typical mind fallacy, but I really doubt that you use mathematical models to write code in the way that I mean, and I suspect you instead misunderstood what I meant.

Like, if I asked you to write code to check if an element is present in an array, do you prove theorems? I certainly expect that you have an intuitive model of how your programming language of choice works, and that model informs the code that you write, but it seems wrong to me to describe what I do, what all of my students do, and what I expect you do as using a "mathematical theory of how to write code".

First, if I am asked to check whether an element is in an array, or some other easy manipulation of data structures, I obviously don't literally start proving a theorem with pencil and paper. However, my not-fully-formal reasoning is such that I could prove a theorem if I wanted to. My model is not exactly "intuitive": I could explicitly explain every step. And, this is exactly how all of mathematics works! Mathematicians don't write proofs that are machine verifiable (some people do that today, but it's a novel and tiny fraction of mathematics). They write proofs that are good enough so that all the informal steps can be easily made formal by anyone with reasonable background in the field (but actually doing that would be very labor intensive).
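As an illustration of reasoning that is informal but could be made formal, here is a sketch of the array-membership check with its loop invariant written out as comments. The function name `contains` is hypothetical, and the comments are one possible proof outline rather than a machine-checked proof.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def contains(items: Sequence[T], target: T) -> bool:
    """Return True iff target occurs somewhere in items."""
    for i, item in enumerate(items):
        # Invariant: target does not occur in items[:i].
        if item == target:
            # items[i] == target, so target occurs in items.
            return True
    # The loop visited every index without returning, so the invariant
    # extends to items[:len(items)]: target occurs nowhere in items.
    return False


print(contains([3, 1, 4], 4), contains([3, 1, 4], 5))
```

Each comment is a step one could expand into a formal argument by induction on `i`, which is the sense in which every step can be explained explicitly.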

Second, what I actually meant is examples like, I am using an algorithm to solve a system of linear equations, or find the maximal matching in a graph, or find a rotation matrix that minimizes the sum of square distances between two sets, because I have a proof that this algorithm works (or, in some cases, a proof that it at least produces the right answer when it converges). Moreover, this applies to problems that explicitly involve the physical world as well, such as Kalman filters or control loops.

Of course, in the latter case we need to make some assumptions about the physical world in order to prove anything. It's true that in applications the assumptions are often false, and we merely hope that they are good enough approximations. But, when the extra effort is justified, we can do better: we can perform a mathematical analysis of how much the violation of these assumptions affects the result. Then, we can use outside knowledge to verify that the violations are within the permissible margin.

Third, we could also literally prove machine-verifiable theorems about the code. This is called formal verification, and people do that sometimes when the stakes are high (as they definitely are with AGI), although in this case I have no personal experience. But, this is just a "side benefit" of what I was talking about. We need the mathematical theory to know that our algorithms are safe. Formal verification "merely" tells us that the implementation doesn't have bugs (which is something we should definitely worry about too, when it becomes relevant).

I'm curious what you think doesn't require building a mathematical theory? It seems to me that predicting whether or not we are doomed if we don't have a proof of safety is the sort of thing the AI safety community has done a lot of without a mathematical theory. (Like, that's how I interpret the rocket alignment and security mindset posts.)

I'm not sure about the scope of your question? I made a sandwich this morning without building mathematical theory :) I think that the AI safety community definitely produced some important arguments about AI risk, and these arguments are valid evidence. But, I consider most of the big questions to be far from settled, and I don't see how they could be settled only with this kind of reasoning.

comment by rohinmshah · 2020-02-09T01:50:26.162Z · LW(p) · GW(p)

But ultimately, you need theoretical knowledge to know what can be safely inferred from these experiments. Without theory you cannot extrapolate.

I'm struggling to understand what you mean by "theory" here, and the programming example was trying to get at that, but not very successfully. So let's take the sandwich example:

I made a sandwich this morning without building mathematical theory :)

Presumably the ingredients were in a slightly different configuration than you had ever seen them before, but you were still able to "extrapolate" to figure out how to make a sandwich anyway. Why didn't you need theory for that extrapolation?

Obviously this is a silly example, but I don't currently see any qualitative difference between sandwich-making-extrapolation, and the sort of extrapolation we do when we make qualitative arguments about AI risk. Why trust the former but not the latter? One answer is that the latter is more complex, but you seem to be arguing something else.

comment by Vanessa Kosoy (vanessa-kosoy) · 2020-02-14T18:58:04.267Z · LW(p) · GW(p)

I decided that the answer deserves its own post [AF · GW].

comment by Wei_Dai · 2019-08-11T18:11:24.311Z · LW(p) · GW(p)

I do think that some term needs to refer to this problem, to separate it from other problems like “understanding what humans want,” “solving philosophy,” etc.

Worth noting here that (it looks like) Paul eventually settled upon "intent alignment [LW · GW]" as the term for this.

comment by rohinmshah · 2020-01-04T19:51:19.661Z · LW(p) · GW(p)

I hadn't realized this post was nominated, partially because of my comment [LW(p) · GW(p)], so here's a late review. I basically continue to agree with everything I wrote then, and I continue to like this post for those reasons, and so I support including it in the LW Review.

Since writing the comment, I've come across another argument for thinking about intent alignment -- it seems like a "generalization" of assistance games / CIRL, which itself seems like a formalization of an aligned agent in a toy setting. In assistance games, the agent explicitly maintains a distribution over possible human reward functions, and instrumentally gathers information about human preferences by interacting with the human. With intent alignment, since the agent is trying to help the human, we expect the agent to instrumentally maintain a belief over what the human cares about, and gather information to refine this belief. We might hope that there are ways to achieve intent alignment that instrumentally incentivizes all the nice behaviors of assistance games, without requiring the modeling assumptions that CIRL does (e.g. that the human has a fixed known reward function).
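The belief-maintenance idea can be sketched in a few lines. This is a toy illustration, not the CIRL algorithm: the two reward hypotheses, the Boltzmann-rational model of the human, and the `beta` parameter are all assumptions made up for the example.

```python
import math
from typing import Dict, List

# Hypothetical candidate reward functions: each maps an action to a reward.
CANDIDATE_REWARDS: Dict[str, Dict[str, float]] = {
    "likes_tea": {"tea": 1.0, "coffee": 0.0},
    "likes_coffee": {"tea": 0.0, "coffee": 1.0},
}


def choice_likelihood(action: str, reward: Dict[str, float], beta: float = 2.0) -> float:
    """P(human picks action | reward), assuming a Boltzmann-rational human."""
    weights = {a: math.exp(beta * r) for a, r in reward.items()}
    return weights[action] / sum(weights.values())


def update_belief(belief: Dict[str, float], observed_action: str) -> Dict[str, float]:
    """One Bayes update of the agent's belief over reward hypotheses."""
    unnormalized = {
        name: prior * choice_likelihood(observed_action, CANDIDATE_REWARDS[name])
        for name, prior in belief.items()
    }
    total = sum(unnormalized.values())
    return {name: weight / total for name, weight in unnormalized.items()}


belief = {"likes_tea": 0.5, "likes_coffee": 0.5}
observations: List[str] = ["tea", "tea", "coffee", "tea"]
for action in observations:
    belief = update_belief(belief, action)
print(belief)
```

The agent starts uncertain and shifts its belief toward the hypothesis that best explains the observed choices, while a single contrary choice does not overturn the rest of the evidence.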

Changes I'd make to my comment:

It isolates the major, urgent difficulty in a single subproblem. If we make an AI system that tries to do what we want, it could certainly make mistakes, but it seems much less likely to cause e.g. human extinction.

I still think that the intent alignment / motivation problem is the most urgent, but there are certainly other problems that matter as well, so I would probably remove or clarify that point.

comment by Wei_Dai · 2018-11-16T05:10:01.483Z · LW(p) · GW(p)

I think that using a broader definition (or the de re reading) would also be defensible, but I like it less because it includes many subproblems that I think (a) are much less urgent, (b) are likely to involve totally different techniques than the urgent part of alignment.

I think it would be helpful for understanding your position and what you mean by "AI alignment" to have a list or summary of those other subproblems and why you think they're much less urgent. Can you link to or give one here?

Also, do you have a preferred term for the broader definition, or the de re reading? What should we call those things if not "AI alignment"?

comment by paulfchristiano · 2018-11-18T20:55:38.059Z · LW(p) · GW(p)

I think it would be helpful for understanding your position and what you mean by "AI alignment" to have a list or summary of those other subproblems and why you think they're much less urgent. Can you link to or give one here?

Other problems related to alignment, which would be included by the broadest definition of "everything related to making the future good."

• We face a bunch of problems other than AI alignment (e.g. other destructive technologies, risk of value drift), and depending on the competencies of our AI systems they may be better or worse than humans at helping handle those problems (relative to accelerating the kinds of progress that force us to confront those problems). So we'd like AI to be better at (helping us with) {diplomacy, reflection, institution design, philosophy...} relative to {physical technology, social manipulation, logistics...}
• Beyond alignment, AI may provide new advantages to actors who are able to make their values more explicit, or who have explicit norms for bargaining/aggregation, and so we may want to figure out how to make more things more explicit.
• AI could facilitate social control, manipulation, or lock-in, which may make it more important for us to have more robust or rapid forms of deliberation (that are robust to control/manipulation, or that can run their course fast enough to prevent someone from making a mistake). This also may increase the incentives for ordinary conflict amongst actors with differing long-term values.
• AI will tend to empower groups with few people (but lots of resources), making it easier for someone to destroy the world and so requiring stronger enforcement/stabilization.
• AI may be an unusually good opportunity for world stabilization, e.g. because it's associated with a disruptive transition, in which case someone may want to take that opportunity. (Though I'm concerned about this because, in light of disagreement/conflict about stabilization itself, someone attempting to do this or being expected to attempt to do this could undermine our ability to solve alignment.)

That's a very partial list. This is for the broadest definition of "everything about AI that is relevant to making the future good," which I don't think is particularly defensible. I'd say the first three could be included in defensible definitions of alignment, and there are plenty of others.

My basic position on most of these problems is: "they are fine problems and you might want to work on them, but if someone is going to claim they are important they need to give a separate argument; it's not at all implied by the normal argument for the importance of alignment." I can explain in particular cases why I think other problems are less important, and I feel like we've had a lot of back and forth on some of these, but the only general argument is that I think there are strong reasons to care about alignment in particular that don't extend to these other problems (namely, a failure to solve alignment has predictable really bad consequences in the short term, and currently it looks very tractable in expectation).

Also, do you have a preferred term for the broader definition, or the de re reading? What should we call those things if not "AI alignment"?

Which broader definition? There are tons of possibilities. I think the one given in this post is the closest to a coherent definition that matches existing usage.

The other common definition seems to be more along the lines of "everything related to making AI go well," which I don't think really deserves a word--just call that "AI trajectory change" if you want to distinguish it from "AI speedup", or "pro-social AI" if you want to distinguish from "AI as an intellectual curiosity," or just "AI" if you don't care about those distinctions.

For the de re reading, I don't see much motive to lump the competence and alignment parts of the problem under a single heading; I would just call them "alignment" and "value learning" separately. But I can see how this might seem like a value judgment, since someone who thought that these two problems were the very most important problems might want to put them under a single heading even if they didn't think there would be particular technical overlap.

(ETA: I'd also be OK with saying "de dicto alignment" or "de re alignment," since they really are just importantly different concepts, both of which are used relatively frequently. There is a big difference between an employee who de dicto wants the same things their boss wants, and an employee who de re wants to help their boss get what they want; those feel like two species of alignment.)

comment by shminux · 2018-11-15T16:53:34.245Z · LW(p) · GW(p)

Is there a concept of a safe partially aligned AI? One that recognizes the limits of its own understanding of the human[-ity] and, with high probability, limits its actions to what it knows is within those limits?

comment by Ben Pace (Benito) · 2019-12-02T18:56:51.840Z · LW(p) · GW(p)

Nominating this primarily for Rohin’s comment on the post, which was very illuminating.

comment by rohinmshah · 2019-12-02T04:18:32.019Z · LW(p) · GW(p)

Crystallized my view of what the "core problem" is (as I explained in a comment on this post). I think I had intuitions of this form before, but at the very least this post clarified them.

comment by leggi · 2020-01-02T16:20:04.929Z · LW(p) · GW(p)

I'm not tech-savvy and am well aware that maybe it's a lack of understanding that lets me live without fear of AI, but it seems an important issue round here and I would like to have some understanding. And a little understanding of my perspective - I grew up in the shadow of the Cold War, i.e. mutually assured destruction in 6 minutes or less (it might have been 12 minutes - I can't quite remember anymore).

This post caught my eye on the review list.

I need to clarify something before reading forward.

getting your AI to try to do the right thing,

Would 'getting your AI to try to do the WANTED thing' be the accurate wording?

The usage of "right" adds a dimension of morality in my mind that doesn't come with "want".

comment by rohinmshah · 2020-01-04T19:35:48.212Z · LW(p) · GW(p)

Yeah, it's not meant to add that dimension of morality.

Perhaps it should be "getting your AI to try to help you". Trying to do the "wanted" thing is also reasonable.

comment by green_leaf · 2019-04-07T22:04:52.124Z · LW(p) · GW(p)

Are there any plans to generalize this kind of alignment later to include CEV or some other plausible metaethics, or should this be "the final stop"?