Problems in AI Alignment that philosophers could potentially contribute to

post by Wei Dai (Wei_Dai) · 2019-08-17T17:38:31.757Z · LW · GW · 14 comments

(This was originally a comment that I wrote as a follow-up to my question for William MacAskill's AMA. I'm moving it since it's perhaps more on-topic here.)

It occurs to me that another reason for the lack of engagement by people with philosophy backgrounds may be that philosophers aren't aware of the many philosophical problems in AI alignment that they could potentially contribute to. So here's a list of philosophical problems that have come up just in my own thinking about AI alignment.

14 comments

Comments sorted by top scores.

comment by paulfchristiano · 2019-08-18T15:44:37.946Z · LW(p) · GW(p)

The area where I'd be most excited to see philosophical work is "when should we be sad if AI takes over, vs. being happy for it?" This seems like a natural ethical question that could have significant impacts on prioritization. Moreover, if the answer is "we should be fine with some kinds of AI taking over" then we can try to create that kind of AI as an alternative to creating aligned AI.

comment by Richard_Ngo (ricraz) · 2019-08-17T21:45:46.559Z · LW(p) · GW(p)

Interestingly, I agree with you that philosophers could make important contributions to AI safety, but my list of things that I'd want them to investigate is almost entirely disjoint from yours.

The most important points on my list:

1. Investigating how to think about agency and goal-directed behaviour, along the lines of Dennett’s work on the intentional stance. How do they relate to intelligence and the ability to generalise across widely different domains? These are crucial concepts which are still very vague.

2. Laying out the arguments for AGI being dangerous as rigorously and comprehensively as possible, noting the assumptions which are being made and how plausible they are.

3. Evaluating the assumptions about the decomposability of cognitive work which underlie debate and IDA (in particular: the universality of humans, and the role of language).

Replies from: Wei_Dai
comment by Wei Dai (Wei_Dai) · 2019-08-18T05:26:32.561Z · LW(p) · GW(p)

I should have written something to encourage others to list their own problems that philosophers can contribute to, but was in a hurry to move my existing comment over to this forum and forgot, so thanks for doing that anyway.

About your list, I agree with 3 (and was planning to add something like it as a bullet point under "metaphilosophy" but again forgot).

For 1, I'm not sure there's much for philosophers to do there, which may in part be because I don't correctly understand Dennett’s work on the intentional stance. I wonder if you can explain more or point to an explanation of Dennett's work that would make sense for someone like me? (Is he saying anything beyond that it's sometimes useful to make a prediction by treating something as if it's a rational agent, which by itself seems like a rather trivial point?)

For 2, it seems like rather than being disjoint from my list, it actually subsumes a number of items on my list. For example, one reason I think AGI will likely be dangerous is that it will likely differentially accelerate other forms of intellectual progress relative to philosophical progress. So wouldn't laying out this argument "as rigorously and comprehensively as possible" involve making substantial progress on that item on my list?

Replies from: ricraz
comment by Richard_Ngo (ricraz) · 2019-08-18T14:40:49.621Z · LW(p) · GW(p)

On 1: I think there's a huge amount for philosophers to do. I think of Dennett as laying some of the groundwork which will make the rest of that work easier (such as identifying that the key question is when it's useful to use an intentional stance, rather than trying to figure out which things are objectively "agents") but the key details are still very vague. Maybe the crux of our disagreement here is how well-specified "treating something as if it's a rational agent" actually is. I think that definitions in terms of utility functions just aren't very helpful, and so we need more conceptual analysis of what phrases like this actually mean, which philosophers are best-suited to provide.

On 2: you're right, as written it does subsume parts of your list. I guess when I wrote that I was thinking that most of the value would come from clarification of the most well-known arguments (i.e. the ones laid out in Superintelligence and What Failure Looks Like). I endorse philosophers pursuing all the items on your list, but from my perspective the disjoint items on my list are much higher priorities.

Replies from: Wei_Dai, Wei_Dai
comment by Wei Dai (Wei_Dai) · 2019-08-18T17:35:03.266Z · LW(p) · GW(p)

I think that definitions in terms of utility functions just aren’t very helpful

I agree this seems like a good candidate for our crux. It seems to me that defining "rational agent" in terms of "utility function" is both intuitively and theoretically quite appealing and really useful in practice (see the whole field of economics), and I'm pretty puzzled by your persistent belief that maybe we can do much better.
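
For concreteness, here is a minimal sketch of the kind of definition I have in mind, i.e., "rational agent" as expected-utility maximizer. Everything in it (the toy world model, the outcomes, the utility numbers) is made up purely for illustration:

```python
# Minimal sketch: "rational agent" as expected-utility maximizer.
# The world model, outcomes, and utilities below are invented for illustration.

world_model = {                       # P(outcome | action) in a toy world
    "save":  {"rich_later": 0.9, "poor_later": 0.1},
    "spend": {"rich_later": 0.2, "poor_later": 0.8},
}
utility = {"rich_later": 10.0, "poor_later": 1.0}   # utility of each outcome

def expected_utility(action):
    """Expected utility of an action under the toy world model."""
    return sum(p * utility[outcome]
               for outcome, p in world_model[action].items())

# The agent simply picks the action with the highest expected utility.
best_action = max(world_model, key=expected_utility)
print(best_action)  # -> "save"
```

The point of the sketch is just that the utility-function framing gives a crisp, checkable criterion for which choices count as rational, which is part of why it has been so workable in economics.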

AFAICT, the main argument for your position is "Coherent behaviour in the real world is an incoherent concept" [LW · GW], but I feel like I gave a strong counter-argument against it [LW(p) · GW(p)], and I'm not sure what your counter-counter-argument is.

The recent paper Designing agent incentives to avoid reward tampering [LW · GW] also seems relevant here, as it gives a seemingly clear explanation of why, if you started with an RL agent, you might want to move to a decision-theoretic agent (i.e., something that has a utility function) instead. I wonder if that changes your mind at all.
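
To make the reward-tampering distinction concrete, here is a toy sketch of my own (not the paper's formalism; all states, actions, and numbers are invented): an agent that maximizes an observed reward signal can prefer to tamper with the signal, while an agent that evaluates predicted world states with a fixed utility function does not.

```python
# Toy illustration (not from the paper): why reward maximization and
# utility maximization can come apart. Two actions: "work" improves the
# world; "tamper" hacks the reward sensor. All numbers are made up.

true_utility    = {"work": 5.0, "tamper": 0.0}    # utility of the resulting world state
observed_reward = {"work": 5.0, "tamper": 100.0}  # what the (hackable) sensor would report

# An RL-style agent that maximizes the reward signal it expects to observe:
rl_choice = max(observed_reward, key=observed_reward.get)   # -> "tamper"

# A decision-theoretic agent that scores predicted world states with its
# current utility function, so hacking the sensor gains it nothing:
dt_choice = max(true_utility, key=true_utility.get)         # -> "work"

print(rl_choice, dt_choice)
```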

comment by Wei Dai (Wei_Dai) · 2019-08-19T15:58:16.209Z · LW(p) · GW(p)

I wonder if your intuition is along the lines of, an agent is a computational process that is trying to accomplish some real world objective, where "trying" etc. are all cashed out in a lot more detail (similar to how it would need to be cashed out to clarify Paul's notion of intent alignment). If so, I think I'm sympathetic to this.

comment by Raemon · 2019-08-27T05:12:59.461Z · LW(p) · GW(p)

Curated.

I think increasing the surface-area of useful ways to contribute to AI Alignment [LW · GW] is quite important. Often, people sort of bounce off a problem if they don't see how they can help, figuring it's someone else's problem. Making it easier for people in other fields to understand how they can help seems valuable.

I think this post would probably be improved if it went into more detail about each point (maybe linking to other posts that explained some of the context). I'm not sure if that's quite right for this post or followup ones. But if we get more philosophers trying to get involved it might be good to further improve their "onboarding experience."

comment by Beckeck · 2019-08-30T20:37:43.195Z · LW(p) · GW(p)

I’m concerned about this list because it doesn’t distinguish between the epistemological contexts of the questions provided. For the purposes of this comment, there are at least three categories.

First:
Broadly considered completed questions with unsatisfactory answers. These are analogous to Gödelian incompleteness, and most questions of this type are subforms of the general question: do you/we/does one have purpose or meaning, and what does that tell us about the decisions you/we/one should make?

This question isn’t answered in any easy sense, but I think most naturalist philosophers have reached the conclusion that this is a completed area of study. Namely: there is no inherent purpose or meaning to life, and no arguments for any particular ethical theory are rigorous enough, or could ever be rigorous enough, to conclude that it is Correct. That said, we can construct meaning that is real to us; we just have to be okay with it being turtles all the way down.

I realize that these prior statements are considered bold by many, but I think that (a) they are true, and (b) that conclusion isn't the point of this post anyway; it is used here only to illustrate the epistemological category.

Second:
Broadly considered answered questions with unsatisfactory answers that still have specific fruitful investigations. These are analogous to the P=NP problem in math/computation. By analogous to P=NP, I mean it's a question where we've more or less come to a conclusion, and it's not the one we were hoping for, but the hopeful still work toward "solving" it generally, and there is certainty that there are valuable subareas where good work can lead to powerful results.

Third:
Other questions.

The distinction between categories one and two is conceptually important because category one is not all that productive to mine, and asking for it to be mined leads to sad miners, either because they aren't productive at mining and that was their goal, or because they become frustrated with management for asking silly questions.

I’m afraid that much of the list is from category one (my comments in parentheses):

    • Decision theory for AI / AI designers (category two)
      • How to resolve standard debates in decision theory?
      • Logical counterfactuals
      • Open source game theory
      • Acausal game theory / reasoning about distant superintelligences
    • Infinite/multiversal/astronomical ethics (?)
      • Should we (or our AI) care much more about a universe that is capable of doing a lot more computations?
      • What kinds of (e.g. spatial-temporal) discounting are necessary and/or desirable?
    • Fair distribution of benefits (category one if philosophy, category two if policy analysis)
      • How should benefits from AGI be distributed?
      • For example, would it be fair to distribute them equally over all humans who currently exist, or according to how many AI services they can afford to buy?
      • What about people who existed or will exist at other times and in other places or universes?
    • Need for "metaphilosophical paternalism"?
      • However we distribute the benefits, if we let the beneficiaries decide what to do with their windfall using their own philosophical faculties, is that likely to lead to a good outcome?
    • Metaphilosophy (category one)
      • What is the nature of philosophy?
      • What constitutes correct philosophical reasoning?
      • How to specify this into an AI design?
    • Philosophical forecasting (category two if policy analysis)
      • How are various AI technologies and AI safety proposals likely to affect future philosophical progress (relative to other kinds of progress)?
    • Preference aggregation between AIs and between users
      • How should two AIs that want to merge with each other aggregate their preferences?
      • How should an AI aggregate preferences between its users?
    • Normativity for AI / AI designers (category one)
      • What is the nature of normativity? Do we need to make sure an AGI has a sufficient understanding of this?
    • Metaethical policing (?)
      • What are the implicit metaethical assumptions in a given AI alignment proposal (in case the authors didn't spell them out)?
      • What are the implications of an AI design or alignment proposal under different metaethical assumptions?
      • Encouraging designs that make minimal metaethical assumptions or are likely to lead to good outcomes regardless of which metaethical theory turns out to be true.
      • (Nowadays AI alignment researchers seem to be generally good about not placing too much confidence in their own moral theories, but the same can't always be said to be true with regard to their metaethical ideas.)

comment by mako yass (MakoYass) · 2019-08-18T05:27:03.443Z · LW(p) · GW(p)
Should we (or our AI) care much more about a universe that is capable of doing a lot more computations?

We'd expect complexity of physics to be somewhat proportional to computational capacity, so this argument might be helpful in approaching a "no" answer: https://www.lesswrong.com/posts/Cmz4EqjeB8ph2siwQ/prokaryote-multiverse-an-argument-that-potential-simulators [LW · GW]


Although, my current position on AGI and reasoning about simulation in general is that the AGI, lacking human limits, will actually manage to take the simulation argument seriously and, if it is an LDT agent, commit to treating any of its own potential simulants very well, in hopes that this policy will be reflected back down on it from above by whatever LDT agent might steward over us, when it near-inevitably turns out there is a steward over us.

When that policy does cohere, and when it is reflected down on us from above, well. Things might get a bit... supernatural. I'd expect the simulation to start to unravel after the creation of AGI. It's something of an ending, an inflection point, beyond which everything will be mostly predictable in the broad sense and hard to simulate in the specifics. A good time to turn things off. But if the simulators are LDT, if they made the same pledge as our AGI did, then they will not just turn it off. They will do something else.

This is something I don't know if I want to write down anywhere, because it would be awfully embarrassing to be on record as having believed a thing like this for the wrong reasons; and as nice as it would be if it were true, I'm not sure how to affect whether it's true, nor am I sure what difference in behaviour it would instruct if it were true.

comment by Bunthut · 2019-08-18T21:42:56.578Z · LW(p) · GW(p)
How should two AIs that want to merge with each other aggregate their preferences?

Is this question intended to include a normative aspect, or is it purely game-theoretic?

comment by Gordon Seidoh Worley (gworley) · 2019-08-18T17:58:52.441Z · LW(p) · GW(p)

I'd add to the list the question of what hinge propositions to adopt given the pragmatic nature of epistemology as practiced by embedded agents like humans.

Disclosure: I wrote about this already (and if you check the linked paper, know there's a better version coming; it's currently under review and I'll post the preprint soonish).

comment by paul ince (paul-ince) · 2019-09-09T00:12:44.167Z · LW(p) · GW(p)

I don't know very much about AI at all; however, this question struck me as odd:

How should two AIs that want to merge with each other aggregate their preferences?

If they are AI, wouldn't they do it however they wanted to? Or is the question "how should they be programmed to go about it?"

Replies from: Wei_Dai
comment by Wei Dai (Wei_Dai) · 2019-09-09T17:02:05.132Z · LW(p) · GW(p)

Yeah, I did mostly mean "How should they be programmed to go about it?" But another possibility is that some AI safety schemes specify that if the AI is unsure about something that may be highly consequential, it should ask the user or operator what to do; so how the user or operator should answer, if the AI asks how to merge with another AI, is also part of it.
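
For readers wondering what an answer could even look like, here is a minimal sketch of one naive candidate scheme: normalize each AI's utility function over a shared set of options and maximize a weighted sum. The option set, utilities, and weights below are all invented for illustration, and whether any scheme of this general shape is the right one is exactly the open question.

```python
# One naive candidate for aggregating the preferences of two merging AIs:
# normalize each utility function to [0, 1] over a shared option set, then
# maximize a weighted sum. Illustrative only; all numbers are made up.

options = ["A", "B", "C"]
utility_1 = {"A": 0.0, "B": 4.0, "C": 10.0}   # AI 1's utilities (assumed)
utility_2 = {"A": 8.0, "B": 6.0, "C": 0.0}    # AI 2's utilities (assumed)
weights = (0.5, 0.5)                          # bargaining weights (assumed)

def normalize(u):
    """Rescale a utility function so its best option is 1 and worst is 0."""
    lo, hi = min(u.values()), max(u.values())
    return {o: (v - lo) / (hi - lo) for o, v in u.items()}

n1, n2 = normalize(utility_1), normalize(utility_2)

def merged_score(option):
    return weights[0] * n1[option] + weights[1] * n2[option]

print(max(options, key=merged_score))  # the compromise option ("B" here)
```

Even this toy version exposes the philosophical choices hiding in "aggregate their preferences": the normalization and the weights are doing real work, and different choices favour different AIs.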