What are the high-level approaches to AI alignment?

post by G Gordon Worley III (gworley) · 2020-06-16T17:10:32.467Z · score: 13 (4 votes) · LW · GW · 2 comments

This is a question post.


    16 riceissa
    6 evhub
    4 G Gordon Worley III

I'm writing a post comparing some high-level approaches to AI alignment in terms of their false positive risk [LW · GW]. Trouble is, there's no standard agreement on what the high-level approaches to AI alignment are today, either in terms of what constitutes each approach or where to draw the lines when categorizing specific proposals.

So, I'll open it up as a question to get some feedback before I get too far along. What do you consider to be the high-level approaches to AI alignment?

(I'll supply my own partial answer below.)


answer by riceissa · 2020-06-17T00:21:06.321Z · score: 16 (6 votes) · LW(p) · GW(p)
comment by G Gordon Worley III (gworley) · 2020-06-17T01:07:16.110Z · score: 4 (2 votes) · LW(p) · GW(p)

Thanks. Your post specifically is pretty helpful because it helps with one of the things that was tripping me up, which is what standard names people use for the different methods. Your names do a better job of capturing them than mine did.

answer by evhub · 2020-06-16T22:00:11.464Z · score: 6 (5 votes) · LW(p) · GW(p)

You might be interested in this post [AF · GW] I wrote recently that goes into significant detail on what I see as the major leading proposals for building safe advanced AI under the current machine learning paradigm.

comment by G Gordon Worley III (gworley) · 2020-06-17T01:03:40.898Z · score: 2 (1 votes) · LW(p) · GW(p)

Actually, this post was not especially helpful for my purpose, and I should have explained why in advance because I anticipated someone would link it. Although it helpfully lays out a number of proposals people have made, it does more to work out what's going on with those proposals than to find ways they can be grouped together (except incidentally). I even reread this post before posting this question and it didn't help me improve on the taxonomy I proposed, which I already had in mind as of a few months ago.

answer by G Gordon Worley III · 2020-06-16T17:16:38.034Z · score: 4 (2 votes) · LW(p) · GW(p)

My initial thought is that there are at least 3, which I'll give the following names (with short explanations):

  • Iterated Distillation and Amplification (IDA)
    • Build an AI, have it interact with a human, create a new AI based on the interaction of the human and the AI, and repeat until the AI is good enough or it reaches a fixed point and additional iterations don't change it.
  • Inverse Reinforcement Learning (IRL)
    • Build an AI that tries to infer human values from observations and then acts based on those inferred values.
  • Decision Theorized Agent (DTA)
    • Build an AI that uses a decision theory that causes it to make choices that will be aligned with human interests.

All of these are woefully underspecified, so improved summaries that you think accurately explain these approaches are also appreciated.
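
To make the IDA bullet above a bit more concrete, here is a toy sketch of the loop it describes: combine the current AI with a human, build a new AI from that combination, and repeat until further rounds stop changing anything. Everything in it is an assumption made for illustration: the names `amplify`, `distill`, and `ida`, the lookup-table "training", and the toy human who can only fix half of the model's error per round. It is not how any actual IDA proposal is implemented.

```python
from typing import Callable, Sequence, Tuple

# A "model" here is just a function from a question (a number) to an answer.
Model = Callable[[float], float]
# A "human" sees the question and the model's answer and returns an improved answer.
Human = Callable[[float, float], float]


def amplify(model: Model, human: Human) -> Model:
    """Hypothetical amplification step: the human improves on the model's answer."""
    return lambda q: human(q, model(q))


def distill(amplified: Model, questions: Sequence[float]) -> Model:
    """Hypothetical distillation step: a lookup table stands in for training."""
    table = {q: amplified(q) for q in questions}
    return lambda q: table[q]


def ida(model: Model, human: Human, questions: Sequence[float],
        tol: float = 1e-6, max_rounds: int = 100) -> Tuple[Model, int]:
    """Repeat amplify-then-distill until the model stops changing (a fixed point)."""
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        new_model = distill(amplify(model, human), questions)
        # Largest change in any answer between the old and new model.
        change = max(abs(new_model(q) - model(q)) for q in questions)
        model = new_model
        if change < tol:
            break
    return model, rounds


# Toy setup (assumed): the "right" answer is 2 * q, and the human can
# only correct half of the remaining error each time they are consulted.
def toy_human(q: float, answer: float) -> float:
    return answer + 0.5 * (2.0 * q - answer)


questions = [1.0, 2.0, 3.0]
final_model, rounds = ida(lambda q: 0.0, toy_human, questions)
print(rounds, [final_model(q) for q in questions])
```

In this toy version the fixed point is reached only because the human's corrections shrink every round; nothing in the sketch speaks to whether a real amplification process converges, or converges to something aligned.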

comment by Richard_Ngo (ricraz) · 2020-06-16T19:15:59.701Z · score: 10 (2 votes) · LW(p) · GW(p)

I suspect that nobody is actually pursuing the third one as you've described it. Rather, my impression is that MIRI researchers tend to think of decision theory as a more fundamental problem in understanding AI, not directly related to human interests.

comment by Ben Pace (Benito) · 2020-06-16T19:19:49.209Z · score: 10 (2 votes) · LW(p) · GW(p)

I think the last one seems odd / doesn't make much sense. All agents have a decision theory, including RL-based agents, so it's not a distinctive approach. 

If you were attempting to describe MIRI's work, remember that they're trying to understand basic concepts of agency better (meta level [LW · GW], object level [LW · GW]), not in order to directly put the new concepts into the AI (in the same way current AIs do not always have a section for the 'probability theory' to be written in) but in order to be much less confused about what we're trying to do.

So if you want to describe MIRI's work, you'd call it "getting less confused about the basic concepts" and then later building an AI via a mechanism we'll figure out after getting less confused. Right now it's not engineering, it's science.

comment by G Gordon Worley III (gworley) · 2020-06-17T01:09:36.457Z · score: 2 (1 votes) · LW(p) · GW(p)

That's true, but there's a natural and historical relationship here with what was in the past termed "seed AI", even if this is not an approach anyone is actively pursuing, which is the kind of thing I was hoping to point at without using that outmoded term.

comment by Rob Bensinger (RobbBB) · 2020-06-17T02:42:19.237Z · score: 5 (3 votes) · LW(p) · GW(p)

I agree with Ben and Richard's summaries; see https://www.lesswrong.com/posts/uKbxi2EJ3KBNRDGpL/comment-on-decision-theory [LW · GW]:

We aren't working on decision theory in order to make sure that AGI systems are decision-theoretic, whatever that would involve. We're working on decision theory because there's a cluster of confusing issues here (e.g., counterfactuals, updatelessness, coordination) that represent a lot of holes or anomalies in our current best understanding of what high-quality reasoning is and how it works.

[...] The idea behind looking at (e.g.) counterfactual reasoning is that counterfactual reasoning is central to what we're talking about when we talk about "AGI," and going into the development process without a decent understanding of what counterfactual reasoning is and how it works means you'll to a significantly greater extent be flying blind when it comes to designing, inspecting, repairing, etc. your system. The goal is to be able to put AGI developers in a position where they can make advance plans and predictions, shoot for narrow design targets, and understand what they're doing well enough to avoid the kinds of kludgey, opaque, non-modular, etc. approaches that aren't really compatible with how secure or robust software is developed.

Nate's way of articulating it:

"The reason why I care about logical uncertainty and decision theory problems is something more like this: The whole AI problem can be thought of as a particular logical uncertainty problem, namely, the problem of taking a certain function f : Q → R and finding an input that makes the output large. To see this, let f be the function that takes the AI agent’s next action (encoded in Q) and determines how 'good' the universe is if the agent takes that action. The reason we need a principled theory of logical uncertainty is so that we can do function optimization, and the reason we need a principled decision theory is so we can pick the right version of the 'if the AI system takes that action...' function."

The work you use to get to AGI presumably won't look like probability theory, but it's still the case that you're building a system to do probabilistic reasoning, and understanding what probabilistic reasoning is is likely to be very valuable for doing that without relying on brute force and trial-and-error.

[...] Eliezer adds: "I do also remark that there are multiple fixpoints in decision theory. CDT does not evolve into FDT but into a weirder system Son-of-CDT. So, as with utility functions, there are bits we want that the AI does not necessarily generate from self-improvement or local competence gains."

When I think about key distinctions and branching points in alignment, I usually think about things like:

  • Does the approach require human modeling? Lots of risks can be avoided if the system doesn't do human modeling, or if it only does small amounts of human modeling; but this constrains the options for value learning and learning-in-general.
  • Current ML is notoriously opaque. Different approaches try to achieve greater understanding and inspectability to different degrees and in different ways (e.g., embedded agency vs. MIRI's "new research directions" vs. the kind of work OpenAI Clarity does [LW · GW]), or try to achieve alignment without needing to crack open the black box.
  • Is the goal to make a task-directed AGI system, vs. an open-ended optimizer? When you say "there's a natural and historical relationship here with what was in the past termed 'seed AI', even if this is not an approach anyone is actively pursuing", it calls to mind for me the transition from MIRI thinking about open-ended optimizers to instead treating task AGI as the place to start.

comment by Ben Pace (Benito) · 2020-06-17T01:16:06.037Z · score: 4 (2 votes) · LW(p) · GW(p)

I'm not actually sure what you mean. I think 'seed AI' means something like 'first case in an iterative/recursive process' of self-improvement, which applies pretty well to the iterated amplification setup (which is a recursively self-improving AI) and lots of other examples that Evan wrote about in his 11-examples post. It still seems to me to be a pretty general term.

comment by G Gordon Worley III (gworley) · 2020-06-17T01:15:35.993Z · score: 2 (1 votes) · LW(p) · GW(p)

Based on comments/links so far it seems I should revise the names and add a fourth:

  • IDA = IDA
  • IRL -> Ambitious Value Learning (AVL)
  • DTA -> Embedded Agency (EA)
  • + Brain Emulation (BE)
    • Build AI that either emulates how human brains work or is bootstrapped from human brain emulations.


Comments sorted by top scores.

comment by steve2152 · 2020-06-16T23:30:34.356Z · score: 4 (2 votes) · LW(p) · GW(p)

Sorta related is my appendix to this article [LW · GW].

comment by G Gordon Worley III (gworley) · 2020-06-17T01:11:51.170Z · score: 2 (1 votes) · LW(p) · GW(p)

Oh, I forgot about emulation approaches, i.e. bootstrap AI by "copying" human brains, which you mention. Thanks!