What are some exercises for building/generating intuitions about key disagreements in AI alignment?

post by riceissa · 2020-03-16T07:41:58.775Z · score: 17 (6 votes) · LW · GW · No comments

This is a question post.



I am interested in having my own opinion about more of the key disagreements [LW · GW] within the AI alignment field, such as whether there is a basin of attraction for corrigibility, whether there is a theory of rationality that is sufficiently precise to build hierarchies of abstraction [LW(p) · GW(p)], and to what extent there will be a competence gap.

In "Is That Your True Rejection?" [LW · GW], Eliezer Yudkowsky wrote:

I suspect that, in general, if two rationalists set out to resolve a disagreement that persisted past the first exchange, they should expect to find that the true sources of the disagreement are either hard to communicate, or hard to expose. E.g.:

  • Uncommon, but well-supported, scientific knowledge or math;
  • Long inferential distances;
  • Hard-to-verbalize intuitions, perhaps stemming from specific visualizations;
  • Zeitgeists inherited from a profession (that may have good reason for it);
  • Patterns perceptually recognized from experience;
  • Sheer habits of thought;
  • Emotional commitments to believing in a particular outcome;
  • Fear that a past mistake could be disproved;
  • Deep self-deception for the sake of pride or other personal benefits.

I am assuming that something like this is happening in the key disagreements in AI alignment. The last three bullet points are somewhat uncharitable to proponents of a particular view, and also seem less likely to me. Summarizing the first six bullet points, I want to say something like: some combination of "innate intuitions" and "life experiences" led e.g. Eliezer and Paul Christiano to arrive at different opinions. I want to go through a useful subset of the "life experiences" part, so that I can share some of the same intuitions.

To that end, my question is something like: What fields should I learn? What textbooks/textbook chapters/papers/articles should I read? What historical examples (from history of AI/ML or from the world at large) should I spend time thinking about? (The more specific the resource, the better.) What intuitions should I expect to build by going through this resource? In the question title I am using the word "exercise" pretty broadly.

If you believe one just needs to be born with one set of intuitions rather than another, and that there are no resources I can consume to refine my intuitions, then my question is instead more like: How can I better introspect so as to find out which side I am on :)?

Some ideas I am aware of:


answer by romeostevensit · 2020-03-16T08:06:50.423Z · score: 4 (2 votes) · LW(p) · GW(p)

If you want to investigate the intuitions themselves, e.g. what is generating the differing intuitions between researchers, I'd pay attention to which metaphors are being used in the reference class tennis as you read the existing debates.

comment by riceissa · 2020-03-16T23:52:35.883Z · score: 3 (2 votes) · LW(p) · GW(p)

I have only a very vague idea of what you mean. Could you give an example of how one would do this?
