Metalignment: Deconfusing metaethics for AI alignment.

post by Guillaume Corlouer (Tancrede) · 2019-08-23T10:25:38.756Z · LW · GW · 7 comments

Contents

  What do I mean by metaethics? 
  Why could this be important? 
  Some examples
  Possible objections
  Conclusion
7 comments

Epistemic status: written for the MSFP blog post day; general and very speculative ideas.

Proposition: Deconfusing metaethics might be a promising way to increase our chances of solving AI alignment.

What do I mean by metaethics?

Metaethics here is understood as the ideal procedure that humans approximate when they reason about ethics, i.e. when they try to build ethical theories. For an analogy, consider mathematics. Part of mathematical practice involves using some logic to prove or disprove conjectures about mathematical objects; deriving theorems, lemmas and properties from axioms within some logic is, roughly, part of how mathematics progresses. Another analogy is how we learn about regularities in the world by approximating Solomonoff induction. We seem to lack a comparable formalised, ideal rational procedure of ethical progress that would help us sort and generate ethical theories. Such a procedure seems difficult to figure out, and potentially crucial for solving AI alignment.
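To make the analogy slightly more concrete, here is a minimal, purely illustrative sketch (in Python) of what one step of such a procedure could look like: candidate ethical theories are scored by their agreement with a small set of considered judgements, penalised by a crude simplicity prior in the spirit of Solomonoff induction. All of the "theories", "cases" and weights below are invented placeholders, not claims from this post.

```python
# A toy, purely illustrative "theory selection" step: candidate theories are
# functions from cases to verdicts; they are scored by agreement with a small
# set of considered judgements, penalised by description length (a crude
# simplicity prior in the spirit of Solomonoff induction).
import math

# Hypothetical "cases" and the judgements we are trying to systematise.
judgements = {
    ("lie", "to_protect_someone"): "permissible",
    ("lie", "for_personal_gain"): "impermissible",
    ("help", "at_small_cost"): "obligatory",
}

# Candidate "theories": simple rules mapping a case to a verdict.
theories = {
    "never_lie": lambda act, ctx: "impermissible" if act == "lie" else "obligatory",
    "context_sensitive": lambda act, ctx: (
        "permissible" if (act, ctx) == ("lie", "to_protect_someone")
        else "impermissible" if act == "lie"
        else "obligatory"
    ),
}

# Crude complexity measure: a made-up description length for each rule.
complexity = {"never_lie": 5, "context_sensitive": 12}

def score(name, theory):
    """Agreement with the judgements, minus a simplicity penalty."""
    agreement = sum(
        theory(act, ctx) == verdict for (act, ctx), verdict in judgements.items()
    )
    return agreement - 0.1 * math.log2(1 + complexity[name])

print({name: round(score(name, theories[name]), 2) for name in theories})
print("selected theory:", max(theories, key=lambda n: score(n, theories[n])))
```

An actual idealised procedure would of course need far more than this, e.g. a principled way of generating candidate theories and of choosing the judgement data itself; the sketch only illustrates the general shape of "sorting theories against criteria".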

Why could this be important?

A better understanding of metaethics could help us decide among different ethical theories and generate new ones. Furthermore, knowing what the world should become and how AI should interact with it might require us to make progress on how we should think about ethics, which would in turn inform how we think about aligning AI. For example, aligning AI with human values, learning and aggregating human preferences in some way, and avoiding X-risks are all ethical propositions about what we should do. It is plausible that these views are flawed, and a better understanding of how to think about ethics might make us reconsider these normative stances and clarify what alignment means.

The following intuition is one of the main reasons I think a better understanding of metaethics might matter for AI alignment research. As I think more about ethics, argue with others about it, and become more informed about the world, my ethical views evolve, and I seem to make some sort of progress: my reasons for holding an ethical view, or for thinking an ethical theory is flawed, become sharper. I therefore tend to put more weight on my future self's moral views, to the extent that he has spent more time thinking about ethics and is better informed about the world, and I trust him more to decide how I should go about transforming it. Similarly, it might be sensible for future AI systems to instantiate a similar process of moral progress, updating their utility functions or goals according to its results; such a process, if transparent and consulted by humans, could figure out how to transform the world through a long and efficient ethical reflection.
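As a very rough picture of that last suggestion, here is a hypothetical sketch of an agent whose goal is only updated through the output of a reflection step that humans can inspect and veto. All of the names and the stand-in logic are invented for illustration; nothing here is a proposal from this post.

```python
# A purely hypothetical sketch of an agent that updates its goal via a
# transparent reflection step which humans can inspect and veto.
# All names (reflect, human_approves, run) are invented for illustration.

def reflect(current_goal: str, new_evidence: list) -> str:
    """Stand-in for a long, idealised process of ethical reflection."""
    if "suffering_matters" in new_evidence:
        return current_goal + " + reduce suffering"
    return current_goal

def human_approves(old_goal: str, proposed_goal: str) -> bool:
    """Humans consult the (transparent) reasoning behind the update."""
    print(f"proposed update: {old_goal!r} -> {proposed_goal!r}")
    return True  # placeholder for an actual oversight step

def run(goal: str, evidence_stream: list) -> str:
    for evidence in evidence_stream:
        proposed = reflect(goal, evidence)
        if proposed != goal and human_approves(goal, proposed):
            goal = proposed  # only update after the reflection is checked
        # ... act in the world according to the current goal ...
    return goal

print(run("satisfy stated preferences", [["suffering_matters"], []]))
```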

Some examples

To clarify, the following non-exhaustive criteria are examples of how we might evaluate ethical theories, and of constraints under which we could generate new ones.

Possible objections

This approach to AI alignment might be too top-down in its current formulation, and it raises a number of difficult challenges and objections to being a research path worth pursuing:

Nevertheless, such a project might have the advantage of not speeding up AI capability research while informing us about values and about how to think about alignment. One important downside, though, is that there might be other, more promising projects to pursue instead.

Conclusion

To conclude, I would like to suggest some possible ways to imagine working toward a better understanding of metaethics and toward producing better ethical theories. These are extremely broad and vague suggestions, meant to stimulate research ideas.


7 comments


comment by Wei Dai (Wei_Dai) · 2019-08-23T17:24:28.155Z · LW(p) · GW(p)

Metaethics here is understood as an ideal procedure that humans are approximating when they reason about ethics i.e. when they are trying to build ethical theories.

That would make sense except that "metaethics" already has a different meaning in academic philosophy, namely studying what morality itself is. (See my Six Plausible Meta-Ethical Alternatives for a really quick intro to the main metaethical positions that I think are plausible.)

What you're calling "metaethics" here corresponds better to what philosophers call metaphilosophy. I've been pushing the importance of researching metaphilosophy in the context of AI alignment for a while, so it's nice to see someone reach similar conclusions independently. :) If you're interested in my thoughts on the topic, see Some Thoughts on Metaphilosophy [LW · GW] and the posts that it links to.

Another line of thinking that's related is CEV.

(I'll probably come back and give some more detailed feedback on the rest of the content, but just wanted to fire off these quick notes for now.)

comment by Guillaume Corlouer (Tancrede) · 2019-08-31T13:57:42.043Z · LW(p) · GW(p)

Thanks for all the useful links! I'm also always happy to receive more feedback.

I agree that the sense in which I use "metaethics" in this post is different from what academic philosophers usually call metaethics. I have the impression that metaethics in the academic sense and metaphilosophy are related: studying what morality itself is, how to select ethical theories, and what process lies behind ethical reasoning do not seem independent. For example, if moral nihilism is more plausible, then it seems less likely that there is a meaningful feedback loop for selecting ethical theories, or that there is such a thing as a ‘good’ ethical theory (at least in an observer-independent way). If moral emotivism is more plausible, then maybe reflecting on ethics is more like rationalising emotions, e.g. expressing in a sophisticated way something that fundamentally just means ‘boo suffering’. In that case, a better understanding of metaethics in the academic sense would shed some light on the process that generates ethical theories, at least in humans.

comment by Gordon Seidoh Worley (gworley) · 2019-08-30T21:32:40.637Z · LW(p) · GW(p)

Like Wei, I'm similarly in favor of research in this direction. I suspect we need, for example, an adequate theory of human values so that we can construct and, more importantly, verify aligned AI, but right now we are so confused about human values I'm not sure we could even tell if an AI was aligned or not.

I have a lot of developing thoughts in this area that have moved beyond what I wrote the last time I tried to put my thinking into a post, a couple of years ago. I'm not sure what I'll find time for in the coming months, or whether I'll solidify my ideas enough for them to be in a shareable state, but I'm happy to talk more if you're interested in pursuing this direction.

comment by Guillaume Corlouer (Tancrede) · 2019-08-31T13:33:52.142Z · LW(p) · GW(p)

Sure, I'm happy to read/discuss your ideas about this topic.

comment by Teerth Aloke · 2019-08-23T15:05:57.538Z · LW(p) · GW(p)

Essentially, your first suggestion is doing computer aided analysis on ethical theories, and proving theorems under them. Right?

comment by Guillaume Corlouer (Tancrede) · 2019-08-23T18:36:22.642Z · LW(p) · GW(p)

I am not sure what computer-aided analysis means here, but one possibility could be to have formal ethical theories and prove theorems inside their formal frameworks. This raises questions about the sort of formal framework one could use to 'prove theorems' about ethics in a meaningful way.

comment by Teerth Aloke · 2019-08-24T16:58:23.514Z · LW(p) · GW(p)

Up to this point, I have heard the idea of an axiomatic system for ethics several times, but no suggestion of what such axioms could be. By computer-aided analysis I mean an automated theorem checker that searches for contradictions in the system.
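(Illustrative aside: a minimal sketch of what such an automated contradiction search could look like over a toy propositional "axiom system", by brute-force truth-table enumeration. The atoms and axioms below are invented placeholders, not a serious proposal for ethical axioms.)

```python
# A toy, purely illustrative sketch of "an automated checker searching for
# contradictions": propositional formulas over a handful of atoms, checked for
# joint satisfiability by brute-force truth-table enumeration.
from itertools import product

ATOMS = ["ought_help", "ought_harm", "can_help"]

# Formulas are functions from a valuation (dict of atom -> bool) to bool.
axioms = {
    "harm_forbidden": lambda v: not v["ought_harm"],
    "help_if_possible": lambda v: (not v["can_help"]) or v["ought_help"],
    # Uncomment to introduce an inconsistency with harm_forbidden:
    # "harm_required": lambda v: v["ought_harm"],
}

def models(formulas):
    """Yield every truth assignment satisfying all the formulas."""
    for values in product([False, True], repeat=len(ATOMS)):
        valuation = dict(zip(ATOMS, values))
        if all(f(valuation) for f in formulas.values()):
            yield valuation

def consistent(formulas):
    return next(models(formulas), None) is not None

def entails(formulas, conclusion):
    """The axioms entail `conclusion` iff axioms plus its negation are unsatisfiable."""
    extended = dict(formulas, negated_goal=lambda v: not conclusion(v))
    return not consistent(extended)

print("axioms consistent:", consistent(axioms))
print("entail 'if can_help then not ought_harm':",
      entails(axioms, lambda v: (not v["can_help"]) or not v["ought_harm"]))
```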