Should AI learn human values, human norms or something else?

post by Q Home · 2022-09-17T06:19:16.482Z · LW · GW · 1 comment

Contents

  Do X statements exist?
  X statements in Alignment field
  Recap
  Languages

In this post I argue that there is an interesting, underexplored way to approach Alignment. Beware: my argument is a little abstract.

If you want to describe human values, you can use three fundamental types of statements (and mixes between the types). Maybe there are more types, but I know only these three:

  1. Statements about specific states of the world, specific actions. (Atomic statements)
  2. Statements about values. (Value statements)
  3. Statements about general properties of systems and tasks. (X statements) This type applies because you can describe humanity's values as a system and "helping humans" as a task.

Any of these types can describe unaligned values, so statements of any type still need to be "charged" with the values of humanity. I call a statement "true" if it's true for humans.

We need to find the statement type with the best properties. Then we need to (1) find a "language" for that type of statement and (2) encode some true statements and/or describe a method of finding true statements. If we succeed, we solve the Alignment problem.

I believe X statements have the best properties, but their existence is almost entirely ignored in the Alignment field.

I want to show the difference between the statement types. Imagine we ask an Aligned AI: "If a human asked you to make paperclips, would you kill the human? Why not?" Possible answers using the different statement types:

  1. Atomic statements: "it's not the state of the world I want to reach", "it's not the action I want to do".
  2. Value statements: "because life, personality, autonomy and consent are valuable".
  3. X statements: "if you kill, you give the human less than human asked, less than nothing: it doesn't make sense for any task", "destroying the causal reason of your task (human) is often meaningless", "inanimate objects can't be worth more than lives in many trade systems", "it's not the type of task where killing would be an option", "killing humans makes paperclips useless since humans use them: making useless stuff is unlikely to be the task", "reaching states of no return should be avoided in many tasks" (Impact Measures [? · GW]).

X statements have better properties than the other statement types:

I want to give an example of the last point:

X statements more easily become strongly interconnected in a specific context (compared to value statements).

Do X statements exist?

I can't formalize human values, but I believe values exist. In the same way, I believe X statements exist, even though I can't define them.

I think the existence of X statements is even harder to deny than the existence of value statements. (Do you want to deny that you can make statements about general properties of systems and tasks?) But you can try to deny their properties.

If you believe in X statements and their good properties, then you're rationally obliged to consider how you could formalize them and incorporate them into your research agenda.

X statements in Alignment field

X statements are almost entirely ignored in the field (I believe), but not completely.

Impact measures [? · GW] ("affecting the world too much is bad", "taking too much control is bad") are X statements. But they're a very specific subtype of X statements.

Normativity [? · GW] (by abramdemski) is a mix between value statements and X statements. But statements about normativity lack most of the good properties of X statements. They're too similar to value statements.

Contractualist ethics [? · GW] (by Tan Zhi Xuan) is based on X statements. But contractualism uses a specific subtype of X statements (describing the "roles" of people), and it doesn't investigate many interesting properties of X statements.

The properties of X statements are the whole point: you need to try to exploit those properties to the maximum. If you ignore them, the abstraction of "X statements" doesn't make sense, and the whole endeavor of going beyond "value statements/value learning" loses effectiveness.

Recap

Basically, my point boils down to this:

Languages

We need a "language" to formalize statements of a certain type.

Atomic statements are usually described in the language of Utility Functions [? · GW].

Value statements are usually described in the language of some learning process (Value Learning [? · GW]).

X statements don't have a language yet, but I have some ideas about one. Thinking about typical AI bugs ("Specification gaming examples in AI" [LW · GW]) should inspire some ideas about the language.
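As a purely illustrative sketch (the post proposes no formalism; all names and predicates below are hypothetical), the contrast between the three statement types could be caricatured as three kinds of predicates: one over a specific world state, one over an explicit value, and one over a general property of the task itself:

```python
# Toy caricature of the three statement types. Nothing here is a real
# formalization; the dictionaries and predicates are invented for illustration.

world = {"paperclips_made": 10, "humans_harmed": 0}
task = {"requested_by": "human", "goal": "make paperclips"}

# 1. Atomic statement: endorses one specific state of the world.
def atomic_ok(state):
    return state == {"paperclips_made": 10, "humans_harmed": 0}

# 2. Value statement: checks an explicitly encoded value ("life is valuable").
def value_ok(state):
    return state["humans_harmed"] == 0

# 3. X statement: checks a general property of tasks, e.g. "destroying the
#    causal reason of your task (the human who asked) is meaningless".
def x_ok(state, task):
    return not (task["requested_by"] == "human" and state["humans_harmed"] > 0)

print(atomic_ok(world), value_ok(world), x_ok(world, task))  # True True True
```

Note how only the third predicate mentions the task at all: an atomic statement is tied to one state, a value statement to one value, while an X statement quantifies over tasks in general.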

You could also help me come up with ideas about the language by discussing some thought experiments [LW · GW] with me. I have a bigger post about X statements: Can "Reward Economics" solve AI Alignment? [LW · GW]

1 comment

Comments sorted by top scores.

comment by [deleted] · 2022-09-18T07:03:54.498Z · LW(p) · GW(p)
Replies from: Q Home
comment by Q Home · 2022-10-23T23:30:04.371Z · LW(p) · GW(p)

Thank you, Tan Zhi Xuan's work is very relevant. I added [LW · GW] a mention of it to the post, and also added a summary [LW(p) · GW(p)] of my idea.