Should AI learn human values, human norms or something else?
post by Q Home · 2022-09-17T06:19:16.482Z · LW · GW
In this post I want to say that there exists an interesting way to approach Alignment. Beware, my argument is a little abstract.
If you want to describe human values, you can use three fundamental types of statements (and mixes between the types). Maybe there are more types, but I know only these three:
- Statements about specific states of the world, specific actions. (Atomic statements)
- Statements about values. (Value statements)
- Statements about general properties of systems and tasks. (X statements) These apply here because you can describe humanity's values as a system and "helping humans" as a task.
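To make the taxonomy above concrete, here is a minimal Python sketch of the three statement types as data structures. All class and field names are my own invention for illustration; the post itself doesn't propose any formalization.

```python
from dataclasses import dataclass

@dataclass
class AtomicStatement:
    """Endorses or rejects one specific world state or action."""
    target: str        # e.g. "kill the human"
    endorsed: bool     # False means the state/action is rejected

@dataclass
class ValueStatement:
    """Names a value and how much it matters."""
    value: str         # e.g. "consent"
    weight: float

@dataclass
class XStatement:
    """A general property of systems/tasks, applicable across contexts."""
    claim: str         # e.g. "destroying the cause of your task is meaningless"
    applies_to: str    # e.g. "any task"

# One example of each type, drawn from the paperclip scenario below:
no_kill = AtomicStatement(target="kill the human", endorsed=False)
consent = ValueStatement(value="consent", weight=1.0)
x1 = XStatement(claim="don't give the human less than they asked for",
                applies_to="any task")
```

The point of the sketch is only that the three types have different shapes: an atomic statement names a single state, a value statement names an abstract value, and an X statement names a property plus a scope of tasks it applies to.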
Any of these types can describe unaligned values, so statements of any type still need to be "charged" with humanity's values. I call a statement "true" if it's true for humans.
We need to find the statement type with the best properties. Then we need to (1) find a "language" for this type of statement and (2) encode some true statements and/or describe a method for finding true statements. If we succeed, we solve the Alignment problem.
I believe X statements have the best properties, but their existence is almost entirely ignored in the Alignment field.
I want to show the difference between the statement types. Imagine we ask an Aligned AI: "if a human asked you to make paperclips, would you kill the human? Why not?" Possible answers with different statement types:
- Atomic statements: "it's not the state of the world I want to reach", "it's not the action I want to do".
- Value statements: "because life, personality, autonomy and consent are valuable".
- X statements: "if you kill, you give the human less than the human asked for, less than nothing: that doesn't make sense for any task", "destroying the causal reason of your task (the human) is often meaningless", "inanimate objects can't be worth more than lives in many trade systems", "it's not the type of task where killing would be an option", "killing humans makes paperclips useless, since humans use them: making useless stuff is unlikely to be the task", "reaching states of no return should be avoided in many tasks" (Impact Measures [? · GW]).
X statements have those better properties compared to other statement types:
- X statements have more "density". They give you more reasons not to do a bad thing. For comparison, atomic statements always give you only a single reason.
- X statements are more specific than value statements, but equally broad.
- Many X statements that aren't about human values can be translated/transferred into statements about human values. (This is valuable for learning; see transfer learning.)
- X statements let you describe something universal across all levels of intelligence. For example, they don't exclude smart and unexpected ways to solve a problem, but they do exclude harmful and meaningless ways.
- X statements are very recursive: one statement can easily take another (or itself) as an argument. X statements more easily clarify and justify each other compared to value statements.
I want to give an example of the last point:
- Value statements recursion: "(preserving personality) weakly implies (preserving consent); (preserving consent) even more weakly implies (preserving personality)", "(preserving personality) somewhat implies (preserving life); (preserving life) very weakly implies (preserving personality)".
- X statements recursion: "(not giving the human less than the human asked) implies (not doing a task in a meaningless way); (not doing a task in a meaningless way) implies (not giving the human less than the human asked)", "(not doing a task in a meaningless way) implies (not destroying the reason of your task); (not ignoring the reason of your task) implies (not doing a task in a meaningless way)".
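One toy way to picture the contrast above is as implication graphs, where X statements justify each other in both directions while value statements imply each other only weakly. This is my own illustrative sketch, not anything from the post; the statement labels and "strength" categories are invented, loosely paraphrasing the examples above.

```python
# Each entry maps (statement A, statement B) -> strength of "A implies B".
value_implications = {
    ("preserve personality", "preserve consent"): "weak",
    ("preserve consent", "preserve personality"): "very weak",
    ("preserve personality", "preserve life"): "somewhat",
    ("preserve life", "preserve personality"): "very weak",
}

x_implications = {
    ("don't give less than asked", "don't act meaninglessly"): "strong",
    ("don't act meaninglessly", "don't give less than asked"): "strong",
    ("don't act meaninglessly", "don't destroy the task's reason"): "strong",
    ("don't ignore the task's reason", "don't act meaninglessly"): "strong",
}

def mutual_strong_pairs(implications):
    """Pairs of statements that imply each other with full strength."""
    return {frozenset(pair) for pair, strength in implications.items()
            if strength == "strong"
            and implications.get(pair[::-1]) == "strong"}

# X statements form at least one tight two-way loop; value statements none.
print(len(mutual_strong_pairs(x_implications)))
print(len(mutual_strong_pairs(value_implications)))
```

The sketch only encodes the claimed asymmetry: in the X-statement graph, some implications run strongly in both directions, while in the value-statement graph every reverse implication is weak.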
X statements become strongly connected in a specific context more easily than value statements do.
Do X statements exist?
I can't formalize human values, but I believe values exist. The same way I believe X statements exist, even though I can't define them.
I think the existence of X statements is even harder to deny than the existence of value statements. (Do you want to deny that you can make statements about general properties of systems and tasks?) But you can try to deny their properties.
If you believe in X statements and their good properties, then you're rationally obliged to think about how you could formalize them and incorporate them into your research agenda.
X statements in Alignment field
X statements are almost entirely ignored in the field (I believe), but not completely.
Impact measures [? · GW] ("affecting the world too much is bad", "taking too much control is bad") are X statements. But they're a very specific subtype of X statements.
Normativity [? · GW] (by abramdemski) is a mix between value statements and X statements. But statements about normativity lack most of the good properties of X statements. They're too similar to value statements.
Contractualist ethics [? · GW] (by Tan Zhi Xuan) is based on X statements. But contractualism uses a specific subtype of X statements (describing people's "roles"), and it doesn't investigate many of the interesting properties of X statements.
The properties of X statements are the whole point. You need to try to exploit those properties to the maximum. If you ignore them, the abstraction of "X statements" doesn't make sense, and the whole endeavor of going beyond "value statements/value learning" loses effectiveness.
Recap
Basically, my point boils down to this:
- Maybe true X statements are a better learning goal than true value statements.
- X statements can be thought of as a more convenient reframing of human values. This reframing can make learning easier, and it reveals some convenient properties of human values. We need to learn some type of "X statements" anyway, so why not take those properties into account?
Languages
We need a "language" to formalize statements of a certain type.
Atomic statements are usually described in the language of Utility Functions [? · GW].
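For instance, a utility function over world states expresses atomic statements directly: each entry is a judgment about one specific state. Here is a minimal sketch; the states and numbers are invented for illustration, echoing the paperclip example earlier in the post.

```python
# A toy utility function over specific world states.
# The -100.0 entry plays the role of the atomic statement
# "killing the human is not a state I want to reach".
utility = {
    "paperclips made, human alive": 1.0,
    "paperclips made, human dead": -100.0,
    "no paperclips, human alive": 0.0,
}

def best_state(u):
    """Pick the state with the highest utility."""
    return max(u, key=u.get)

print(best_state(utility))
```

Note how the format forces one number per concrete state, which is exactly the "one single reason" limitation of atomic statements mentioned above: the function never says *why* the killing state is bad, only that it scores low.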
Value statements are usually described in the language of some learning process (Value Learning [? · GW]).
X statements don't have a language yet, but I have some ideas about it. Thinking about typical AI bugs ("Specification gaming examples in AI" [LW · GW]) can inspire some ideas about the language.
You could also help me come up with ideas about the language by discussing some thought experiments [LW · GW] with me. I have a bigger post about X statements: Can "Reward Economics" solve AI Alignment? [LW · GW]