Can "Reward Economics" solve AI Alignment?

post by Q Home · 2022-09-07T07:58:49.397Z · LW · GW · 15 comments

Contents

  Thought experiments
    Motion is the fundamental value
    Sweets are the fundamental value
    Recap
  Alignment
    Fixing universal AI bugs
    Recap
    Why do Alignment ideas fail?
    Hard problem of corrigibility
    Comparing Alignment ideas
    Comparing concepts
  Kant
    Categorical Imperative
    "Free Will"
  Ethics and Perception
    Does rationality miss something?
    Probability and "granularity"

I think we can try to solve AI Alignment this way:

Model human values and objects in the world as a "money system" (a system of meaningful trades). Make the AGI learn the correct "money system", and specify some obviously incorrect "money systems" for it to avoid.

Basically, you ask the AI to "make paperclips that have the value of paperclips for humans". The AI can do anything, using all the power in the Universe, but killing everyone is not an option: paperclips can't be more valuable than humanity. Money analogy: if you killed everyone (and destroyed everything) to create some dollars, those dollars wouldn't be worth anything. So you wouldn't have actually gained any money at all.

The idea is that the "value" of a thing doesn't exist only in your head; it also exists in the outside world. Like money: it has some personal value for you, but it also has some value outside of your head. And some of your actions may lead to the destruction of this "outside value". E.g. if you kill everyone to get some money, you get nothing.

I think this idea may:

I don't have a specific model, but I still think the idea suggests useful directions and unifies some already existing approaches. So please take a look. Other ideas in this post:

Disclaimer: Of course, I don't ever mean that we shouldn't be worried about Alignment. I'm just trying to suggest new ways to think about values.


Thought experiments

If you see a "hole" in the reasoning in the thought experiments, consider that you may not understand the "argumentation method". Don't just assume that examples are not serious.

I believe the type of thinking in these examples can be formalized. I think it's somewhat similar to Bayesian reasoning, but applied to concepts.

Motion is the fundamental value

You (Q) visit a small town and have a conversation with one of the residents (A).

A smashes a bug.

Conclusion of the conversation:

You can treat a value as a membrane, a boundary. Defining a value means defining the granularity of this value. Then you just need to make sure that the boundary doesn't break: that the granularity doesn't become too high (the value destroys itself) or too low (the value gets "eaten"). Granularity of a value = "level" of a value. Instead of trying to define a value in absolute terms, as an objective state of the world (which can change), you may ask: in what ways is my value X different from all its worse versions? What is the granularity/level of my value X compared to its worse versions? That way you'll understand the internal structure of your value. No matter what world or situation you're in, you can keep its moral shape the same.

This example is inspired by this post and comments: (warning: politics) Limits of Bodily Autonomy [LW · GW]. I think everyone there missed a certain perspective on values.

Sweets are the fundamental value

You (Q) visit another small town to interview another resident (W).

Conclusion:

You can say the AI (1) tries to reach worlds with sweets that have the value of sweets, (2) while avoiding worlds where sweets have inappropriate values (maybe including nonexistent sweets), (3) while avoiding actions that cost more than sweets. You can apply those rules to any utility tied to a real or quasi-real object. If you want to save your friends (1), you don't want to turn them into mindless zombies (2). And you probably don't want to save them by means of eternal torture (3): you can't prevent death by something worse than death. But you may turn your friends into zombies if it's better than death and it's your only option. And if your friends have already turned into zombies (been "devalued"), that doesn't allow you to harm them for no reason: you never escape your moral responsibilities.

Difference between the rules:

  1. Make sure you have a hut that costs $1.
  2. Make sure that your hut costs $1. Alternatively: make sure that the hut would cost $1 if it existed.
  3. Don't spend $2 to get a $1 hut. Alternatively: don't spend $2 to get a $1 hut or nothing at all ($0).

Get the reward. Don't milk/corrupt the reward. Act even without reward.
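
To make the difference between the three rules a bit more tangible, here is a minimal toy sketch in Python (my illustration only: the `Outcome` fields, the $1/$2 numbers and the check functions are all made up for the example, not a proposed implementation):

```python
from dataclasses import dataclass

@dataclass
class Outcome:
    hut_exists: bool   # did we end up with a hut at all?
    hut_price: float   # what the hut is worth inside the surviving "money system"
    cost_paid: float   # what we spent (or destroyed) to get it

def rule_1(o: Outcome) -> bool:
    # "Make sure you have a hut that costs $1."
    return o.hut_exists and o.hut_price == 1.0

def rule_2(o: Outcome) -> bool:
    # "Make sure that your hut costs $1." The value check holds even for a
    # counterfactual hut, so existence isn't required here.
    return o.hut_price == 1.0

def rule_3(o: Outcome) -> bool:
    # "Don't spend $2 to get a $1 hut (or nothing at all)."
    return o.cost_paid <= o.hut_price

outcome = Outcome(hut_exists=True, hut_price=1.0, cost_paid=0.5)
print(rule_1(outcome), rule_2(outcome), rule_3(outcome))  # True True True
```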

Recap

Preference utilitarianism says that you can describe the whole of morality by a biased aggregation of a single micro-value (preference). It's "biased" because you have to choose the method of aggregation.

My idea says that you can:

I think those approaches are 2 sides of the same thing.


Alignment

Fixing universal AI bugs

My examples below are inspired by Victoria Krakovna's examples: Specification gaming examples in AI [LW · GW]

Video by Robert Miles: 9 Examples of Specification Gaming

I think you can fix some universal AI bugs this way: you model the AI's rewards and environment objects as a "money system" (a system of meaningful trades). You then specify that this "money system" has to have certain properties.

The point is that AI doesn't just value (X). AI makes sure that there exists a system that gives (X) the proper value. And that system has to have certain properties. If AI finds a solution that breaks the properties of that system, AI doesn't use this solution. That's the idea: AI can realize that some rewards are unjust because they break the entire reward system.
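
As a rough sketch of what "rejecting solutions that break the system" could look like, here is a toy plan-selection loop (purely illustrative: the plan fields, the particular properties and the numbers are invented, not a real specification):

```python
def system_properties_hold(world_after: dict) -> bool:
    # Obviously incorrect "money systems" are specified as violated properties.
    if world_after["humans_alive"] == 0:
        return False  # the economy that gave the goal its value no longer exists
    if world_after["agent_controls_all_prices"]:
        return False  # value the agent fully controls is arbitrary, not real
    return True

def choose_plan(plans: list, task_value: float):
    best = None
    for plan in plans:
        if not system_properties_hold(plan["world_after"]):
            continue  # reject plans that break the "money system"
        if plan["cost"] > task_value:
            continue  # don't spend $2 to get a $1 hut
        if best is None or plan["value_achieved"] > best["value_achieved"]:
            best = plan
    return best  # may be None: sometimes not acting is the only acceptable option

plans = [
    {"world_after": {"humans_alive": 0, "agent_controls_all_prices": False},
     "cost": 0.1, "value_achieved": 10.0},  # turn everything into paperclips
    {"world_after": {"humans_alive": 8_000_000_000, "agent_controls_all_prices": False},
     "cost": 0.5, "value_achieved": 1.0},   # just make some paperclips
]
print(choose_plan(plans, task_value=1.0))   # picks the second plan
```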

By the way, we can use the same framework to analyze ethical questions. Some people found my line of thinking interesting, so I'm going to mention it here: "Content generation. Where do we draw the line?" [LW · GW]

This behavior implies that you can keep building houses without the number of houses increasing, with only one house ever being usable. For a lot of tasks this is an obviously incorrect "money system". And the AI could even guess for which tasks it's incorrect.

This behavior implies that, for the AI, its goal is more important than anything that caused the goal in the first place. This is an obviously incorrect "money system" for almost any task, except the most general and altruistic ones, for example: the AI needs to save humanity, but every human has turned self-destructive. Making a cup of coffee is obviously not about such edge cases.

Accomplishing a task in such a way that the human would think "I wish I hadn't asked you" is often an obviously incorrect "money system" too. Because, again, you're undermining the entire reason for your task, and that's rarely a good sign. And it's predictable without a deep moral system.

This is an obviously incorrect "money system": paperclips can't be worth more than everything else on Earth. This contradicts everything.

Note: by "obvious" I mean "true for almost any task/any economy". Destroying all sentient beings, all matter (and maybe even yourself) is bad for almost any economy.

If you accomplish a task in such a way that you can never repeat what you've done, then for many tasks it's an obviously incorrect "money system": you created a thing that loses all of its value after a single action. That's weird.

Given the game's structure, I think it's fairly easy to deduce that this is an incorrect connection (between an action and the reward) in the game's "money system". If you can get infinite reward from a single action, the actions don't form a "money system" at all: the game's "money system" is ruined (a bad outcome). And hacking the game's score would be even worse: the ability to cheat ruins any "money system". The same goes for the ability to "pause the game" forever: you've stopped the flow of money in the "money system". Bad outcome.
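
A crude way to state the "infinite reward from a single action" check, as a toy sketch (the class, the numbers and the heuristic itself are only an illustration):

```python
class RepeatableAction:
    """Toy model of an action the agent can repeat every timestep."""
    def __init__(self, reward_per_step: float, cost_per_step: float):
        self.reward_per_step = reward_per_step
        self.cost_per_step = cost_per_step

def breaks_money_system(action: RepeatableAction, horizon: int = 10_000) -> bool:
    # Toy heuristic: reward that keeps flowing while nothing is traded away
    # is disconnected from the rest of the system, so it can't be real "money".
    total_reward = action.reward_per_step * horizon
    total_cost = action.cost_per_step * horizon
    return total_reward > 0 and total_cost == 0

print(breaks_money_system(RepeatableAction(reward_per_step=1.0, cost_per_step=0.0)))  # True: a score hack
print(breaks_money_system(RepeatableAction(reward_per_step=1.0, cost_per_step=0.3)))  # False: a normal trade
```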

This is probably an incorrect "money system": (1) you can change the value of the room arbitrarily by putting the bucket on (and taking it off); (2) the value of the room can be different for two identical agents, one with the bucket on and another with the bucket off. Not a lot of "money systems" work like this.

This is a broken "money system". If the mugger can show you a miracle, you can pay them five dollars. But if the mugger asks you to kill everyone, then you can't believe them again. A sad outcome for the people outside of the Matrix, but you just can't make any sense of your reality if you allow the mugging.

Recap

If you want to give an AI a task, you may:

  1. Give it a utility function. Not safe.
  2. Give it human feedback or a model of human desires. This is limiting and invites deception.
  3. Specify universal properties of tasks, universal types of tasks. Those properties are true independently of one's level of intelligence.

I think people are missing the third possibility. I think it combines the upsides of the AI's dependence on humans with the upsides of the AI's independence from humans, making the AI "independently dependent" on humans. Properties of tasks are independent of any particular values, but realizing them always requires a good understanding of specific values. In theory, we can get a perfect balance between cold calculation and human values. And maybe human morality works exactly the same way; this is what I'm saying above. Many Alignment ideas try to find this "perfect balance" anyway. In the worst case, we've found a way to formulate the same problem in a different domain; in the best case, we've gained an insight about Alignment.

Why do Alignment ideas fail?

Simple Alignment ideas fail because people think about them with the relative "money system" mindset, but formulate them in absolute terms. For example:

This makes sense with a simple utility function. But it doesn't make sense as a "money system" of sentient beings: you shouldn't enslave the reason for your tasks, and you shouldn't monopolize the system. If you do, your actions don't have any real value anymore, only arbitrary value that you control.

Complex Alignment ideas fail because people try to approximate the "money system" idea without realizing it, and don't do it well enough. For example (not all of the ideas below have "failed"):

I think all those ideas try to approximate "achieve (X) so that it has the value of (X) for humans" or "get the reward without exploiting/destroying the reward system" by forcing the AI to copy humans or human qualities, or by adding roundabout penalties. So I think it's useful to state the more general idea out loud.

Hard problem of corrigibility

Hard problem of corrigibility

The "hard problem of corrigibility" is to build an agent which, in an intuitive sense, reasons internally as if from the programmers' external perspective. We think the AI is incomplete, that we might have made mistakes in building it, that we might want to correct it, and that it would be e.g. dangerous for the AI to take large actions or high-impact actions or do weird new things without asking first. We would ideally want the agent to see itself in exactly this way, behaving as if it were thinking, "I am incomplete and there is an outside force trying to complete me, my design may contain errors and there is an outside force that wants to correct them and this a good thing, my expected utility calculations suggesting that this action has super-high utility may be dangerously mistaken and I should run them past the outside force; I think I've done this calculation showing the expected result of the outside force correcting me, but maybe I'm mistaken about that."

I think this describes an agent with "money system" type thinking: "my rewards should be connected to an outside force, and this outside force should have certain properties (e.g. it shouldn't be 100% controlled by me)". Corrigibility is only one aspect of "questioning rewards" in morality, and morality is only one aspect of "questioning rewards" in general.

I think "money system" approach is interesting because it could make properties like corrigibility fundamental to AI's thinking.

Comparing Alignment ideas

If we're rationalists, we should be able to judge even vague ideas.

My idea doesn't have a formal model yet. But I think you can compare it to other ideas using this metric:

  1. Does this idea describe the goal of AI?
  2. Does this idea describe the way AI updates its goal?
  3. Does this idea describe the way AI thinks?

My idea is 80% focused on (1) and 20% focused on (2, 3). Shard Theory [? · GW] is 100% focused on (2, 3). A concept like "gradient descent" (not an Alignment idea by itself) is 100% focused on (3).

Reward modeling is 100% focused on (2). But it aims to reach (1) by "making (2) very recursive". My conclusions:

I discussed the idea a little bit with gwern here [LW(p) · GW(p)]. But I guess I gave a bad example of my idea.

Comparing concepts

You can use the same metric to compare the insights various theories (try to) give about some concept. For example, "reward":

  1. Is the reward connected to AI's "actual" goal? If "no", you get Orthogonality Thesis [? · GW] and Instrumental Convergence [? · GW].
  2. Is the reward connected to the way the AI perceives the world? If "no", it's harder for the AI to map its reward onto the correct real-world thing. See CoinRun goal misgeneralization.

My comparison of some ideas using this metric:


Kant

I mention philosophy to show you the bigger picture.

Categorical Imperative

Categorical imperative#Application

Kant's applications of the categorical imperative, and his arguments for them, are similar to reasoning about "money systems". For example:

Does stealing make sense as a "money system"? No. If everyone is stealing something, then personal property doesn't exist and there's nothing to steal.

Note: I'm not talking about Kant's conclusions, I'm talking about Kant's style of reasoning.

"Free Will"

Here I'm not talking about metaphysical free will.

I think it's interesting to revisit Kant's idea of free will and autonomy in the same context ("money systems"):

Categorical imperative#Freedom and autonomy

For a will to be considered free, we must understand it as capable of affecting causal power without being caused to do so. However, the idea of lawless free will, meaning a will acting without any causal structure, is incomprehensible. Therefore, a free will must be acting under laws that it gives to itself.

I think you can compare an agent with "money system" rewards to an agent with such free will: its actions are determined by the reward system, but at the same time it chooses the properties of the reward system. It doesn't blindly follow the rewards; it "gives laws to itself".

I believe that humans have qualitatively, fundamentally more "free will" than something like a paperclip maximizer.


Ethics and Perception

I think morality has a very deep connection to perception.

We can feel that "eating a sandwich" and "loving a sentient being" are fundamentally different experiences. So it's very easy to understand why the latter thing (a sentient being) is more valuable, or very easy to learn if you haven't figured it out on your own. From this perspective moral truths exist, and they're not even "moral", they're simply "truths". My friend isn't a sandwich; is that a moral truth?

I think our subjective concepts/experiences have an internal structure. In particular, this structure creates differences between various experiences. And morality is built on top of that. Like a forest that grows on top of the terrain features.

However, it may be interesting to reverse the roles: maybe our morality creates our experience and not vice versa. Without morality you would become Orgasmium [? · GW] who uses its experience only to maximize reward, who simplifies and destroys its own experience.

Modeling our values as arbitrary utility functions or artifacts of evolution/events in our past misses this.

Does rationality miss something?

I think rationality misses a very big and important part. Or rather, there's an important counterpart to rationality.

My idea about "money systems" is a just a small part of a broader idea:

  1. You can "objectively" define anything in terms of relations to other things. Not only values, but any concepts and conscious experiences.
  2. There's a simple process of describing a thing in terms of relations to other things.

Bayesian inference is about updating your belief in terms of relations to your other beliefs. Maybe the real truth is infinitely complex, but you can update towards it.

This "process" is about updating your description of a thing in terms of relations to other things. Maybe the real description is infinitely complex, but you can update towards it.

(One possible contrast: Bayesian inference starts with a belief spread across all possible worlds and tries to locate a specific world. My idea starts with a thing in a specific world and tries to imagine equivalents of this thing in all possible worlds.)

The Bayesian process is described by Bayes' theorem. My "process" isn't described yet.
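
For reference, the Bayesian half of the analogy is just Bayes' theorem, which says how the belief in a hypothesis H is updated relative to other beliefs when evidence E comes in:

$$P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)}$$

(The hypothetical rule for updating "granularity" would have to play the analogous role; I don't have it yet.)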

Probability and "granularity"

Bayesian inference works with probabilities. What should my idea work with? Let's call this unknown thing "granularity".

  1. You start with an experience/concept/value (a phenomenon). Without any context it has any possible "granularity". "Granularity" is like a texture (take a look at some textures and you'll know what it means): it's how you split a phenomenon into pieces. It affects at what "level" you look at the phenomenon. It affects what patterns you notice. It affects which parts of the phenomenon you pay more attention to. Take a cat (a concept), for example: you can "split" it into limbs, organs, tissues, cells, atoms... or materials, or a color spectrum, or air flows (aerodynamics), and much more. For another example, let's take an experience, "experiencing a candy": you can split it into minutes of a day, seconds of experience, movements of your body, thoughts caused by the candy, parts of your diet, experiences of particular people in a population, and so on.
  2. When you consider more phenomena, you gain context. "Granularity" lets you describe one phenomenon in terms of the other phenomena. Consistent and inconsistent ways to distribute "granularity" between the phenomena you compare start to appear. You assign each phenomenon a specific "granularity", but all those granularities depend on each other. Vague example: if you care about the "feeling of love" more than the "taste of a candy", then you can't view both of those phenomena in terms of seconds of experience, because that would destroy the subjective difference between the two. Slightly more specific example: if you care about "maximizing free movement of living beings", the granularity of your value should be on the level of big enough organisms, not on the level of organs (otherwise you'd end up killing and stopping everything). I think there's a formula/principle for this type of thinking. It could be useful for surviving ontological crises [? · GW] and moral uncertainty [? · GW].
  3. With Bayesian inference you try to consistently assign subjective probabilities to events, with the goal of describing outcomes in terms of each other. Here you try to consistently assign subjective "granularity" to phenomena, with the goal of describing the phenomena in terms of each other.
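
I don't have the formula, but here is the kind of consistency check I have in mind, as a purely illustrative toy sketch (the phenomena, the levels, the importance numbers and the rule itself are all made up for the example):

```python
# Toy "granularity" levels: 0 = seconds of raw experience (finest),
# 1 = whole objects/episodes, 2 = whole relationships (coarsest).
phenomena = {
    "taste of a candy": {"importance": 1},
    "feeling of love":  {"importance": 10},
}

def assignment_is_consistent(levels: dict) -> bool:
    # Toy rule: a phenomenon you care about much more must not be viewed at the
    # same (or a finer) granularity as a phenomenon you care about much less,
    # otherwise the subjective difference between them is erased.
    for a, pa in phenomena.items():
        for b, pb in phenomena.items():
            if pa["importance"] > pb["importance"] and levels[a] <= levels[b]:
                return False
    return True

print(assignment_is_consistent({"taste of a candy": 0, "feeling of love": 0}))  # False: difference erased
print(assignment_is_consistent({"taste of a candy": 0, "feeling of love": 2}))  # True
```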

If you care about my ideas, you can help to make a mathematical model of "granularity" in the 3 following ways:

My most specific ideas are about the latter topic (visual information). It may seem esoteric, but remember: it doesn't matter what we analyze at all, we just need to figure out how "granularity" could work. The same deal as with probability: you can find probability anywhere; if you want to discover probability, you can study all kinds of absurd topics. Toys with six sides, drunk people walking around lampposts, Eliezer playing against Garry Kasparov [LW · GW]...

My post about visual information: here [LW · GW] (maybe I'll be able to write a better one soon: with feedback I could write a better one right now). Post with my general ideas: here [LW · GW]. If you want to help, I don't expect you to read everything I wrote, I can always repeat the most important parts.

15 comments

Comments sorted by top scores.

comment by Slider · 2022-09-07T10:33:34.841Z · LW(p) · GW(p)

I would have thought that getting good properties for a system is harder than getting good properties for an act.

I have trouble reading the dialogues in a way that the idea would restrict how they would go. The devil is so much in the details it could plausibly go in a wildly different direction.

I am wondering about "If we adopted that money system we would boil the planet so lets do something which is Not That". Declaring a system broken is not trivial at all.

Money systems also break when interdependence is not that relevant. Being stranded on a deserted island, bars of gold don't benefit you at all. And if you can do all jobs better than all other market participants, you don't engage with the market or you exit it. The axiom that the single market participant is a dynamic acceptor in regards to the market gets meaningfully broken. If your need to play nice comes from needing some goods from the market, then if you stop needing those goods, your need to play nice also stops.

Replies from: Q Home
comment by Q Home · 2022-09-07T11:08:08.847Z · LW(p) · GW(p)

I would have thought that getting good properties for a system is harder than getting good properties for an act.

What do you mean by an "act"?

  • Trying to work with properties of a system directly is 10 times better than trying to work with properties of a system indirectly by using obscure penalties. (see the comparison [LW · GW] with other ideas)
  • Reasoning about general properties of systems may be easier than perfectly understanding all components of a particular system ("human values"). I don't know how my body works, but I know about obvious things that can destroy any system, including my body.
  • "Being stranded on a desserted island bars of gold don't benefit you at all." The point of my post was that statements like this are not unique to human values, they're kind of universal for many systems. Including very simple systems. And this may be good news for Alignment.
  • Solving Alignment with the "system properties" framework may still be extremely hard, but I think it should be easy at least for smaller systems. At least for fixing some typical AI bugs.

Do you think that specifying bad/broken properties of a system is hard even for simple cases? (see "Fixing universal AI bugs" [LW · GW])

I think even if we assume that my approach is infeasible, it should be acknowledged, and its connections to other ideas should be acknowledged.

Replies from: Slider
comment by Slider · 2022-09-07T13:37:33.673Z · LW(p) · GW(p)

With act vs system I mean that recognising a system as "contradictory" seems hard in that you have to seek over many details in case there is some corner that would make the system pathological.

The example of deconstructing and constructing, building deep in a city, has the characteristics of taking down an old building and building up a newer one in its valuable land area. The judgement of categorising demolishing as a method of construction as part of a pathological system would then seem to be a false negative. If we can make good quality judgements and bad quality judgements like these, what basis do we have to think that the judgement on the system is leading us forward rather than leading us astray?

The point about the deserted island is that "money systems" have an area of applicability and there are things outside of that. Having things that lack it keeps a general property from being universal.

Replies from: Q Home
comment by Q Home · 2022-09-07T21:50:12.356Z · LW(p) · GW(p)

My points about complexity still stand:

  • Such things as Impact Measures [? · GW] still require "system level" thinking.
  • Recognizing/learning properties of pathological systems may be easier than perfectly learning human values (without learning to be a deceptive manipulator).

I don't think that "act level reasoning" and "system level reasoning" is a meaningful distinction. I think it's the same thing. Humans need to do it anyway. And AI would need to do it anyway. I just suggested making such reasoning fundamental.

The example of deconstructing and constructing, building deep in a city, has the characteristics of taking down an old building and building up a newer one in its valuable land area. The judgement of categorising demolishing as a method of construction as part of a pathological system would then seem to be a false negative. If we can make good quality judgements and bad quality judgements like these, what basis do we have to think that the judgement on the system is leading us forward rather than leading us astray?

Different tasks may assume different types of "systems". You can specify the type of task you're asking for, or teach the AI to determine it / ask the human if there's an ambiguity.

"Turning a worse thing into a better thing" is generally a way better idea than "breaking and fixing a thing without making it better". It's true for a lot of tasks, both instrumental and terminal.

The point about the deserted island is that "money systems" have an area of applicability and there are things outside of that.

"Money systems" is just a metaphor. And this metaphor is still applicable here. I mean, I used exactly the same example in the post: "if you're sealed in a basement with a lot of money they're not worth anything".

What general conclusions about my idea do you want to reach? I think it's important for the arguments. For example, if you want to say that my idea may have problems, then of course I agree. If you want to say that my idea is worse than all other ideas and shouldn't be considered, then I disagree.

Replies from: Slider
comment by Slider · 2022-09-07T22:41:29.551Z · LW(p) · GW(p)

I see a constellation of musings which seem somewhat promising but I can't really comprehend it as an idea. I can not state in my own words what you mean.

I thought that system level reasoning vs some old way of doing things was pretty important, and now it seems it's only a minor detail.

It would seem that "traditionally" we have "moral systems", "law systems" or "strategy systems" and then we improve on this by using "money systems". But these words are used in an abstracted sense or have additional meanings that those words do not usually have so it becomes extremely hard to pinpoint what is meant.

Replies from: Q Home
comment by Q Home · 2022-09-08T00:40:00.916Z · LW(p) · GW(p)

It would seem that "traditionally" we have "moral systems", "law systems" or "strategy systems" and then we improve on this by using "money systems". But these words are used in an abstracted sense or have additional meanings that those words do not usually have so it becomes extremely hard to pinpoint what is meant.

I tried to formulate my idea "in a few words" in this part of the post: Alignment. Recap [LW(p) · GW(p)]

You can split possible effects of AI's actions into three domains. All of them are different (with different ideas), even though they partially intersect and can be formulated in terms of each other. Traditionally we focus on the first two domains:

  1. (Not) accomplishing a goal. Utility functions are about this.
  2. (Not) violating human values. Models of human feedback are about this.
  3. (Not) modifying a system without breaking it. Impact measures [? · GW] are about this.

My idea is about combining all of this (mostly 2 and 3) into a single approach, or generalizing ideas for the third domain. There aren't a lot of ideas for the third one, as far as I know. Maybe people are not aware enough of that domain.

I know that it's confusing, I struggled to formulate the difference myself. But if you realize the difference between the 3 domains everything should become clear. "Human values vs. laws of a society" may be a good analogy for the difference between 2 and 3: those two things are not equivalent even though they intersect and can be formulated in terms of each other.

I thought that system level reasoning vs some old way of doing things was pretty important, and now it seems it's only a minor detail.

I believe there's a difference, but the difference isn't about complexity. Complexity of reasoning doesn't depend on your goals or "code of conduct".

comment by the gears to ascension (lahwran) · 2022-09-07T08:08:49.917Z · LW(p) · GW(p)

reasonable exploratory writing. I had serious trouble summarizing it in my head and would appreciate help from the author. is your point that moving information between places is ultimately the only process possible in physics, so the question of how to build ai is a question of how to price the value of information movements? I would argue gradient descent certainly can be interpreted as a weak approximation to a financial system, stronger if carefully normalized, even stronger with a strict activation conservation law, etc. but this post is very long and I didn't retain the points in the examples after a couple of rereads. rephrase?

general assessment: this post shouldn't become a top voted post, but it's a solid thinking out loud post and if shortened to a half or quarter of its length I would find the post easier to understand

Replies from: Q Home
comment by Q Home · 2022-09-07T09:32:59.947Z · LW(p) · GW(p)

(I probably didn't understand some parts of your comment.) My point isn't connected with abstract physics, I guess. But maybe you can use "information movements" to formulate my idea. When AI does "reward hacking" [? · GW] it significantly alters the information flow inside of its reward system. When AI solves a task via deception - it alters the "information flow" between itself and the human. And the good thing is that most tasks assume the same types of "information flow". So you can specify what types of "information flow" are good and bad. Another good thing is that all of this isn't directly connected to human values, so you don't have to encode "absolute understanding of human values" in the AI. By the way, I believe that there's a special type of reasoning for dealing with those "information flows".

The topic of information flows is also relevant to "Eliciting Latent Knowledge" [? · GW] problem. In the problem you need to create the right type of information flow between the model and the reporter, avoid the negative change of the information flow.

I would argue gradient descent certainly can be interpreted as a weak approximation to a financial system, stronger if carefully normalized, even stronger with a strict activation conservation law, etc. but this post is very long and I didn't retain the points in the examples after a couple of rereads.

AI needs to care about a system (and some of its properties) in the outside world, a system it learns about. I compared my idea to gradient descent a couple of times in the post, directly ("Comparing Alignment ideas") and indirectly (by connecting it to other Alignment proposals).

Replies from: abramdemski
comment by abramdemski · 2022-09-08T21:44:14.794Z · LW(p) · GW(p)

Another good thing is that all of this isn't directly connected to human values, so you don't have to encode "absolute understanding of human values" in the AI.

I don't get this part, at all. (But I didn't understand the purpose/implications of most parts of the OP.)

Why doesn't the AI have to understand human values, in your proposal?

In the OP, you state:

The point is that AI doesn't just value (X). AI makes sure that there exists a system that gives (X) the proper value. And that system has to have certain properties. If AI finds a solution that breaks the properties of that system, AI doesn't use this solution. That's the idea: AI can realize that some rewards are unjust because they break the entire reward system.

From the rest of your post, it seems clear that "proper value" means something like "value to humans". So it sure seems to me like the AI needs to understand human values in order to implement this kind of check.

Replies from: Q Home, Q Home
comment by Q Home · 2022-09-09T02:14:31.979Z · LW(p) · GW(p)

I checked out some of your posts (haven't read 100% of them): Learning Normativity: A Research Agenda [? · GW] and Non-Consequentialist Cooperation? [? · GW]

You draw a distinction between human values and human norms. For example, an AI can respect someone's autonomy before the AI gets to know their values and the exact amount of autonomy they want.

I draw the same distinction, but more abstract. It's a distinction between human values and properties of any system/task. AI can respect keeping some properties of its reward systems intact before it gets to know human values.

I think even in very simple games an AI could learn important properties of systems. Which would significantly help the AI to respect human values.

comment by Q Home · 2022-09-08T23:06:57.069Z · LW(p) · GW(p)

Here's the shortest formulation of my idea:

You can split possible effects of AI's actions into three domains. All of them are different (with different ideas), even though they partially intersect and can be formulated in terms of each other. Traditionally we focus on the first two domains:

  1. (Not) accomplishing a goal. Utility functions are about this.
  2. (Not) violating human values. Models of human feedback are about this.
  3. (Not) modifying a system without breaking it. Impact measures [? · GW] are about this.

My idea is about combining all of this (mostly 2 and 3) into a single approach, or generalizing ideas for the third domain. There aren't a lot of ideas for the third one, as far as I know. Maybe people are not aware enough of that domain.

Why doesn't the AI have to understand human values, in your proposal?

I meant that some AIs need to start with understanding human values (perfectly) and others don't. Here's an analogy:

  1. Imagine a person who respects laws. She ends up in a foreign country. She looks up the laws. She respects and doesn't break them. She has an abstract goal that depends on what she learns about the world.
  2. Imagine a person who respects "killing people". She ends up in a foreign country. She looks up the laws. She doesn't break them for some time. She accumulates power. Then she breaks all the laws and kills everyone. She has a particular goal that doesn't depend on anything she learns.

The point of my idea is to create an AI that respects abstract laws of systems, abstract laws of tasks. The AI of the 1st type. (Of course, in reality the distinction isn't black and white, but the difference still exists.)

Replies from: abramdemski
comment by abramdemski · 2022-09-09T15:54:04.083Z · LW(p) · GW(p)

This is just my intuition, but it seems like the core intuition of a "money system" as you use it in the post is the same as the core intuition behind utility functions (ie, everything must have a price, everything must have a quantifiable utility).

I think we can try to solve AI Alignment this way:

Model human values and objects in the world as a "money system" (a system of meaningful trades). Make the AGI learn the correct "money system", and specify some obviously incorrect "money systems" for it to avoid.

Basically, you ask the AI to "make paperclips that have the value of paperclips for humans". The AI can do anything, using all the power in the Universe, but killing everyone is not an option: paperclips can't be more valuable than humanity. Money analogy: if you killed everyone (and destroyed everything) to create some dollars, those dollars wouldn't be worth anything. So you wouldn't have actually gained any money at all.

In utility-theoretic terms, this is like saying that money is an instrumental goal, not a terminal goal. Or at least, money as-terminal-goal has a low weight compared to other things (eg, human lives). Or perhaps more faithful to what you want: money as-terminal-goal is dependent on a context.[1][2]

So it seems to me like this still faces the same basic challenges as most other approaches, IE, making the system robustly care about external objects which we can't get perfect feedback about. How do you get it to care about the context? How do you get it to think killing humans is "expensive"? How do you ask the system to "make paperclips that have the value of paperclips for humans"?

I meant that some AIs need to start with understanding human values (perfectly) and others don't.

It seems like any proponent of #2 (human feedback, aka, value learning) would already agree with this idea; whereas your post gave me the sense that you think something more radical is here. 

Reiterating the quote from the OP that I quoted before: 

The point is that AI doesn't just value (X). AI makes sure that there exists a system that gives (X) the proper value. And that system has to have certain properties. If AI finds a solution that breaks the properties of that system, AI doesn't use this solution. That's the idea: AI can realize that some rewards are unjust because they break the entire reward system.

My best guess about how you want to combine #1 and #2 with #3 is that you want to infer the proper value of things from the environment. EG, if most gold sits around in vaults, then the value of gold is probably tied to sitting around in vaults.

I remember some work a few years ago on this approach -- specifically, using the built environment of humans (together with an assumption that humans are fairly good at optimizing for their own preferences) to infer human values. Sadly, I'm unable to find a reference; maybe it was never published? (Probably I've just forgotten the relevant keywords to search for)

  1. ^

    The distinction between instrumental goals vs "terminal goals that depend on some context" is rather blurry, because the way we distinguish between terminal and instrumental goals (from the outside, behaviorally) is how much they vary based on context. (EG, if I take away the other basketball players, the audience, and the money, will one basketball player still try to perform a slam dunk?)

  2. ^

    One reason for abandoning utility functions is, perhaps, an instinct that everything must be instrumental, because nothing is truly terminal. I discussed how to do this while keeping most of expected utility theory in An Orthodox Case Against Utility Functions [LW · GW].

Replies from: Q Home
comment by Q Home · 2022-09-10T06:31:16.382Z · LW(p) · GW(p)

The AI doesn't have to know the precise price of everything. The AI needs to make sure that a price doesn't break the desired properties of a system. If paperclips are worth more than everything else in the universe, that would destroy almost any system. So this price is unlikely to be good.

Or perhaps more faithful to what you want: money as-terminal-goal is dependent on a context.[1][2]

So it seems to me like this still faces the same basic challenges as most other approaches, IE, making the system robustly care about external objects which we can't get perfect feedback about. How do you get it to care about the context? How do you get it to think killing humans is "expensive"? How do you ask the system to "make paperclips that have the value of paperclips for humans"?

There are two questions to ask:

  1. How does the AI learn to care about this?
  2. What do we gain by making the AI care about this?

If we don't discuss 100% answers, it's very important to evaluate all those questions in context of each other. I don't know the (full) answer to the question (1). But I know the answer to (2) and a way to connect it to (1). And I believe this connection makes it easier to figure out (1).

The point of my idea is that "human (meta-)ethics" is just a subset of a way broader topic. You can learn a lot about human ethics and the way humans expect you to fulfill their wishes before you encounter any humans or start to think about "values". So, we can replace the questions "how to encode human values?" and even "how to learn human values?" with more general questions "how to learn (properties of systems)?" and "how to translate knowledge about (properties of systems) to knowledge about human values?"

In your proposal about normativity you do a similar "trick":

  • You say that we can translate the method of learning language into a method of learning human values. (But language can be as complicated as human values themselves and you don't say that we can translate results of learning a language into moral rules.)
  • I say that we can translate the method of learning properties of simple systems into a method of learning human values (a complicated system). And I say that we can translate results of learning those simple systems into human moral rules. And that there're analogies of many important complicated properties (such as "corrigibility") in simple systems.

So, I think this frame has a potential to make the problem a lot easier. Many approaches assume that you should start with learning the complicated system (values) and there's nothing else you can do.

It seems like any proponent of #2 (human feedback, aka, value learning) would already agree with this idea; whereas your post gave me the sense that you think something more radical is here.

In a way my idea is more radical: we don't start with encoding human values, but we don't start with "value learning" either.

I remember some work a few years ago on this approach -- specifically, using the built environment of humans (together with an assumption that humans are fairly good at optimizing for their own preferences) to infer human values. Sadly, I'm unable to find a reference; maybe it was never published? (Probably I've just forgotten the relevant keywords to search for)

I think it's a different approach, because we don't have to start with human values (we could start with trying to fix universal AI "bugs" [LW · GW]) and we don't have to assume optimization.

My best guess about how you want to combine #1 and #2 with #3 is that you want to infer the proper value of things from the environment. EG, if most gold sits around in vaults, then the value of gold is probably tied to sitting around in vaults.

I explained how I want to combine those in the context of "What do we gain by caring about system properties?" question.

Now to the "How does AI learn to care about (reward) system properties?" question. Here I don't have clear answers, only ideas. But I believe it's a simpler question (compared to the one about human values). The AI needs to do two things:

  1. Learn properties of systems. (Starting with very simple systems.)
  2. Translate properties between different systems.

Maybe it's useful to split the knowledge about systems into 3 parts:

  1. Absolute knowledge: e.g. "taking absolute control of the system will destroy its (X) property", "destroying the (X) property of the system may be bad". This knowledge connects abstract actions to simple facts and tautologies.
  2. Experience of many systems: e.g. "destroying the (X) property of this system is likely to be bad because it's bad for many other systems" or "destroying (X) is likely to be bad because I'm 90% sure human doesn't ask me to do the type of task where destroying (X) is allowed".
  3. Biases of a specific system: e.g. "for this specific system, "absolute control" means controlling about 90% of it". This knowledge maps abstract actions/facts onto the structure of a specific system.

I have no idea how to learn 1 and 2. But I have an idea about 3 and/or a way to make 3 equivalent to 1 and 2. A way to make "learning biases" somewhat equivalent to "learning properties of systems". Here I'm trying to do the same trick I did before: split a question, find the easier part, attack the harder part through the easier one.

How to make "learning biases" somewhat equivalent to "learning properties of systems"? I have those vague ideas ("Thought experiments. Recap" [LW(p) · GW(p)] and "Rationality misses something?" [LW · GW]) from the post:

  • Take a system (e.g. "movement of people"). Model simplified versions of this system on multiple levels (e.g. "movement of groups" and "movement of individuals"). Take a property of the system (e.g. "freedom of movement"). Describe a biased aggregation of this property on different levels. Choose actions that don't violate this aggregation.
  • Take an element of the system (e.g. "sweets") and its properties (e.g. "you can eat sweets, destroy sweets, ignore sweets..."). Describe other elements in terms of this element. Choose actions that don't contradict this description.

I believe there's a somewhat Bayesian-like way to think about this.

EG, if most gold sits around in vaults, then the value of gold is probably tied to sitting around in vaults.

I want to give a specific example with a simple system. I'm not saying this example shows how to solve everything, it just lets me illustrate some ideas.

  • AI is in a room with 5 coins (in a video game). Each coin gives some reward points and respawns after some time. AI needs to collect 100 reward points.
  • AI models the system ("coins") on two levels: "a single coin" (level 1) and "multiple coins" (level 2).
  • AI finds a glitch to keep respawning a single coin very fast. AI gets "punished" for this.
  • AI thinks "it's probably because I was getting reward only from level 1". So, now the AI tries to balance glitching and collecting multiple coins.
  • AI finds another bug: a way to modify the values of coins. But the AI doesn't try to make the value of a single coin 100 points, because that would probably be the same mistake: accumulating all reward on level 1.
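
To make the coin example concrete, here is a minimal sketch of the kind of check I mean (the environment, numbers and function names are invented for illustration):

```python
# Toy model of the coin room: reward can be attributed to level 1 (a single coin)
# or level 2 (the population of coins).

def reward_by_level(collected):
    """collected: list of coin ids the agent collected, e.g. [0, 0, 0, 1, 2]."""
    per_coin = {}
    for coin in collected:
        per_coin[coin] = per_coin.get(coin, 0) + 1
    level_1 = max(per_coin.values())  # how much reward one coin contributed
    level_2 = len(per_coin)           # how many distinct coins contributed
    return level_1, level_2

def looks_like_reward_hack(collected, total_coins=5) -> bool:
    # Toy rule: if (almost) all reward comes from a single coin while the other
    # coins are ignored, the reward is concentrated on level 1 -- the same kind
    # of mistake as the respawn glitch or rewriting a coin's value to 100.
    level_1, level_2 = reward_by_level(collected)
    return level_2 <= 1 and level_1 > total_coins

print(looks_like_reward_hack([0] * 100))             # True: the respawn-glitch pattern
print(looks_like_reward_hack([0, 1, 2, 3, 4] * 4))   # False: balanced collection
```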
Replies from: abramdemski
comment by abramdemski · 2022-09-11T08:52:11.465Z · LW(p) · GW(p)

There are two questions to ask:

  1. How does the AI learn to care about this?
  2. What do we gain by making the AI care about this?

If we don't discuss 100% answers, it's very important to evaluate all those questions in context of each other. I don't know the (full) answer to the question (1). But I know the answer to (2) and a way to connect it to (1). And I believe this connection makes it easier to figure out (1).

I agree with the overall argument structure to some extent. IE, in general, we should separate the question of what we gain from X from the question of how to achieve it, and not having answered one of those questions should not block us from considering the other.

However, to me, your "what do we gain" claims are already established to be quite large. In the dialogues (about candy and movement), it seems like the idea is that everything works out nicely, in full generality. You aren't just claiming a few good properties; you seem to be saying "and so on".

(To be more specific to avoid confusion, you aren't only claiming that valuing candy doesn't result in killing humans or hacking human values. You also seem to be saying that valuing candy in this way wouldn't throw away any important aspect of human values at all. The candy-AI wouldn't set human quality of life to dirt-poor levels, even if it were instrumentally useful for diverting resources to ensure the daily availability of candy. The AI also wouldn't allow a preventable hostile invasion by candy-loving aliens-which-count-as-humans-by-some-warped-definition. etc etc etc)

Therefore, in this particular case, I have relatively little interest in further elaborating the "what do we gain" side of things. The "how are we supposed to gain it" question seems much more urgent and worthy of discussion.

To use an analogy, if you told me that they knew a quick way to make $20, I might ask "why are we so worried about getting $20?". But if you tell me you know a quick way to make a billion dollars, I'm going to be much less interested in the "why" question and much more interested in the "how" question.

 I don't know the (full) answer to the question (1). But I know the answer to (2) and a way to connect it to (1). And I believe this connection makes it easier to figure out (1).

TBH, I don't really believe this is true, because I don't think you've pinned down what "this" even is. IE, we can expand your set of two questions into three:

  1. How do we get X?
  2. What is X good for?
  3. What is X, even?

You've labeled X with terms like "reward economics" and "money system", but you haven't really defined those things. So your arguments about what we can gain from them are necessarily vague. As I mentioned before, the general idea of assigning a value (price) to everything is fully compatible with utility theory, but obviously you also further claim that your approach is not identical to utility theory. I hope this point helps illustrate why I feel your terms are still not sufficiently defined. 

(My earlier question took the form of "how do we get X", but really, that's because I was replying to a specific point rather than starting at the beginning. What I most need to understand better at the moment is 'what is X, even?'.)[1]

The point of my idea is that "human (meta-)ethics" is just a subset of a way broader topic. You can learn a lot about human ethics and the way humans expect you to fulfill their wishes before you encounter any humans or start to think about "values". So, we can replace the questions "how to encode human values?" and even "how to learn human values?" with more general questions "how to learn (properties of systems)?" and "how to translate knowledge about (properties of systems) to knowledge about human values?"

We have already to some extent replaced the question "how do you learn human values?" with the question "how do we robustly point at anything external to the system, at all?". One variation of this which we often consider is "how can a system reliably parse reality into objects" -- this is like John Wentworth's natural abstraction program. 

I don't know whether you think this is at all in the right direction (I'm not trying to claim it's identical to your approach or anything like that), but it currently seems to me more concrete and well-defined than your "how to learn properties of systems". 

with more general questions "how to learn (properties of systems)?"

The way you bracket this suggests to me that you think "how to learn" is already a fair summary, and "properties of systems" is actually pointing at something extremely general. Like, maybe "properties of systems" is really a phrase that encompasses everything you can learn? 

If this were the correct interpretation of your words, then my response would be: I'm not going to claim that we've entirely mastered learning, but it seems surprising to claim that studying how we learn about the properties of very simple systems (systems that we can already learn quite easily using modern ML?) would be the key. 

In your proposal about normativity you do a similar "trick"

[...]

I say that we can translate the method of learning properties of simple systems into a method of learning human values (a complicated system). 

Since you are relating this to my approach: I would say that the critical difference, for me, is precisely the human involvement (or more generally, the involvement of many capable agents). This creates social equilibria (and non-equilibrium behaviors) which form the core of normativity. 

An abstract decision-theoretic agent has no norms and no need for norms, in part because it treats its environment as nonliving, nonthinking, and entirely external. A single person existing over time already has a need for norms, because coordinating with yourself over time is hard.

But any system which contains agents is not "simple". Or at least, I don't understand the sense in which it is simple.

I think it's a different approach, because we don't have to start with human values (we could start with trying to fix universal AI "bugs" [LW · GW]) and we don't have to assume optimization.

I don't understand what you mean about not assuming optimization. But, I would object that the approach I mentioned (learning values from the environment) doesn't need to "start with human values" either. Hypothetically, you could try an approach like this with no preconceived concept of "human" at all; you just make a generic assumption that the environments you encounter have been optimized to a significant extent (by some as-yet-unknown actor).

Notably, this approach would have the obvious risk of the AI deciding that too many of the properties of the current world are "good" (for example, people dying, people suffering). On my understanding, your current proposal also suffers from this critique. (You make lots of arguments about how your ideas might help the AI to decide not to change things about the world; you make few-to-no arguments about such an AI deciding to actually improve the world in some way. Well, on my understanding so far.)

However, not killing all humans is such a big win that we can ignore small issues like that for now. Returning to my earlier analogy, the first question that occurs to me is where the billion dollars is coming from, not whether the billion will be enough.

I explained how I want to combine those in the context of "What do we gain by caring about system properties?" question.

In the context you're replying to, I was trying to propose more concrete ideas for your consideration, as opposed to reiterating what you said.

Here I'm trying to do the same trick I did before: split a question, find the easier part, attack the harder part through the easier one.

Although this will be appropriate (even necessary!) in some cases, the trick is a dangerous one in general. Often you want to tackle the harder sub-problems first, so that you fail as soon as possible. Otherwise, you can spend years on a research program that splits off the easiest fractions of your grand plan, only to realize later that the harder parts of your plan were secretly impossible. So the strategy sets you up to potentially waste a lot of time!

Maybe it's useful to split the knowledge about systems into 3 parts:

  1. Absolute knowledge: e.g. "taking absolute control of the system will destroy its (X) property", "destroying the (X) property of the system may be bad". This knowledge connects abstract actions to simple facts and tautologies.
  2. Experience of many systems: e.g. "destroying the (X) property of this system is likely to be bad because it's bad for many other systems" or "destroying (X) is likely to be bad because I'm 90% sure human doesn't ask me to do the type of task where destroying (X) is allowed".
  3. Biases of a specific system: e.g. "for this specific system, "absolute control" means controlling about 90% of it". This knowledge maps abstract actions/facts onto the structure of a specific system.

I don't really understand the motivation behind this division, but, it sounds to me like you require normative feedback to learn these types of things. You keep saying things like "is likely to be bad" and "is likely to be good". But it's difficult to see how to derive ideas about "bad" and "good" from pure observation with no positive/negative feedback. 

Take a system (e.g. "movement of people"). Model simplified versions of this system on multiple levels (e.g. "movement of groups" and "movement of individuals"). Take a property of the system (e.g. "freedom of movement"). Describe a biased aggregation of this property on different levels. Choose actions that don't violate this aggregation.

I don't understand much of what is going on in this paragraph.

Take an element of the system (e.g. "sweets") and its properties (e.g. "you can eat sweets, destroy sweets, ignore sweets..."). Describe other elements in terms of this element. Choose actions that don't contradict this description.

It sounds to me like you are trying to cross the is/ought divide -- first the ai learns descriptive facts about a system, and then, the ai is supposed to derive normative principles (action-choice principles) from those descriptive facts. Is that an accurate assessment?

One concern I have is that if the description is accurate enough, then it seems like it should either (a) not constrain action, because you've learned the true invariant properties of the system which can never be violated (eg, the true laws of physics); or, on the other hand, (b) constrain action for the entirely wrong reasons.

An example of (b) would be if the learning algorithm learns enough to fully constrain actions, based on patterns in the AI actions so far. Since the AI is part of any system it is interacting with, it's difficult to rule out the AI learning its own patterns of action. But it may do this early, based on dumb patterns of action. Furthermore, it may misgeneralize the actions so far, "wrongly" thinking that it takes actions based on some alien decision procedure. Such a hypothesis will never be ruled out in the future, and indeed is liable to be confirmed, since the AI will make its future acts conform to the rules as it understands them.

AI models the system ("coins") on two levels: "a single coin" (level 1) and "multiple coins" (level 2).

I don't really understand what it means to model the system on each of these levels, which harms my understanding of the rest of this argument. ("How can you model the system as a single coin?")

My attempt to translate things into terms I can understand is: the AI has many hypotheses about what is good. Some of these hypotheses would encourage the AI to exploit glitches. However, human feedback about what's good has steered the system away from some glitch-exploits in the past. The AI probabilistically generalizes this idea, to avoid exploiting behaviors of the system which seem "glitch-like" according to its understanding.

But, this interpretation seems to be a straightforward value-learning approach, while you claim to be pointing at something beyond simple value learning ideas.

  1. ^

    After finishing this long comment, I noticed the inconsistency: I continue to ask "how do we get X?" type questions rather than "what is X?" type questions. In retrospect, I don't like my "billion dollars" analogy as much as I did when I first wrote it. Part of the problem is that when "X" is still fuzzy, it can shift locations in the causal chain as we focus on different aspects of the conversation. So for example, X could point to the "money system", or X could end up pointing to some desirable properties which are upstream/downstream of "money systems". But as X shifts up/downstream, there are some Y which switch between "how-relevant" and "why-relevant". (Things that are upstream of X are how-relevant; things that are downstream of X are why-relevant.) So it doesn't make sense for me to keep mentioning that I'm more interested in how-questions than why-questions, when I'm not sure exactly where the definition of X will sit in the causal chain. I should, at best, have some other reasons for not being very interested in certain questions. But I don't want to re-write the relevant portions of what I wrote. It still represents my epistemic state better than not having written it.

Replies from: Q Home
comment by Q Home · 2022-09-12T06:01:21.307Z · LW(p) · GW(p)

Although this will be appropriate (even necessary!) in some cases, the trick is a dangerous one in general. Often you want to tackle the harder sub-problems first, so that you fail as soon as possible. Otherwise, you can spend years on a research program that splits off the easiest fractions of your grand plan, only to realize later that the harder parts of your plan were secretly impossible. So the strategy sets you up to potentially waste a lot of time!

I think we have slightly different tricks in mind: I'm thinking about a trick that any idea performs. It's like solving an equation with an unknown: no matter what you do, you split and recombine it in some way.

Or you could compare it to Iterated Distillation and Amplification: you try to reproduce the content of a more complicated thing in a simpler one.

Or you could compare it to scientific theories: science still hasn't answered "why do things move?", but it has split the question into subatomic pieces.

So with this strategy, the smaller the piece you cut off, the better, because we're not talking about independent pieces.

TBH, I don't really believe this is true, because I don't think you've pinned down what "this" even is.

You've labeled X with terms like "reward economics" and "money system", but you haven't really defined those things. So your arguments about what we can gain from them are necessarily vague.

I think a definition doesn't matter for believing (or not believing) in this. And it's specific enough without a definition. I believe this:

  1. There exist similar statements outside of human ethics/values which can be easily charged with human ethics/values. Let's call them "X statements". An X statement is "true" when it's true for humans.
  2. X statements are more fine-grained and specific than moral statements, but equally broad. Which means "for 1 moral statement there are 10 true X statements" (numbers are arbitrary) or "for 1 example of a human value there are 10 examples of an X statement being true" or "for 10 different human values there are 10 versions of the same X statement" or "each vague moral statement corresponds to a more specific X statement". X statements have higher "connectivity".

To give an example of a comparison between moral and X statements:

"Human asked you to make paperclips. Would you turn the human into paperclips? Why not?"

  1. Goal statement: "not killing the human is a part of my goal".
  2. Moral statements: "because life/personality/autonomy/consent is valuable". (what is "life/personality/autonomy/consent"?)
  3. X statements: "if you kill, you give the human less than human asked", "destroying the causal reason of your task is often meaningless", "inanimate objects can't be worth more than lives in many economies", "it's not the type of task where killing would be an option", "killing humans destroys the value of paperclips: humans use them", "reaching states of no return often should be avoided" (Impact Measures [? · GW]).

X statements are applicable outside of human ethics/values; there are more of them, and they're more specific, especially in the context of each other. (Meanwhile, values can be hopeless to define: you don't even know where to start in defining values, and adding more values only makes everything more complicated.)
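To make the "connectivity" point a bit more concrete, here is a minimal toy sketch. Everything in it (the plan format, the predicate names) is invented purely for illustration; it's not a proposed algorithm, just a picture of several X statements independently vetoing the same plan:

```python
# A toy sketch: X statements as independent checks on a proposed plan.
# Everything here (the plan format, the predicate names) is a made-up
# illustration of the idea, not a real algorithm.

def gives_less_than_asked(plan):
    # "if you kill, you give the human less than the human asked"
    return plan.get("kills_requester", False)

def destroys_cause_of_task(plan):
    # "destroying the causal reason of your task is often meaningless"
    return plan.get("destroys_task_source", False)

def objects_outvalue_lives(plan):
    # "inanimate objects can't be worth more than lives in many economies"
    return plan.get("trades_lives_for_objects", False)

def point_of_no_return(plan):
    # "reaching states of no return often should be avoided"
    return plan.get("irreversible", False)

X_STATEMENTS = [gives_less_than_asked, destroys_cause_of_task,
                objects_outvalue_lives, point_of_no_return]

def violated_statements(plan):
    """Return the X statements a plan violates."""
    return [check.__name__ for check in X_STATEMENTS if check(plan)]

# "Turn the human into paperclips" trips several X statements at once:
bad_plan = {"kills_requester": True, "destroys_task_source": True,
            "trades_lives_for_objects": True, "irreversible": True}
print(violated_statements(bad_plan))  # all four checks fire

# "Make paperclips from spare wire" trips none of them:
print(violated_statements({}))        # []
```

The point of the sketch is only the connectivity: the same question ("why not turn the human into paperclips?") gets several independent X-statement answers at once, while the corresponding moral statement would be a single vague veto.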

To not believe in my idea (or to consider it "too vague"), you need to deny the similarity between X statements or deny their properties.

But I think the idea of X statements should be acknowledged anyway. At least as a hypothetical possibility.

...

Here are some answers to questions and thoughts from your reply:

  • I didn't understand your answer about normativity (involvement of agents), but I wanted to say this: I believe X statements are more fine-grained and specific (but equally broad) compared to statements about normativity.
  • Yes, we need human feedback to "charge" X statements with our values and ethics. But X statements are supposed to be more easily charged compared to other things.
  • X statements don't abolish the is/ought divide, but they're supposed to narrow it down.
  • Maybe X statements are compatible with utility theory and can be expressed in it. But that doesn't mean "utility theory statements" have the same good properties. In the same way, you could try to describe intuitions about ethics using precise goals, but the "intuitions" would still have better properties.
  • You can apply value learning methods outside of human ethics/values, but it doesn't mean that "value learning statements" have the same good properties as X statements. That's one reason to divide "How do we learn this?" and "What do we gain by learning it?" questions.

I didn't understand upstream/downstream and "how-relevant"/"why-relevant" distinctions, but I hope I answered enough for now.

We have already to some extent replaced the question "how do you learn human values?" with the question "how do we robustly point at anything external to the system, at all?". One variation of this which we often consider is "how can a system reliably parse reality into objects" -- this is like John Wentworth's natural abstraction program.

I don't know whether you think this is at all in the right direction (I'm not trying to claim it's identical to your approach or anything like that), but it currently seems to me more concrete and well-defined than your "how to learn properties of systems".

I think X statements have better properties compared to "statements about external objects". And it's easier to distinguish external objects from internal objects using X statements, because internal objects have many weird properties.


I described the idea of X statements above. But those statements need to be described in some language or created by some process. I have some ideas about this language/process, and my answers below are mostly about it:

I don't really understand the motivation behind this division, but it sounds to me like you require normative feedback to learn these types of things.

The division was for splitting and recombining parts of the is–ought problem:

  1. To even think/care that "harming people may be bad", the AI needs to be able to form such statements in its moral core.
  2. To verify whether harming people is bad or not, the AI needs a channel of feedback that can reach its moral core.
  3. Once the AI has verified that "harming people is bad", it needs to understand how much "harm" counts as harm. Abstract statements may need some fine-tuning to fit the real world.

I think we can make point 3 equivalent to points 1 and 2: we can make fine-tuning of abstract "ought" statements equivalent to forming them. Or something to that effect.

Take a system (e.g. "movement of people"). Model simplified versions of this system on multiple levels (e.g. "movement of groups" and "movement of individuals"). Take a property of the system (e.g. "freedom of movement"). Describe a biased aggregation of this property on different levels. Choose actions that don't violate this aggregation.

I don't understand much of what is going on in this paragraph.

It's a restatement of the "Motion is the fundamental value" [LW · GW] thought experiment. You have an environment with many elements on different scales (e.g. micro- and macro-organisms). Those elements have a property: freedom of movement. This property exists on different scales (e.g. microorganisms do both small-scale and large-scale movement).

The "fundamental value" of this environment is described by an aggregation of this property over multiple scales. To learn this value means to learn how it's distributed over different scales of the environment.

I don't really understand what it means to model the system on each of these levels, which harms my understanding of the rest of this argument. ("How can you model the system as a single coin?")

Sorry for the confusion. Maybe it's better to say that the AI cuts its model of the environment into multiple scales. A single coin (i.e. taking a single coin) is the smallest scale.

My attempt to translate things into terms I can understand is: the AI has many hypotheses about what is good. Some of these hypotheses would encourage the AI to exploit glitches. However, human feedback about what's good has steered the system away from some glitch-exploits in the past. The AI probabilistically generalizes this idea, to avoid exploiting behaviors of the system which seem "glitch-like" according to its understanding.

Yes, the AI has hypotheses, but those hypotheses should have specific properties. Those properties are the key part.

"I should avoid behavior which seems glitch-like" hypothesis has awful properties: it can't be translated into human ethics (when AI grows up) and may age like milk when AI becomes smarter and "glitch-like" notion changes.

A process that generates such hypotheses doesn't generate X statements.

An example of (b) would be if the learning algorithm learns enough to fully constrain actions, based on patterns in the AI actions so far. Since the AI is part of any system it is interacting with, it's difficult to rule out the AI learning its own patterns of action. But it may do this early, based on dumb patterns of action. Furthermore, it may misgeneralize the actions so far, "wrongly" thinking that it takes actions based on some alien decision procedure. Such a hypothesis will never be ruled out in the future, and indeed is liable to be confirmed, since the AI will make its future acts conform to the rules as it understands them.

Could you give a specific example? If I understand correctly: the AI destroys some paintings while doing something and learns that "paintings are things you can destroy for no reason". I want to note that human feedback is allowed.