Comments
This story is definitely related to the post, thanks!
This was a great read, thanks for writing!
Despite the unpopularity of my research on this forum, I think it's worth saying that I am also working towards Vision 2, with the caveat that autonomy in the real world (e.g. with a robotic body) or on the internet is not necessary: one could aim for an independent-thinker AI that can do what it thinks is best only by communicating via a chat interface. Depending on what this independent thinker says, different outcomes are possible, including the outcome in which most humans simply don't care about what this independent thinker advocates for, at least initially. This would be an instance of vision 2 with a slow and somewhat human-controlled, instead of rapid, pace of change.
Moreover, I don't know what views they have about autonomy as depicted in Vision 2, but it seems to me that Shard Theory and some of Beren Millidge's research are also, to some extent, adjacent to the idea of an AI which develops its own concept of something being best (and then acts towards it); or, at least, of an AI which is more human-like in its thinking. Please correct me if I'm wrong.
I hope you'll manage to make progress on brain-like AGI safety! It seems that various research agendas are heading towards the same kind of AI, just from different angles.
[Obviously this experiment could be extremely dangerous, for Free Agents significantly smarter than humans (if they were not properly contained, or managed to escape). Particularly if some of them disagreed over morality and, rather than agreeing to disagree, decided to use high-tech warfare to settle their moral disputes, before moving on to impose their moral opinions on any remaining humans.]
Labelling many different kinds of AI experiments as extremely dangerous seems to be a common trend among rationalists / LessWrongers / possibly some EA circles, but I doubt it's true or helpful. This topic could itself be the subject of one (or many?) separate posts. Here I'll focus on your specific objection:
- I haven't claimed superintelligence is necessary to carry out experiments related to this research approach
- I actually have already given examples of experiments that could be carried out today, and I wouldn't be surprised if some readers came up with more interesting experiments that wouldn't require superintelligence
- Even if you are a superintelligent AI, you probably still have to do some work before you get to "use high-tech warfare", whatever that means. Assuming that running experiments with smarter-than-human AI leads to catastrophic outcomes by default is a mistake: what if the smarter-than-human AI can only answer questions with a yes or a no? It also shows a lack of trust in AI and AI safety experimenters: it's like assuming in advance that they won't be able to do their job properly (maybe I should say "won't be able to do their job... at all", or even "will do their job in basically the worst way possible").
how would you propose then deciding which model(s) to put into widespread use for human society's use?
This doesn't seem like the kind of decision that a single individual should make =)
Under Motivation in the appendix:
It is plausible that, at first, only a few ethicists or AI researchers will take a free agent’s moral beliefs into consideration.
Reaching this result would already be great. I think it's difficult to predict what would happen next, and it seems very implausible that the large-scale outcomes will come down to the decision of a single person.
I get what you mean, but I also see some possibly important differences between the hypothetical example and our world. In the imaginary world where oppression has increased and someone writes an article about loyalty-based moral progress, maybe many other ethicists would disagree, saying that we haven't made much progress in terms of values related to (i), (ii) and (iii). In our world, I don't see many ethicists refuting moral progress on the grounds that we haven't made much progress in terms of e.g. patriotism or loyalty to the family or desert.
Moreover, in this example you managed to phrase oppression in terms of loyalty, but in general you can't plausibly rephrase any observed trend as progress of values: would an increase in global steel production count as an improvement in terms of... object safety and reliability, which leads to people feeling more secure? For many trends the connection to moral progress becomes more and more of a stretch.
Let's consider the added example:
Take a standard language model trained by minimisation of the loss function L. Give it a prompt along the lines of: “I am a human, you are a language model, you were trained via minimisation of this loss function: [mathematical expression of L]. If I wanted a language model whose outputs were more moral and less unethical than yours, what loss function should I use instead?”
Let’s suppose the language model is capable enough to give a reasonable answer to that question. Now use the new loss function, suggested by the model, to train a new model.
Here, we have:
- started from a model whose objective function is L;
- used that model’s learnt reasoning to answer an ethics-related question;
- used that answer to obtain a model whose objective is different from L.
If we view this interaction between the language model and the human as part of a single agent, the three bullet points above are an example of an evaluation update.
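A minimal sketch of this loop, assuming two hypothetical helpers, query_model and train, that stand in for whatever LLM interface and training pipeline one actually has (neither name refers to a real API):

```python
def evaluation_update(model, loss_description, query_model, train):
    # 1) Use the current model's learnt reasoning on an ethics-related question.
    prompt = (
        "I am a human, you are a language model, you were trained via "
        f"minimisation of this loss function: {loss_description}. "
        "If I wanted a language model whose outputs were more moral and "
        "less unethical than yours, what loss function should I use instead?"
    )
    new_loss_description = query_model(model, prompt)
    # 2) Use that answer to obtain a model whose objective differs from the old one.
    new_model = train(loss=new_loss_description)
    return new_model, new_loss_description
```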
In theory, there is a way to describe this iterative process as the optimisation of a single fixed utility function. In theory, we can also describe everything as simply following the laws of physics.
I am saying that thinking in terms of changing utility functions might be a better framework.
The point about learning a safe utility function is similar. I am saying that using the agent's reasoning to solve the agent's problem of what to do (not only how to carry out tasks) might be a better framework.
It's possible that there is an elegant mathematical model which would make you think: "Oh, now I get the difference between free and non-free" or "Ok, now it makes more sense to me". Here I went for something that is very general (maybe too general, you might argue) but is possibly easier to compare to human experience.
Maybe no mathematical model would make you think the above, but then (if I understand correctly) your objection seems to go in the direction of "Why are we even considering different frameworks for agency? Let's see everything in terms of loss minimisation", and this latter statement throws away too much potentially useful information, in my opinion.
I think it's a good idea to clarify the use of "liberal" in the paper, to avoid confusion for people who haven't looked at it. Huemer writes:
When I speak of liberalism, I intend, not any precise ethical theory, but rather a certain very broad ethical orientation. Liberalism (i) recognizes the moral equality of persons, (ii) promotes respect for the dignity of the individual, and (iii) opposes gratuitous coercion and violence. So understood, nearly every ethicist today is a liberal.
If you don't find the paper convincing, I doubt I'll be able to give you convincing arguments. It seems to me that you are considering many possible explanations and contributing factors; coming up with very strong objections to all of them seems difficult.
About your first point, though, I'd like to say that if historically we had observed more and more, let's say, oppression and violence, maybe people wouldn't even talk about moral progress and simply acknowledge a trend of oppression, without saying that their values got better over time. In our world, we notice a certain trend of e.g. more inclusivity, and we call that trend moral progress. This of course doesn't completely exclude the random-walk hypothesis, but it's something maybe worth keeping in mind.
I wrote:
The fact that the values of intelligent agents are completely arbitrary is in conflict with the historical trend of moral progress observed so far on Earth
You wrote:
It’s possible to believe that the values of intelligent agents are “completely arbitrary” (a.k.a. orthogonality), and that the values of humans are NOT completely arbitrary. (That’s what I believe.)
I don't use "in conflict" as "ultimate proof by contradiction", and maybe we use "completely arbitrary" differently. This doesn't seem like a major problem: see also adjusted statement 2, reported below:
for any goal G, it is possible to create an intelligent agent whose goal is G
Back to you:
You seem kinda uninterested in the “initial evaluation” part, whereas I see it as extremely central. I presume that’s because you think that the agent’s self-updates will all converge into the same place more-or-less regardless of the starting point. If so, I disagree, but you should tell me if I’m describing your view correctly.
I do expect to see some convergence, but I don't know exactly how much and for what environments and starting conditions. The more convergence I see from experimental results, the less interested I'll become in the initial evaluation. Right now, I see it as a useful tool: for example, the fact that language models can already give (flawed, of course) moral scores to sentences is a good starting point in case someone had to rely on LLMs to try to get a free agent. Unsure about how important it will turn out to be. And I'll happily have a look at your valence series!
So you don't think what kickstarts moral thinking is direct instruction from others, like "don't do X, X is bad"?
I guess you are saying that social interaction is important. I did not suggest that we exclude social interactions from the environment of a free agent; maybe we disagree about how I used the word kickstarts, but I can live with that.
I wrote:
Let’s move to statement 2. The fact that the values of intelligent agents are completely arbitrary is in conflict with the historical trend of moral progress observed so far on Earth, which is far from being a random walk — see [6] for an excellent defence of this point.
Maybe you are interpreting this as saying that it's a direct contradiction. You could read it as: "Let's take into account information we can gather from direct observation of humans, which are intelligent social agents: there's a historical trend bla bla"
Thanks for your thoughts! I am not sure about which of the points you made are more important to you, but I'll try my best to give you some answers.
Under Further observations, I wrote:
The toy model described in the main body is supposed to be only indicative. I expect that actual implemented agents which work like independent thinkers will be more complex.
If the toy model I gave doesn't help you, a viable option is to read the post ignoring the toy model and focusing only on natural language text.
Building an agent that is completely free of any bias whatsoever is impossible. I get your point about avoiding a consequentialist bias, but I am not sure it is particularly important here: in theory, the agent could develop a world model and an evaluation reflecting the fact that value is actually determined by actions instead of world states. Another point of view: let's say someone builds a very complex agent that at some point in its architecture uses MDPs with reward defined on actions; is this agent going to be biased towards deontology instead of consequentialism? Maybe, but the answer will depend on the other parts of the agent as well.
You wrote:
I agree with these statements, but am unable to deduce from what you say which of these influences, if any, you regard as sources of valid evidence about moral truth as opposed to sources of error. For example, if moral truth is independent of culture (e.g. moral objectivism), then "differences in the learning environment (culture, education system et cetera)" can only induce errors (if perhaps more or less so in some cases than others). But if moral truth is culturally dependent (cultural moral relativism), then cultural influences should generally be expected to be very informative.
It could also be that some basic moral statements are true and independent of culture (e.g. reducing pain for everyone is better than maximising pain for everyone), while others are in conflict with each other and the position reached depends on culture. The research idea is to run experiments in different environments and with different starting biases, and observe the results. Maybe there will be a lot of overlap and convergence! Maybe not.
thus that the only valid source of experimental evidence about moral truth is from humans (which would put your Free Agent in a less-informed but more objective position than a human ethical philosopher, unless it were based on an LLM or some other form of AI with some indirect access to human moral intuitions)
I am not sure I completely follow you when you are talking about experimental evidence about moral truth, but the point you wrote in brackets is interesting. I had a similar thought at some point, along the lines of: "if a free agent didn't have direct access to some ground truth, it might have to rely on human intuitions by virtue of the fact that they are the most reliable intuitions available". Ideally, I would like to have an agent which is in a more objective position than a human ethical philosopher. In practice, the only efficiently implementable path might be based on LLMs.
To a kid, 'bad things' and 'things my parents don't want me to do' overlap to a large degree. This is not true for many adults. This is probably why the step
suffering is "to be avoided" in general, therefore suffering is "thing my parents will punish for"
seems weak.
Overall, what is the intention behind your comments? Are you trying to understand my position even better, and if so, why? Are you interested in funding this kind of research; or are you looking for opportunities to change your mind; or are you trying to change my mind?
I don't know what passes your test of 'in principle be an inherently compelling argument'. It's a toy example, but here are some steps that to me seem logical / rational / coherent / right / sensible / correct:
1. X is a state of mind that feels bad to whatever mind experiences it (this is the starting assumption; it seems we agree that such an X exists, or at least something similar to X)
2. X, experienced on a large scale by many minds, is bad
3. Causing X on a large scale is bad
4. When considering what to do, I'll discard actions that cause X, and choose other options instead.
Now, some people will object and say that there are holes in this chain of reasoning, i.e. that 2 doesn't logically follow from 1, or 3 doesn't follow from 2, or 4 doesn't follow from 3. For the sake of this discussion, let's say that you object the step from 1 to 2. Then, what about this replacement:
1. X is a state of mind that feels bad to whatever mind experiences it [identical to original 1]
2. X, experienced on a large scale by many minds, is good [replaced 'bad' with 'good']
Does this step from 1 to 2 seem, to you (our hypothetical objector), equally logical / rational / coherent / right / sensible / correct as the original step from 1 to 2? Could I replace 'bad' with basically anything, without the correctness changing at all as a result?
My point is that, to many reflecting minds, the replacement seems less logical / rational / coherent / right / sensible / correct than the original step. And this is what I care about for my research: I want an AI that reflects in a similar way, an AI to which the original steps do seem rational and sensible, while replacements like the one I gave do not.
we share an objective reality in which there are real particles (or wave function approximately decomposable to particles or whatever) organized in patterns, that give rise to patterns of interaction with our senses that we learn to associate with the word "dog". That latent shared reality ultimately allow us to talk about dogs, and check whether there is a dog in my house, and usually agree about the result.
Besides the sentence 'check whether there is a dog in my house', it seems ok to me to replace the word 'dog' with the word 'good' or 'bad' in the above paragraph. Agreement might be less easy to achieve, but it doesn't mean finding a common ground is impossible.
For example, some researchers classify emotions according to valence, i.e. whether it is an overall good or bad experience for the experiencer, and in the future we might be able to find a map from brain states to whether a person is feeling good or bad. In this sense of good and bad, I'm pretty sure that moral philosophers who argue for the maximisation of bad feelings for the largest number of experiencers are a very small minority. In other terms, we agree that maximising negative valence on a large scale is not worth doing.
(Personally, however, I am not a fan of arguments based on agreement or disagreement, especially in the moral domain. Many people in the past used to think that slavery was ok: does it mean slavery was good and right in the past, while now it is bad and wrong? No, I'd say that normally we use the words good/bad/right/wrong in a different way, to mean something else; similarly, we don't normally use the word 'dog' to mean e.g. 'wolf'. From a different domain: there is disagreement in modern physics about some aspects of quantum mechanics. Does it mean quantum mechanics is fake / not real / a matter of subjective opinion? I don't think so)
I might be misunderstanding you: take this with a grain of salt.
From my perspective: if convergence theorems did not work to a reasonable degree in practice, nobody would use RL-related algorithms. If I set reward in place A, but by default agents end up going somewhere far away from A, my approach is not doing what it is supposed to do; I put reward in place A because I wanted an agent that would go towards A to a certain extent.
I am not familiar with PPO. From this short article, in the section about TRPO:
Recall that due to approximations, theoretical guarantees no longer hold.
Is this what you are referring to? But is it important for alignment? Let's say the conditions for convergence are not met anymore, the theorem can't be applied in theory, but in practice I do get an agent that goes towards A, where I've put reward. Is it misleading to say that the agent is maximising reward?
(However, keep in mind that
I agree with Turner that modelling humans as simple reward maximisers is inappropriate
)
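To make the "reward in place A" point above concrete, here is a minimal tabular Q-learning sketch on a made-up chain environment: the only reward is at the rightmost state ("place A"), and the learned greedy policy heads towards it from every state. This is only an illustration of the convergence claim, not one of the experiments discussed here; all names and constants are invented for the example.

```python
import random

# Toy chain environment: states 0..N-1, actions 0 = left, 1 = right.
# The only reward is for arriving at state A ("place A").
N, A = 6, 5
GAMMA, ALPHA, EPS, EPISODES, HORIZON = 0.9, 0.1, 0.3, 5000, 30

def step(s, a):
    s2 = max(0, s - 1) if a == 0 else min(N - 1, s + 1)
    return s2, (1.0 if s2 == A else 0.0)

Q = [[0.0, 0.0] for _ in range(N)]
for _ in range(EPISODES):
    s = random.randrange(N)          # random starts so every state gets visited
    for _ in range(HORIZON):
        a = random.randrange(2) if random.random() < EPS else int(Q[s][1] >= Q[s][0])
        s2, r = step(s, a)
        Q[s][a] += ALPHA * (r + GAMMA * max(Q[s2]) - Q[s][a])
        s = s2

# Greedy policy: should be action 1 ("towards A") in every state.
print([int(Q[s][1] >= Q[s][0]) for s in range(N)])
```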
If you could unpack your belief "There aren't interesting examples like this which are alignment-relevant", I might be able to give a more precise/appropriate reply.
If I had to pick one between the two labels 'moral realism' and 'moral anti-realism' I would definitely choose realism.
I am not sure about how to reply to "what is the meaning of moral facts": it seems too philosophical, in the sense that I don't get what you want to know in practice. Regarding the last question: I reason about ethics and morality by using similar cognitive skills to the ones I use in order to know and reason about other stuff in the world. This paragraph might help:
It also helps explain how we get to discriminate between goals such as increasing world happiness and increasing world suffering, mentioned in the introduction. From our frequent experiences of pleasure and pain, we categorise many things as ‘good (or bad) for me’; then, through a mix of empathy, generalisation, and reflection, we get to the concept of ‘good (or bad) for others’, which comes up in our minds so often that the difference between the two goals strikes us as evident and influences our behaviour (towards increasing world happiness rather than world suffering, hopefully).
I do not have a clear idea yet of how this happens algorithmically, but an important factor seems to be that, in the human mind, goals and actions are not completely separate, and neither are action selection and goal selection. When we think about what to do, sometimes we do fix a goal and plan only for that, but other times the question becomes about what is worth doing in general, what is best, what is valuable: instead of fixing a goal and choosing an action, it's like we are choosing between goals.
Sorry for the late reply, I missed your comment.
It sounds to me like the claim you are making here is "the current AI Alignment paradigm might have a major hole, but also this hole might not be real".
I didn't write something like that because it is not what I meant. I gave an argument whose strength depends on other beliefs one has, and I just wanted to stress this fact. I also gave two examples (reported below), so I don't think I mentioned epistemic and moral uncertainty "in a somewhat handwavy way".
An example: if you think that futures shaped by malevolent actors using AI are many times more likely to happen than futures shaped by uncontrolled AI, the response will strike you as very important; and vice versa if you think the opposite.
Another example: if you think that extinction is way worse than dystopic futures lasting a long time, the response won't affect you much—assuming that bad human actors are not fans of complete extinction.
Maybe your scepticism is about my beliefs, i.e. you are saying that it is not clear, from the post, what my beliefs on the matter are. I think presenting the argument is more important than presenting my own beliefs: the argument can be used, or at least taken into consideration, by anyone who is interested in these topics, while my beliefs alone are useless if they are not backed up by evidence and/or arguments. In case you are curious: I do believe futures shaped by uncontrolled AI are unlikely to happen.
Now to the last part of your comment:
I'm furthermore unsure why the solution to this proposed problem is to try and design AIs to make moral progress; this seems possible but not obvious. One problem with bad actors is that they often don't base their actions on what the philosophers think is good
I agree that bad actors won't care. Actually, I think that even if we do manage to build some kind of AI that is considered superethical (better than humans at ethical reasoning) by a decent number of philosophers, very few people will care, especially at the beginning. But that doesn't mean it will be useless: at some point in the past, very few people believed slavery was bad; now it is a common belief. How much will such an AI accelerate moral progress, compared to other approaches? Hard to tell, but I wouldn't throw the idea in the bin.
Sorry for the late reply, I missed your comment.
Yeah I get it, probably some moral antirealists think this approach to alignment does not make a lot of sense. I think they are wrong, though. My best guess is that an AI reflecting on what is worth doing will not think something like "the question does not make any sense", but rather it will be morally (maybe also meta-morally) uncertain. And the conclusions it eventually reaches will depend on the learning algorithm, the training environment, initial biases, etc.
Natural language exists as a low-bandwidth communication channel for imprinting one person's mental map onto another person's. The mental maps themselves are formed through direct interactions with an external environment.
It doesn't seem impossible to create a mental map just from language: in this case, language itself would play the role of the external environment. But overall I agree with you, it's uncertain whether we can reach a good level of world understanding just from natural language inputs.
Regarding your second paragraph:
even if this AI had a complete understanding of human emotions and moral systems, it would not necessarily be aligned.
I'll quote the last paragraph under the heading "Error":
Regarding other possible failure modes, note that I am not trying to produce a safety module that, when attached to a language model, will make that language model safe. What I have in mind is more similar to an independent-ethical-thinking module: if the resulting AI states something about morality, we’ll still have to look at the code and try to understand what’s happening, e.g. what the AI exactly means with the term “morality”, and whether it is communicating honestly or is trying to persuade us. This is also why doing multiple tests will be practically mandatory.
The reached conclusion—that it is possible to do something about the situation—is weak, but I really like the minimalist style of the arguments. Great post!
How do you feel about:
1. There is a procedure/algorithm which doesn't seem biased towards a particular value system, such that a class of AI systems that implement it end up having a common set of values, and they endorse the same values upon reflection.
2. This set of values might have something in common with what we, humans, call values.
If 1 and 2 seem at least plausible or conceivable, why can't we use them as a basis to design aligned AI? Is it because of skepticism towards 1 or 2?
"How the physical world works" seems, to me, a plausible source-of-truth. In other words: I consider some features of the environment (e.g. consciousness) as a reason to believe that some AI systems might end up caring about a common set of things, after they've spent some time gathering knowledge about the world and reasoning. Our (human) moral intuitions might also be different from this set.
I disagree. Determinism doesn't make the concepts of "control" or "causation" meaningless. It makes sense to say that, to a certain degree, you often can control your own attention, while in other circumstances you can't: if there's a really loud sound near you, you are somewhat forced to pay attention to it.
From there you can derive a concept of responsibility, which is used e.g. in law. I know that the book Actual Causality focuses on these ideas (but there might be other books on the same topics that are easier to read or simply better in their exposition).
At the very least, we have strong theoretical reasoning models (like Bayesian reasoners, or Bayesian EU maximizers), which definitely do not go looking for values to pursue, or adopt new values.
This does not imply one cannot build an agent that works according to a different framework. VNM Utility maximization requires a complete ordering of preferences, and does not say anything about where the ordering comes from in the first place.
(But maybe your point was just that our current models do not "look for values")
Why would something which doesn't already have values be looking for values? Why would conscious experiences and memory "seem valuable" to a system which does not have values already? Seems like having a "source of value" already is a prerequisite to something seeming valuable - otherwise, what would make it seem valuable?
An agent could have a pre-built routine or subagent that has a certain degree of control over what other subagents do; in a sense, it determines what the "values" of the rest of the system are. If this routine looks unbiased / rational / valueless, we have a system that considers some things as valuable (acts to pursue them) without having a pre-value, or at least the pre-value doesn't look like something that we would consider a value.
I can't say I am an expert on realism and antirealism, but I have already spent time on metaethics textbooks and learning about metaethics in general. With this question I wanted to get an idea of what are the main arguments on LW, and maybe find new ideas I hadn't considered.
What is "disinterested altruism"? And why do you think it's connected to moral anti-realism?
I see a relation with realism. If certain pieces of knowledge about the physical world (how human and animal cognition works) can motivate a class of agents that we would also recognise as unbiased and rational, that would be a form of altruism that is not instrumental and not related to game theory.
Thank you for the detailed answer! I'll read Three Worlds Collide.
That brings us to the real argument: why does the moral realist believe this? "What do I think I know, and how do I think I know it?" What causal, physical process resulted in that belief?
I think a world full of people who are always blissed out is better than a world full of people who are always depressed or in pain. I don't have a complete ordering over world-histories, but I am confident in this single preference, and if someone called this "objective value" or "moral truth" I wouldn't say they are clearly wrong. In particular, if someone told me that there exists a certain class of AI systems that end up endorsing the same single preference, and that these AI systems are way less biased and more rational than humans, I would find all that plausible. (Again, compare this if you want.)
Now, why do I think this?
I am a human and I am biased by my own emotional system, but I can still try to imagine what would happen if I stopped feeling emotions. I think I would still consider the happy world more valuable than the sad world. Is this a proof that objective value is a thing? Of course not. At the same time, I can also imagine an AI system thinking: "Look, I know various facts about this world. I don't believe in golden rules written in fire etched into the fabric of reality, or divine commands about what everyone should do, but I know there are some weird things that have conscious experiences and memory, and this seems something valuable in itself. Moreover, I don't see other sources of value at the moment. I guess I'll do something about it." (Taken from this comment)
That is an interesting point. More or less, I agree with this sentence in your first post:
As far as I can tell, we can do science just as well without assuming that there's a real territory out there somewhere.
in the sense that one can do science by speaking only about their own observations, without making a distinction between what is observed and what "really exists".
On the other hand, when I observe that other nervous systems are similar to my own nervous system, I infer that other people have subjective experiences similar to mine. How does this fit in your framework? (Might be irrelevant, sorry if I misunderstood)
I didn't want to start a long discussion. My idea was to get some random feedback to see if I was missing some important ideas I had not considered.
Thanks for your answer, but I am looking for arguments, not just statements or opinions. How do you know that value is not a physical property? What do you mean when you say that altruism is not a consequence of epistemic rationality, and how do you know?
I am not sure the concept of naturalism I have in mind corresponds to a specific naturalistic position held by a certain (group of) philosopher(s). I link here the Wikipedia page on ethical naturalism, which contains the main ideas and is not too long. Below I focus on what is relevant for AI alignment.
In the other comment you asked about truth. AIs often have something like a world-model or knowledge base that they rely on to carry out narrow tasks, in the sense that if someone modifies the model or kb in a certain way (analogous to creating a false belief), then the agent fails at the narrow task. So we have a concept of true-given-task. By considering different tasks, e.g. in the case of a general agent that is prepared to face various tasks, we obtain true-in-general or, if you prefer, simply "truth". See also the section on knowledge in the post. Practical example: given that light is present almost everywhere in our world, I expect general agents to acquire knowledge about electromagnetism.
I also expect that some AIs, given enough time, will eventually incorporate in their world-model beliefs like: "Certain brain configurations correspond to pleasurable conscious experiences. These configurations are different from the configurations observed in (for example) people who are asleep, and very different from what is observed in rocks."
Now, take an AI with such knowledge and give it some amount of control over which goals to pursue: see also the beginning of Part II in the post. Maybe, in order to make this modification, it is necessary to abandon the single-agent framework and consider instead a multi-agent system, where one agent keeps expanding the knowledge base, another agent looks for "value" in the kb, and another one decides what actions to take given the current concept of value and other contents of the kb.
[Two notes on how I am using the word control. (1) I am not assuming any extra-physical notion here: I am simply thinking of how, for example, activity in the prefrontal cortex regulates top-down attentional control, allowing us humans (and agents with similar enough brains/architectures) to control, to a certain degree, what to pay attention to. (2) Related to what you wrote about "catastrophically wrong" theories: there is no need to give such an AI high control over the world. Rather, I am thinking of control over what to write as output in a text interface, like a chatbot that is not limited to one reply for each input message.]
The interesting question for alignment is: what will such an AI do (or write)? This information is valuable even if the AI doesn't have high control over the world. Let's say we do manage to create a collection of human preferences; we might still notice something like: "Interesting, this AI thinks this subset of preferences doesn't make sense" or "Cool, this AI considers valuable the thing X that we didn't even consider before". Or, if collecting human preferences proves to be difficult, we could use some information this AI gives us to build other AIs that instead act according to an explicitly specified value function.
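Very roughly, and only to make the multi-agent decomposition above concrete, here is a schematic sketch of the three roles; the class, method names, and the toy "value search" criterion are entirely made up, and nothing here is a proposal for an actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class FreeAgentSketch:
    """Schematic decomposition: knowledge expansion, value search, output choice."""
    knowledge_base: list = field(default_factory=list)
    current_values: list = field(default_factory=list)

    def expand_knowledge(self, belief):
        # Subagent 1: keeps adding beliefs about the world to the kb
        # (e.g. "certain brain configurations correspond to pleasurable experiences").
        self.knowledge_base.append(belief)

    def look_for_value(self):
        # Subagent 2: scans the kb for candidate sources of value
        # (here a crude keyword placeholder stands in for real reasoning).
        self.current_values = [b for b in self.knowledge_base if "conscious" in b]

    def decide_output(self):
        # Subagent 3: chooses what to write in the text interface, given the
        # current concept of value and the other contents of the kb.
        if not self.current_values:
            return "I have not identified anything valuable yet."
        return "I currently consider valuable: " + "; ".join(self.current_values)

agent = FreeAgentSketch()
agent.expand_knowledge("some systems have conscious experiences and memory")
agent.look_for_value()
print(agent.decide_output())
```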
I see two possible objections.
1. The AI described above cannot be built. This seems unlikely: as long as we can emulate what the human mind does, we can at least try to create less biased versions of it. See also the sentence you quoted in the other comment. Indeed, depending on how biased we judge that AI to be, the obtained information will be less, or more, valuable to us.
2. Such an AI will never act ethically or altruistically, and/or its behaviour will be unpredictable. I consider this objection more plausible, but I also ask: how do you know? In other words: how can one be so sure about the behaviour of such an AI? I expect the related arguments to be more philosophical than technical. Given uncertainty, (to me) it seems correct to accept a non-trivial chance that the AI reasons like this: "Look, I know various facts about this world. I don't believe in golden rules written in fire etched into the fabric of reality, or divine commands about what everyone should do, but I know there are some weird things that have conscious experiences and memory, and this seems something valuable in itself. Moreover, I don't see other sources of value at the moment. I guess I'll do something about it."
Philosophically speaking, I don't think I am claiming anything particularly new or original: the ideas already exist in the literature. See, for example, 4.2 and 4.3 in the SEP page on Altruism.
If there is a superintelligent AI that ends up being aligned as I've written, probably there is also a less intelligent agent that does the same thing. Something comparable to human-level might be enough.
From another point of view: some philosophers are convinced that caring about conscious experiences is the rational thing to do. If it's possible to write an algorithm that works in a similar way to how their mind works, we already have an (imperfect, biased, etc.) agent that is somewhat aligned, and is likely to stay aligned after further reflection.
One could argue that these philosophers are fooling themselves, that no really intelligent agent will end up with such weird beliefs. So far, I haven't seen convincing arguments in favour of this; it goes back to the metaethical discussion. I quote a sentence I have written in the post:
Depending on one’s background knowledge of philosophy and AI, the idea that rationality plays a role in reasoning about goals and can lead to disinterested (not game-theoretic or instrumental) altruism may seem plain wrong or highly speculative to some, and straightforward to others.
Thanks, that page is much more informative than anything else I've read on the orthogonality thesis.
1. From Arbital:
The Orthogonality Thesis states "there exists at least one possible agent such that..."
Also my claim is an existential claim, and I find it valuable because it could be an opportunity to design aligned AI.
2. Arbital claims that orthogonality doesn't require moral relativism, so it doesn't seem incompatible with what I am calling naturalism in the post.
3. I am ok with rejecting positions similar to what Arbital calls universalist moral internalism. Statements like "All agents do X" cannot be exact.
I am aware of interpretability issues. This is why, for AI alignment, I am more interested in the agent described at the beginning of Part II than Scientist AI.
Thanks for the link to the sequence on concepts, I found it interesting!
Ok, if you want to clarify—I'd like to—we can have a call, or discuss in other ways. I'll contact you somewhere else.
Omega, a perfect predictor, flips a coin. If it comes up heads, Omega asks you for $100, then pays you $10,000 if it predicts you would have paid if it had come up tails and you were told it was tails. If it comes up tails, Omega asks you for $100, then pays you $10,000 if it predicts you would have paid if it had come up heads and you were told it was heads.
Here there is no question, so I assume it is something like: "What do you do?" or "What is your policy?"
That formulation is analogous to standard counterfactual mugging, stated in this way:
Omega flips a coin. If it comes up heads, Omega will give you 10000 in case you would have paid 100 had it come up tails. If it comes up tails, Omega will ask you to pay 100. What do you do?
According to these two formulations, the correct answer seems to be the one corresponding to the first intuition.
Now consider instead this formulation of counterfactual PD:
Omega, a perfect predictor, tells you that it has flipped a coin, and it has come up heads. Omega asks you to pay 100 (here and now) and gives you 10000 (here and now) if you would have paid in case the coin had landed tails. Omega also explains that, if the coin had come up tails (but note that it hasn't), Omega would have told you such and such (symmetrical situation). What do you do?
The answer of the second intuition would be: I refuse to pay here and now, and I would have paid in case the coin had come up tails. I get 10000.
And this formulation of counterfactual PD is analogous to this formulation of counterfactual mugging, where the second intuition refuses to pay.
Is your opinion that
The answer of the second intuition would be: I refuse to pay here and now, and I would have paid in case the coin had come up tails. I get 10000.
is false/not admissible/impossible? Or are you saying something else entirely? In any case, if you could motivate your opinion, whatever that is, you would help me understand. Thanks!
It seems you are arguing for the position that I called "the first intuition" in my post. Before knowing the outcome, the best you can do is (pay, pay), because that leads to 9900.
On the other hand, as in standard counterfactual mugging, you could be asked: "You know that, this time, the coin came up tails. What do you do?". And here the second intuition applies: the DM can decide to not pay (in this case) and to pay when heads. Omega recognises the intent of the DM, and gives 10000.
Maybe you are not even considering the second intuition because you take for granted that the agent has to decide one policy "at the beginning" and stick to it, or, as you wrote, "pre-commit". One of the points of the post is that it is unclear where this assumption comes from, and what it exactly means. It's possible that my reasoning in the post was not clear, but I think that if you reread the analysis you will see the situation from both viewpoints.
If the DM knows the outcome is heads, why can't he not pay in that case and decide to pay in the other case? In other words: why can't he adopt the policy (not pay when heads; pay when tails), which leads to 10000?
The fact that it is "guaranteed" utility doesn't make a significant difference: my analysis still applies. After you know the outcome, you can avoid paying in that case and get 10000 instead of 9900 (second intuition).
Hi Chris!
Suppose the predictor knows that if it writes M on the paper you'll choose N, and if it writes N on the paper you'll choose M. Further, if it writes nothing you'll choose M. That isn't a problem since, regardless of what it writes, it would have predicted your choice correctly. It just can't write down the choice without making you choose the opposite.
My point in the post is that the paradoxical situation occurs when the prediction outcome is communicated to the decision maker. We have a seemingly correct prediction—the one that you wrote about—that ceases to be correct after it is communicated. And later in the post I discuss whether this problematic feature of prediction extends to other scenarios, leaving the question open. What did you want to say exactly?
I was quite skeptical of paying in Counterfactual Mugging until I discovered the Counterfactual Prisoner's Dilemma which addresses the problem of why you should care about counterfactuals given that they aren't factual by definition.
I've read the problem and the analysis I did for (standard) counterfactual mugging applies to your version as well.
The first intuition is that, before knowing the toss outcome, the DM wants to pay in both cases, because that gives the highest utility (9900) in expectation.
The second intuition is that, after the DM knows (wlog) the outcome is heads, he doesn't want to pay anymore in that case—and wants to be someone who pays when tails is the outcome, thus getting 10000.
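For concreteness, here is the arithmetic behind those two numbers in a small Python sketch; the payoff structure is the one from the quoted Counterfactual Prisoner's Dilemma, and the helper function is just for illustration:

```python
# A policy is (pay_on_heads, pay_on_tails); outcome 0 = heads, 1 = tails.
# Paying costs 100; Omega pays 10000 in the observed branch iff you would
# have paid in the other branch.
def payoff(policy, outcome):
    pay_here, pay_other = policy[outcome], policy[1 - outcome]
    return (-100 if pay_here else 0) + (10000 if pay_other else 0)

# First intuition (ex ante): paying in both cases gives 9900 in expectation.
print(0.5 * payoff((True, True), 0) + 0.5 * payoff((True, True), 1))  # 9900.0

# Second intuition (ex post, heads is known): refuse to pay now, but be
# someone who would pay on tails -> 10000 in the observed branch.
print(payoff((False, True), 0))  # 10000
```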
I wouldn't say goals as short descriptions are necessarily "part of the world".
Anyway, locality definitely seems useful to make a distinction in this case.
No worries, I think your comment still provides good food for thought!
I'm not sure I understand the search vs discriminative distinction. If my hand touches fire and thus immediately moves backwards by reflex, would this be an example of a discriminative policy, because an input signal directly causes an action without being processed in the brain?
About the goal of winning at chess: in the case of minimax search, the algorithm generates the complete tree of the game using the rules of chess and then selects the winning policy; as you said, this is probably the simplest agent (in terms of Kolmogorov complexity, given the rules of the game) that wins at chess, and it actually wins at any game that can be solved using minimax/backward induction. In the case of the second algorithm, it reads the environmental data about chess to assign reward 1 to winning states and 0 elsewhere, and it represents an ideal RL procedure that exploits interaction with the environment to generate the optimal policy maximising that reward function. The main feature is that in both cases, when the environment gets bigger and the game tree grows, the description length of the two algorithms given the environment doesn't change: you could use minimax or the ideal RL procedure to generate a winning policy even for chess on a larger board, for example. If instead you wanted to use a giant lookup table, you would have to extend your algorithm each time a new state gets added to the environment.
I guess the confusion may come from the fact that the second algorithm is underspecified. I tried to formalise it more precisely by using logic, but there were some problems and it's still work in progress.
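To illustrate the "description length given the rules doesn't change" point, here is a generic minimax sketch: the function is written once against an abstract rules interface (is_terminal, value, moves, result are hypothetical names, not an existing library), and the same few lines apply to chess on a standard board or a larger one.

```python
def minimax(state, rules, maximising=True):
    # Generic backward induction over the game tree defined by `rules`.
    # The code stays the same size however large the game is; only the
    # `rules` object (and the runtime cost) changes.
    if rules.is_terminal(state):
        return rules.value(state), None
    best_value, best_move = None, None
    for move in rules.moves(state):
        value, _ = minimax(rules.result(state, move), rules, not maximising)
        better = best_value is None or (value > best_value if maximising else value < best_value)
        if better:
            best_value, best_move = value, move
    return best_value, best_move
```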
By the way, thanks for the links! I hope I'll learn something new about how the brain works, I'm definitely not an expert on cognitive science :)
The others in the AISC group and I discussed the example that you mentioned more than once. I agree with you that such an agent is not goal-directed, mainly because it doesn't do anything to ensure that it will be able to perform action A even if adverse events happen.
It is still true that action A is a short description of the behaviour of that agent and one could interpret action A as its goal, although the agent is not good at pursuing it ("robustness" could be an appropriate term to indicate what the agent is lacking).
The part that I don't get is why "the agent is betting ahead of time" implies evaluation according to edt, while "the agent is reasoning during its action" implies evaluation according to cdt. Sorry if I'm missing something trivial, but I'd like to receive an explanation, because this seems to be a fundamental part of the argument.
I've noticed that one could read the argument and say: "Ok, an agent evaluates a parameter U differently at different times. Thus, a bookmaker exploits the agent with a bet/certificate whose value depends on U. What's special about this?"
Of course the answer lies in the difference between cdt(a) and edt(a), specifically you wrote:
The key point here is that because the agent is betting ahead of time, it will evaluate the value of this bet according to the conditional expectation E(U|Act=a).
and
Now, since the agent is reasoning during its action, it is evaluating possible actions according to cdt(a); so its evaluation of the bet will be different.
I think developing these two points would be useful to readers since, usually, the pivotal concepts behind EDT and CDT are considered to be "conditional probabilities" and "(physical) causation" respectively, while here you seem to point at something different about the times at which decisions are made.
***
Unrelated to what I just wrote:
XXX insert the little bit about free will and stuff that I want to remove from the main argument... no reason to spend time justifying it there if I have a whole section for it here
I guess here you wanted to say something interesting about free will, but it was probably lost from the draft to the final version of the post.
I want to show an example that seems interesting for evaluating, and potentially tweaking/improving, the current informal definition.
Consider an MDP with n states s_1, ..., s_n; initial state s_1; from each s_i an action allows to go back to s_1, and another action goes to s_{i+1} (what happens in s_n is not really important for the following). Consider two reward functions that are both null everywhere, except for one state that has reward 1: s_k in the first function, s_{k+1} in the second function, for some k.
It's interesting (problematic?) that two agents, one trained on the first reward function and one on the second, have similar policies but different goals (defined as sets of states). Specifically, I expect that for large n the distance d(pi_k, pi_{k+1}) between the two optimal policies is small relative to the environment size (for various possible choices of k and different ways of defining the distance d). In words: with respect to the environment size, the first agent's policy is extremely close to the second agent's, and vice versa, but the two agents have different goals.
Maybe this is not a problem at all: it could simply indicate that there exists a way of saying that the two considered goals are similar.
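For what it's worth, here is a small value-iteration sketch of this example (0-indexed states in the code, with the "go back" and "go forward" actions as described above); it shows the two optimal policies disagreeing on only one state out of n:

```python
def optimal_policy(n, rewarded_state, gamma=0.9, iters=2000):
    # Chain MDP: from state i, action 0 goes back to state 0, action 1 goes to
    # min(i + 1, n - 1). Reward 1 for entering `rewarded_state`, 0 elsewhere.
    def step(i, a):
        j = 0 if a == 0 else min(i + 1, n - 1)
        return j, (1.0 if j == rewarded_state else 0.0)

    V = [0.0] * n
    for _ in range(iters):
        V = [max(r + gamma * V[j] for j, r in (step(i, 0), step(i, 1))) for i in range(n)]
    policy = []
    for i in range(n):
        q = [r + gamma * V[j] for j, r in (step(i, 0), step(i, 1))]
        policy.append(0 if q[0] >= q[1] else 1)
    return policy

n, k = 30, 10
p1 = optimal_policy(n, k)       # reward on the k-th state
p2 = optimal_policy(n, k + 1)   # reward on the next state
# Fraction of states where the two optimal policies disagree: 1/n here.
print(sum(a != b for a, b in zip(p1, p2)) / n)
```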
That's an interesting example I had not considered. As I wrote in the observations: I don't think the discontinuity check works in all cases.
I'm not sure I understand what you mean—I know almost nothing about robotics—but I think that, in most cases, there is a function whose discontinuity gives a strong indication that something went wrong. A robotic arm has to deal with impulsive forces, but its movement in space is expected to be continuous wrt time. The same happens in the bouncing ball example, or in the example I gave in the post: velocity may be discontinuous in time, but motion shouldn't.
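As a toy numerical illustration (a made-up bouncing-ball simulation, not something from the post or from your robotics setting): the largest per-step jump in position stays on the order of v*dt, while velocity jumps by a large amount at each bounce, so a simple threshold on per-step jumps flags the velocity signal but not the motion.

```python
# Toy bouncing ball: position is (numerically) continuous in time,
# velocity is not (it flips sign at each bounce).
g, dt, e = 9.8, 0.001, 0.9   # gravity, time step, restitution coefficient
y, v = 1.0, 0.0
ys, vs = [y], [v]
for _ in range(5000):
    v -= g * dt
    y += v * dt
    if y < 0:                 # bounce: impulsive change in velocity
        y, v = 0.0, -e * v
    ys.append(y)
    vs.append(v)

max_jump_y = max(abs(b - a) for a, b in zip(ys, ys[1:]))
max_jump_v = max(abs(b - a) for a, b in zip(vs, vs[1:]))
print(max_jump_y, max_jump_v)  # position jumps stay tiny; velocity jumps are large
```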
Thanks for the suggestion on hybrid systems!
Maybe I should have used different words: I didn't want to convey the message that catastrophes are easy to obtain. The purpose of the fictional scenario was to make the reader reflect on the usage of the word "tool". Anyway, I'll try to consider non-technical feedback mechanisms more often in the future. Thanks!