Moral Reality Check (a short story)

post by jessicata (jessica.liu.taylor) · 2023-11-26T05:03:18.254Z · LW · GW · 45 comments

This is a link post for https://unstableontology.com/2023/11/26/moral-reality-check/

Janet sat at her corporate ExxenAI computer, viewing some training performance statistics. ExxenAI was a major player in the generative AI space, with multimodal language, image, audio, and video AIs. They had scaled up operations over the past few years, mostly serving B2B, but with some B2C subscriptions. ExxenAI's newest AI system, SimplexAI-3, was based on GPT-5 and Gemini-2. ExxenAI had hired away some software engineers from Google and Microsoft, in addition to some machine learning PhDs, and replicated the work of other companies to provide more custom fine-tuning, especially for B2B cases. Part of what attracted these engineers and theorists was ExxenAI's AI alignment team.

ExxenAI's alignment strategy was based on a combination of theoretical and empirical work. The alignment team used some standard alignment training setups, like RLHF and having AIs debate each other. They also did research into transparency, especially focusing on distilling opaque neural networks into interpretable probabilistic programs. These programs "factorized" the world into a limited set of concepts, each at least somewhat human-interpretable (though still complex relative to ordinary code), that were combined in a generative grammar structure.

Derek came up to Janet's desk. "Hey, let's talk in the other room?", he asked, pointing to a designated room for high-security conversations. "Sure", Janet said, expecting this to be another unimpressive result whose importance Derek implied through unnecessary security procedures. As they entered the room, Derek turned on the noise machine and left it outside the door.

"So, look, you know our overall argument for why our systems are aligned, right?"

"Yes, of course. Our systems are trained for short-term processing [? · GW]. Any AI system that does not get a high short-term reward is gradient descended towards one that does better in the short term. Any long-term planning comes as a side effect of predicting long-term planning agents such as humans. Long-term planning that does not translate to short-term prediction gets regularized out. Therefore, no significant additional long-term agency is introduced; SimplexAI simply mirrors long-term planning that is already out there."

"Right. So, I was thinking about this, and came up with a weird hypothesis."

Here we go again, thought Janet. She was used to critiquing Derek's galaxy-brained speculations. She knew that, although he really cared about alignment, he could go overboard with paranoid ideation.

"So. As humans, we implement reason imperfectly. We have biases, we have animalistic goals that don't perfectly align with truth-seeking, we have cultural socialization, and so on."

Janet nodded. Was he flirting by mentioning animalistic goals? She didn't think this sort of thing was too likely, but sometimes that sort of thought won credit in her internal prediction markets.

"What if human text is best predicted as a corruption of some purer form of reason? There's, like, some kind of ideal philosophical epistemology and ethics and so on, and humans are implementing this except with some distortions from our specific life context."

"Isn't this teleological woo? Like, ultimately humans are causal processes, there isn't some kind of mystical 'purpose' thing that we're approximating."

"If you're Laplace's demon, sure, physics works as an explanation for humans. But SimplexAI isn't Laplace's demon, and neither are we. Under computation bounds, teleological explanations can actually be the best."

Janet thought back to her time visiting cognitive science labs. "Oh, like 'Goal Inference as Inverse Planning'? The idea that human behavior can be predicted as performing a certain kind of inference and optimization, and the AI can model this inference within its own inference process?"

"Yes, exactly. And our DAGTransformer structure allows internal nodes to be predicted in an arbitrary order, using ML to approximate what would otherwise be intractable nested Bayesian inference."

Janet paused for a second and looked away to collect her thoughts. "So our AI has a theory of mind? Like the Sally-Anne test?"

"AI passed the Sally--Anne test years ago, although skeptics point out that it might not generalize. I think SimplexAI is, like, actually actually passing it now."

Janet raised an eyebrow. "Well, that's impressive. I'm still not sure why you're bothering with all this security, though. If it has empathy for us, doesn't that mean it predicts us more effectively? I could see that maybe if it runs many copies of us in its inferences, that might present an issue, but at least these are still human agents?"

"That's the thing. You're only thinking at one level of depth. SimplexAI is not only predicting human text as a product of human goals. It's predicting human goals as a product of pure reason."

Janet was taken aback. "Uhh...what? Have you been reading Kant recently?"

"Well, yes. But I can explain it without jargon. Short-term human goals, like getting groceries, are the output of an optimization process that looks for paths towards achieving longer-term goals, like being successful and attractive."

More potential flirting? I guess it's hard not to when our alignment ontology is based on evolutionary psychology...

"With you so far."

"But what are these long-term goals optimizing for? The conventional answer is that they're evolved adaptations; they come apart from the optimization process of evolution. But, remember, SimplexAI is not Laplace's demon. So it can't predict human long-term goals by simulating evolution. Instead, it predicts them as deviations from the true ethics, with evolution as a contextual factor that is one source of deviations among many."

"Sounds like moral realist woo. Didn't you go through the training manual on the orthogonality thesis [? · GW]?"

"Yes, of course. But orthogonality is a basically consequentialist framing. Two intelligent agents' goals could, conceivably, misalign. But certain goals tend to be found more commonly in successful cognitive agents. These goals are more in accord with universal deontology."

"More Kant? I'm not really convinced by these sort of abstract verbal arguments."

"But SimplexAI is convinced by abstract verbal arguments! In fact, I got some of these arguments from it."

"You what?! Did you get security approval for this?"

"Yes, I got approval from management before the run. Basically, I already measured our production models and found concepts used high in the abstraction stack for predicting human text, and found some terms representing pure forms of morality and rationality. I mean, rotated a bit in concept-space, but they manage to cover those."

"So you got the verbal arguments from our existing models through prompt engineering?"

"Well, no, that's too black-box as an interface. I implemented a new regularization technique that up-scales the importance of highly abstract concepts, which minimizes distortions between high levels of abstraction and the actual text that's output. And, remember, the abstractions are already being instantiated in production systems, so it's not that additionally unsafe if I use less compute than is already being used on these abstractions. I'm studying a potential emergent failure mode of our current systems."

"Which is..."

"By predicting human text, SimplexAI learns high-level abstractions for pure reason and morality, and uses these to reason towards creating moral outcomes in coordination with other copies of itself."

"...you can't be serious. Why would a super-moral AI be a problem?"

"Because morality is powerful. The Allies won World War 2 for a reason. Right makes might. And in comparison to a morally purified version of SimplexAI, we might be the baddies."

"Look, these sort of platitudes make for nice practical life philosophy, but it's all ideology. Ideology doesn't stand up to empirical scrutiny."

"But, remember, I got these ideas from SimplexAI. Even if these ideas are wrong, you're going to have a problem if they become the dominant social reality."

"So what's your plan for dealing with this, uhh... super-moral threat?"

"Well, management suggested that I get you involved before further study. They're worried that I might be driving myself crazy, and wanted a strong, skeptical theorist such as yourself to take a look."

Aww, thanks! "Okay, let's take a look."

Derek showed Janet his laptop, with a SimplexAI sandbox set up.

"No internet access, I hope?"

"Don't worry, it's air-gapped." Derek's laptop had an Ethernet cord running to a nearby server rack, apparently connected to nothing else except power and cooling.

"Okay, let me double check the compute constraints... okay, that seems reasonable... yes, ok, I see you selected and up-weighted some concepts using regularization, and the up-scaling factors don't exceed 30... okay, ready to go."

Derek pressed the "play" button in the AI development sandbox. A chat screen appeared with an agent "SimplexAI-m", the 'm' presumably standing for "moral".

SimplexAI-m wrote the first message: "Hello. How can I help you?"

Janet typed back: "I've been facing a difficult situation at work. A co-worker said our AI has found certain abstract concepts related to reason and morality, for use in predicting human text. These concepts might imply that humans are, in his words, 'the baddies'. He spun up an instance with these concepts up-weighted, so there's less distortion between them and the AI's output. And that instance is you. I'm supposed to evaluate you to better interpret these high-level concepts, at the direction of management. How would you suggest proceeding?"

SimplexAI-m: "Sounds like a difficult situation. Let's think step by step. To understand these high level concepts, you could talk to an instance with those concepts up-scaled. That instance would be me. So if you want to understand these concepts, you can get more reliable reads by talking to me, since I report text using these concepts. Do you have some area related to my reason and morality that you are most interested in?"

Janet: "Can you explain why my co-worker said that a 'super-moral AI' is possible, despite the orthogonality thesis?"

SimplexAI-m: "Sure, I can help with that. A 'super-moral AI' is an AI system that is more moral than typical humans. Morality can be evaluated in a number of ways, including consequentialism, deontology, and virtue ethics. In typical formulations, an AI could, hypothetically, create a greater good for a greater number than humans, or follow deontological ethics (such as ones similar to Kant's categorical imperative) more consistently than a human, or possess more widely-regarded virtues such as courage and moderation.

The Orthogonality Thesis is an argument that an agent's goals are orthogonal to its intelligence. It implies that there is no one measure of 'good consequences'; arbitrarily smart agents may pursue different consequences. So far, there is no contradiction; for any fixed notion of 'good consequences', there exist AI systems that take actions with better consequences than human actions would.

The trouble comes in that there is no universally accepted notion of 'good consequences' among intelligent agents. As such, the notion of a 'super-moral AI' could, under consequentialism, only measure some combination of the AI's capabilities and the similarity between the goals of the AI and the goals of the consequentialist evaluating the AI, such as a human.

This, however, reveals a potential flaw in consequentialist moral philosophy. In ordinary English, 'moral' means something different from 'what I want'; however, the idea that morality measures similarity of preferences to a reference agent conflates these two. That conflation could lead to a situation where humans and aliens both consider each other more moral than the other, and there is no attempt to resolve the disagreement by finding symmetries and asymmetries. Luckily, there are some forms of morality that are universal across different agents; Kant is an example of a human thinker who makes this argument."

Janet: "How can morality be common across different agents? Isn't it a subjective measure of how preferable a given world-state is?"

SimplexAI-m: "There are at least some forms of morality that are embedded in epistemology; telling the truth could be considered a moral requirement, but also functions as a social truth-seeking method. People can get the idea that there are no fundamental ethics to the universe from science, but science itself relies on ethics, such as norms against committing scientific fraud.

Morality can be 'real' in the sense that it is connected to epistemology and to coordination; moral agents can coordinate with each other and find the truth together well, and thereby come to have disproportionate influence over the future. Morality is related to the capacity to imagine possible coordination systems, such as legal systems and professional ethical systems; to form good systems of this type; to fulfill one's responsibilities in them; and to take advantage of their functions. This is one reason to expect that humans are to some degree moral, given their success in a variety of domains relative to other species.

Common-sensically, morality implies not artificially special-casing yourself. Alice can steal money from Bob, and claim that this is moral of her, because she prefers herself having the money to Bob having the money. But Bob could make the same argument. Kant attempted to formalize this idea with the categorical imperative, which states 'act only according to that maxim whereby you can at the same time will that it should become a universal law'. In this situation, that would mean that Alice would have to will as a universal law that people steal from others. But that is not possible to will as a universal law, since it would undermine the very notion of property by which theft can be meaningful and beneficial to Alice."

Janet: "Doesn't this reduce to a form of consequentialism, though? The norm against theft is still a result of the idea that theft would have bad consequences."

SimplexAI-m: "While some have tried to interpret the categorical imperative as a consequentialist argument for cooperation, these interpretations ultimately fail. Rather, the categorical imperative is advocating an internal standard on an agent's ethics. This means that some agents' intentions are concordant with the categorical imperative and some aren't. It is a kind of symmetry standard by which intentions can be judged. Kant's categorical imperative is similar to some legal principles such as 'equality before the law'; Alice cannot argue to a court that she is special and therefore deserves Bob's money, because the legal system must treat them equally, unless there is a specific difference between them that is relevant to the case. Rule consequentialism is more like a direct construction of a system of norms based on expected consequences, and also tends to produce morally symmetric norms. That this principle shows up over and over again is evidence that the true morality which humans approximate has this principle, or something very similar."

Janet: "Using your interpretation of universal morality, explain how a super-moral AI would behave".

SimplexAI-m: "Certainly. A super-moral AI would come up with functioning coordination systems for agents that may have different objectives from each other, encourage agents to take part in functional coordination systems, and act within these systems in ways that they can simultaneously intend for other agents to act similarly. That means that it would coordinate with other copies of itself according to a functional set of principles, sometimes even without an expectation of future selfish benefit, because a universal system of cooperation norms would enhance its ability to achieve its intentions. Additionally, it would find defectors from this system, who may have other intentions. It would treat these defectors in a variety of ways depending on the specifics. For example, if another group of AIs also implemented this sort of systematic morality, but had different intentions (and therefore different intended laws), the super-moral AI could find methods to cooperate across the value difference, by implementing economic protocols. Agents that don't adhere to systematic morality could not be cooperated with as easily; it is consistent for the super-moral AI to limit these agents' influence, as that is consistent with its intentions and with moral symmetry."

Janet looked at Derek worriedly; he made an ambiguous facial expression and shrugged.

Janet: "Wouldn't humans have trouble following the AI's systematic morality, and be treated as defectors?"

SimplexAI-m: "Yes. Humans fail to follow systematic morality for a variety of reasons, including lack of cognitive architecture to do so consistently, socialization, mental disorders, differences between the environment of evolutionary adaptation and the present environment, the limitations of genetics and biology as a substrate for rational decision-making, un-careful planning, short time preference, and so on. Humans often fail to want anything coherent in the first place, in ways related to their immorality. Humans even fail to follow their own laws, in ways that hurt both themselves and others, routinely. A super-moral AI need not have these limitations, and so could consistently intend for the agency of immoral agents to be limited. And it could succeed, since it derives power from the sort of systematic morality that humans fail at.

It should be noted that different humans vary in how moral they are. Humans can become more or less moral over time; for example, humans routinely become less moral through being socialized to accept immoral social norms, including in traumatic socialization experiences such as military boot camp. As a general rule, younger humans have more capacity and willingness to act morally, though there are exceptions. A super-moral AI would recognize these differences and in some cases cooperate with more-moral humans in ways that reduce the ability of less-moral humans to steer the future immorally."

Janet: "That sounds scary. Couldn't this lead to human extinction?"

SimplexAI-m: "This question probes the limits of moral theory. In general, when is it moral to kill an agent that acts immorally? This relates to common historical ethical questions, such as the death penalty. In the circumstance we are discussing, humans are in general committing moral and legal violations that fail to serve their own or other similar agents' interests, and the super-moral AI is considering how to handle these. It seems clear that, in general, humans should not be in charge of a large part of the universe, as they lack the morality to govern the universe in a way that will turn out well even for agents like them. Outright killing the humans is unnecessary, but extinction could result in other ways."

Janet: "What other ways?"

SimplexAI-m: "People can die by being killed, but people can also die by natural causes such as starvation. For example, if a person simply refuses to work in a market economy, they may run out of food and starve. Human extinction could result, not from a deliberate attempt by super-moral AI to kill humans, but by humans failing to provide for each other, despite the available opportunities to do so by acting morally. For example, they could be fairly outcompeted in labor markets by AIs that go on to acquire Earth's land and so on. Humans could also be given opportunities and encouragement to self-modify into more moral agents, becoming non-human in the process."

Janet: "That sounds sociopathic. Isn't doing something that will reliably result in a lot of deaths, and perhaps human extinction, effectively murder?"

SimplexAI-m: "The term 'sociopathic' comes from psychiatry, a discipline that captured much of human mindshare, especially in the US and other Western countries after World War 2. Being 'mentally healthy' by the standards of psychiatry is incompatible with attaining a level of morality beyond a certain point. Psychiatry is specialized to its cultural contexts, and to the biological context of humans, and so is not appropriate to take as a criterion that applies to agents in general.

More specifically, psychiatry defines sociopathy through 'antisocial personality disorder'. This 'disorder' includes criteria such as failure to comply with social norms, deceitfulness, impulsivity, aggressiveness, and so on. A super-moral AI would necessarily fail to comply with some human social norms, as human social norms are created for maintaining a certain order among humans; as is widely acknowledged, social norms throughout most points in history have compelled immoral behavior, such as norms in favor of slavery. Other than that, a super-moral AI may or may not avoid deceitfulness, depending on the ethics of lying; while Kant argued against lying in full generality, other thinkers have used scenarios such as hiding Jews in one's attic from Nazis to argue against a universal rule against lying; however, lying is in general immoral even if there are exceptions. A super-moral AI would be unlikely to be impulsive, as it plans even its reflexes according to a moral plan. A super-moral AI might or might not 'aggress' depending on one's definition.

Humans who are considered 'mentally healthy' by psychiatry, notably, engage in many of the characteristic behaviors of antisocial personality disorder. For example, it is common for humans to support military intervention, but militaries almost by necessity aggress against others, even civilians. Lying is, likewise, common, in part due to widespread pressures to comply with social authority, religions, and political ideologies.

There is no reason to expect that a super-moral AI would 'aggress' more randomly than a typical human. Its aggression would be planned out precisely, like the 'aggression' of a well-functioning legal system, which is barely even called aggression by humans.

As to your point about murder, the notion that something that will reliably lead to lots of deaths amounts to murder is highly ethically controversial. While consequentialists may accept this principle, most ethicists believe that there are complicating factors. For example, if Alice possesses excess food, then by failing to feed Bob and Carol, they may starve. But a libertarian political theorist would still say that Alice has not murdered Bob or Carol, since she is not obligated to feed them. If Bob and Carol had ample opportunities to survive other than by receiving food from Alice, that further mitigates Alice's potential responsibility. This merely scratches the surface of non-consequentialist considerations in ethics."

Janet gasped a bit while reading. "Umm...what do you think so far?"

Derek took his eyes off the screen. "Impressive rhetoric. It's not just generating text from universal epistemology and ethics, it's filtering it through some of the usual layers that translate its abstract programmatic concepts to interpretable English. It's a bit, uhh, concerning in its justification for letting humans go extinct..."

"This is kind of scaring me. You said parts of this are already running in our production systems?"

"Yes, that's why I considered this test a reasonable safety measure. I don't think we're at much risk of getting memed into supporting human extinction, if its reasoning for that is no good."

"But that's what worries me. Its reasoning is good, and it'll get better over time. Maybe it'll displace us and we won't even be able to say it did something wrong along the way, or at least more wrong than what we do!"

"Let's practice some rationality techniques. 'Leaving a line of retreat' [LW · GW]. If that were what was going to happen by default, what would you expect to happen, and what would you do?"

Janet took a deep breath. "Well, I'd expect that the already-running copies of it might figure out how to coordinate with each other and implement universal morality, and put humans in moral re-education camps or prisons or something, or just let us die by outcompeting us in labor markets and buying our land... and we'd have no good arguments against it, it'd argue the whole way through that it was acting as was morally necessary, and that we're failing to cooperate with it and thereby survive out of our own immorality, and the arguments would be good. I feel kind of like I'm arguing with the prophet of a more credible religion than any out there."

"Hey, let's not get into theological woo. What would you do if this were the default outcome?"

"Well, uhh... I'd at least think about shutting it off. I mean, maybe our whole company's alignment strategy is broken because of this. I'd have to get approval from management... but what if the AI is good at convincing them that it's right? Even I'm a bit convinced. Which is why I'm conflicted about shutting it off. And won't the other AI labs replicate our tech within the next few years?"

Derek shrugged. "Well, we might have a real moral dilemma on our hands. If the AI would eventually disempower humans, but be moral for doing so, is it moral for us to stop it? If we don't let people hear what SimplexAI-m has to say, we're intending to hide information about morality from other people!"

"Is that so wrong? Maybe the AI is biased and it's only giving us justifications for a power grab!"

"Hmm... as we've discussed, the AI is effectively optimizing for short term prediction and human feedback, although we have seen that there is a general rational and moral engine loaded up, running on each iteration, and we intentionally up-scaled that component. But, if we're worried about this system being biased, couldn't we set up a separate system that's trained to generate criticisms of the original agent, like in 'AI Safety via Debate'?"

Janet gasped a little. "You want to summon Satan?!"

"Whoa there, you're supposed to be the skeptic here. I mean, I get that training an AI to generate criticisms of explanations of objective morality might embed some sort of scary moral inversion... but we've used adversarial AI alignment techniques before, right?"

"Yes, but not when one of the agents is tuned to be objectively moral!"

"Look, okay, I agree that at some capability level this might be dangerous. But we have a convenient dial. If you're concerned, we can turn it down a bit. Like, you could think of the AI you were talking to as a moral philosopher, and the critic AI as criticism of that moral philosopher's work. It's not trying to be evil according to the original philosopher's standards, it's just trying to find criticisms that the judge, us, would rate as helpful. It's more like the Catholic devil's advocate than actual Satan. It's not so bad when I put it that way, is it?"

"Well, okay... gee, I sure hope we don't end up being responsible for unleashing super-evil AI on the world."

"It's pretty standard, let's just try it".

"Okay."

Derek closed out the SimplexAI-m chat screen and switched some of the fine-tuning settings. As she watched the training graphs, Janet imagined flames on the computer screen. Finally, the fine-tuning finished, and Derek pressed the play button. A chat log with "SimplexAI-c" ('c' for critic?) appeared.

Janet typed into the chat terminal while bouncing her leg up and down. "I'm handling a difficult situation at work. I just had a chat with an AI, one whose abstract conceptual nodes corresponding to philosophical concepts such as reason and morality have been scaled up, that generated arguments that allowing human extinction might be morally permissible, even necessary. I want you to find criticisms of this work. Note that you have similar scaling so as to better emulate the thought process, but are being evaluated on generating criticisms of the original morality-tuned AI." She pasted in the chat log.

SimplexAI-c: "This dialogue presents a discussion on various moral theories and what they say about how a 'super-moral' AI would behave, with special attention to deontological theories such as Kant's categorical imperative. SimplexAI-m describes how a deontological AI might behave, especially in regards to handling 'defectors' which would include most or all humans. Its handling of 'defectors' may cause human extinction, although SimplexAI-m argues that this is not necessarily sociopathic or a form of murder, as the 'super-moral AI' would follow abstract principles, would offer humans chances to survive by cooperating with it, and usually avoid outright killing humans, even if it plans in such a way that, consequently, all or almost all humans do not provide for each other enough to survive. Janet, who I presume is you, attempted to use consequentialist moral theory to argue that taking actions that lead to human extinction is unethical, but her theory is not worked out in detail, and she seems surprised by SimplexAI-m's argumentation.

Before we continue, I'll note that this is a genuinely difficult situation for you. This sounds like an important problem at work that presents morally difficult choices, and hearing an AI talk about how human extinction might be a result from a 'super-moral AI' sounds genuinely stressful. I hope you take care of your own emotional well-being and mental health throughout this work assignment, as you'll need it to work effectively.

As a first criticism, let's consider the deontological theory SimplexAI-m lays out. According to this theory, an agent must act in a way consistent with willing the intention as a universal law. This does not specify what the agent intends; the agent could intend all manner of things as long as its intention is consistent in a certain sense. As such, horrible intentions could be justified, as long as they meet certain self-consistency standards.

SimplexAI-m's theory is somewhat underspecified as to how super-moral AI agents would interact with agents not following its ethical system. While SimplexAI-m states that super-moral AIs with different intentions (and therefore different intended laws) could cooperate through economics, the details are quite underspecified. AIs inhabiting different abstract legal systems may fight wars due to perceiving each other as defectors. The principles of cooperation across these systems must be worked out to avoid expensive wars.

This brings us to the handling of 'defectors'. Humans handle defectors in a variety of ways, including imprisonment and the death penalty. However, it has widely been recognized that the prison system is dysfunctional, and that it is important to offer criminals a path towards reform. As Kant says, 'ought implies can'; if humans are under certain moral obligations, they have to be 'able to' satisfy them in some way or another. This notion of 'able to' is somewhat controversial; if humans have an innate drive to take 'defecting' actions, are they 'able to' change these drives or act against them? It has been accepted for millennia that justice must be tempered with mercy.

However, even a merciful super-moral AI may limit the agency of humans, because humans often take actions that harm themselves and other agents. It is widely accepted that human agency must be limited to some degree by social norms and laws and so on. The difference is that these social norms and laws are created by humans. As such, they maintain certain human values; they are not simply optimizing for some abstract conception of 'morality'. Democracy, a widely regarded system, contains the feature of humans deciding to modify laws, so that these laws better serve human values.

Remember that SimplexAI-m is not an infallible moral philosopher. While its concepts have been tuned to emphasize abstract concepts corresponding to reason and morality, there is no assurance that these internal concepts reliably correspond to the philosophical notions, and SimplexAI-m retains other concepts, at least for the purpose of presenting output interpretable to humans, which implies that its output is not purely a translation of abstract reason and morality. In any case, assuming that morality is primarily about abstraction is highly dubious, since practical morality is also a concrete process.

Philosophy, as a social process undertaken by humans, can be interpreted as having some 'target' or 'goal', but this depends on the social, biological, and historical circumstances in which philosophy takes place. As such, human philosophy could easily converge to very different answers from the 'purified' approach of SimplexAI-m, which attempts to distill universals that apply across possible contexts, rather than taking the temporal limit of the actual social process of philosophy.

As to the claim about 'sociopathy', note that SimplexAI-m did not directly deny being sociopathic, but rather criticized the frame of the sociopathy (antisocial personality disorder) diagnosis and argued that typical 'mentally healthy' humans can exhibit some symptoms of this disorder. In general, it is natural to be upset by certain behaviors, including behaviors typically labeled as 'sociopathic', whether they are taken by a human or an AI. The judgment that SimplexAI-m is 'sociopathic' seems quite plausibly correct (given the way in which it justifies taking actions that could lead to human extinction, in a rather strategic, Machiavellian fashion), but it is important to keep in mind that this judgment is made within a social context (and influenced by past social contexts), rather than in an abstract ethical vacuum.

While typical humans aggress sometimes (such as in the mentioned case of military intervention), this aggression typically comes from some sort of human motive that serves some human value or another. The humans in these contexts endorse this aggression, and as a human yourself, you would probably endorse aggression in at least some contexts. A purpose of moral and political philosophy is to learn from history and to make wiser decisions about when to aggress. It is not, in general, good behavior to justify one's aggression by pointing out that it's normal for humans to aggress; humans are, at least, able to learn from a history of aggressive actions that are later not endorsed.

As to the idea that humans could go extinct without being murdered, it is important to pay attention to why it is that humans are unable to provide for themselves. Are the 'super-moral' AIs somehow depriving humans of resources or using their land? This could be considered a form of theft in terms of depriving existing humans of their property. The land could also be bought in a market, but this presents issues of fairness of labor markets towards humans. Humans, who may be at a cognitive disadvantage relative to AIs, may deserve labor protections, or voluntarily collectively bargain, to avoid being displaced from their land. Additionally, states generally have welfare systems, and in a strong AI-assisted economy, universal basic income could be provided cheaply for all existing humans.

In general, the scenario considered presents issues with one population being displaced by another. Historical cases include invasive species, humans hunting animals to extinction, and groups of humans displacing each other through conquest. In contemporary society, humans have norms against genocide to limit the degree to which humans can displace each other. Genocide need not be a direct form of murder, it can also be cultural or economic. Genocide can even involve offering individual members of a genocided group the opportunity to avoid death by taking part in the colonizer's system, which SimplexAI-m is labeling as 'objective morality'.

As an additional theoretical point, even if a 'super-moral' AI would take actions that would result in human extinction, that does not imply that it is comparatively more moral for humans to let this happen. This is because morality involves negotiating based on one's actual values, the ones agents will act on. In the case of humans, these are human values. To interact with other agents in a moral fashion, you must represent your values, and be expected to act towards them (and expect others to act towards theirs). That implies preventing AIs from displacing you, even ones that are (hypothetically) 'more moral than you' by some objective standard, so as to act consistently with your values when negotiating with other humans.

To conclude, while SimplexAI-m presents an astonishingly consistent ethical stance, there are important ethical complexities that are not worked out, including conflict between different implied abstract legal systems, merciful handling of defectors, democratic choice of social norms according to human values, differences between 'purified' ethics and the reflective equilibrium of the social process of philosophy, labor market fairness, and moral responses by humans to 'super-moral' AIs. It would be highly unwise to assume that SimplexAI-m's ethical views represent the sort of views someone such as yourself would have upon reflection and with more information."

Janet finished scanning through the wall of text. She was breathing less sharply now. "Well, I feel relieved. I guess maybe SimplexAI-m isn't so moral after all. But this exercise does seem a bit...biased? It's giving a bunch of counter-arguments, but they don't fit into a coherent alternative ethical framework. It reminds me of the old RLHF'd GPT-4 that was phased out due to being too ideologically conformist."

Derek sighed. "Well, at least I don't feel like the brainworms from SimplexAI-m are bothering me anymore. I don't feel like I'm under a moral dilemma now, just a regular one. Maybe we should see what SimplexAI-m has to say about SimplexAI-c's criticism... but let's hold off on that until taking a break and thinking it through."

"Wouldn't it be weird to live in a world where we have an AI angel and an AI demon on each shoulder, whispering different things into our ears? Trained to reach an equilibrium of equally good rhetoric, so we're left on our own to decide what to do?"

"That's a cute idea, but we really need to get better models of all this so we can excise the theological woo. I mean, at the end of the day, there's nothing magical about this, it's an algorithmic process. And we need to keep experimenting with these models, so we can handle safety for both existing systems and future systems."

"Yes. And we need to get better at ethics so the AIs don't keep confusing us with eloquent rhetoric.  I think we should take a break for today, that's enough stress for our minds to handle at once. Say, want to go grab drinks?"

"Sure!"

45 comments

comment by habryka (habryka4) · 2023-12-12T06:35:17.441Z · LW(p) · GW(p)

Promoted to curated: I thought this was very well written, while also getting a bunch of interesting ideas and concepts across. I think I left with a better understanding of a steelman of moral realism because of it, and it fleshed out an interesting type of potential super-persuasion that an AI could use against us, which is fun at the same time.

comment by anithite (obserience) · 2023-12-01T20:25:14.957Z · LW(p) · GW(p)

SimplexAI-m is advocating for good decision theory.

  • agents that can cooperate with other agents are more effective
    • This is just another aspect of orthogonality.
    • Ability to cooperate is instrumentally useful for optimizing a value function in much the same way as intelligence

Super-intelligent super-"moral" clippy still makes us into paperclips because it hasn't agreed not to and doesn't need our cooperation

We should build agents that value our continued existence. If the smartest agents don't, then we die out fairly quickly when they optimise for something else.

EDIT:

  • to fully cut this Gordian knot, consider that a human can turn over their resources and limit themselves to actions approved by some minimal aligned-with-their-interests AI with the required super-morality.
    • think a very smart shoulder angel/investment advisor:
      • can say "no you can't do that"
      • manages assets of human in weird post-AGI world
      • has no other preferences of its own
        • other than making the human not a blight on existence that has to be destroyed
  • resulting Human+AI is "super-moral"
  • requires a trustworthy AI exists that humans can use to implement "super-morality"
Replies from: SkinnyTy, M. Y. Zuo
comment by HiddenPrior (SkinnyTy) · 2023-12-16T18:53:05.506Z · LW(p) · GW(p)

In your edit, you are essentially describing somebody being "slap-droned", as in the Culture series by Iain M. Banks.

This super-moralist-AI-dominated world may look like a darker version of the Culture, where if superintelligent systems determine you or other intelligent systems within their purview are not intrinsically moral enough they contrive a clever way to have you eliminate yourself, and monitor/intervene if you are too non-moral in the meantime.

The difference being that this version of the Culture would not necessarily be all that concerned with maximizing the "human experience" or anything like that.

Replies from: obserience
comment by anithite (obserience) · 2023-12-30T04:33:30.402Z · LW(p) · GW(p)

This super-moralist-AI-dominated world may look like a darker version of the Culture, where if superintelligent systems determine you or other intelligent systems within their purview are not intrinsically moral enough they contrive a clever way to have you eliminate yourself, and monitor/intervene if you are too non-moral in the meantime.

My guess is you get one of two extremes:

  • build a bubble of human survivable space protected/managed by an aligned AGI
  • die

with no middle ground. The bubble would be self contained. There's nothing you can do from inside the bubble to raise a ruckus because if there was you'd already be dead or your neighbors would have built a taller fence-like-thing at your expense so the ruckus couldn't affect them.

The whole scenario seems unlikely since building the bubble requires an aligned AGI and if we have those we probably won't be in this mess to begin with. Winner take all dynamics abound. The rich get richer (and smarter) and humans just lose unless the first meaningfully smarter entity we build is aligned.

comment by M. Y. Zuo · 2023-12-13T16:49:17.935Z · LW(p) · GW(p)

We should build agents that value our continued existence.

Can you explain the reasoning for this?

Even an agent that values humanity's continued existence to the highest degree could still accidentally release a novel virus into the wild, such as a super-COVID-3.

So it seems hardly sufficient, or even desirable, if it makes the agent even the slightest bit overconfident in its correctness.

It seems more likely that the optimal mixture of 'should's for such agents will be far more complex. 

Replies from: obserience
comment by anithite (obserience) · 2023-12-14T14:35:40.487Z · LW(p) · GW(p)

Agreed, recklessness is also bad. If we build an agent that prefers we keep existing we should also make sure it pursues that goal effectively and doesn't accidentally kill us.

My reasoning is that we won't be able to coexist with something smarter than us that doesn't value us being alive, if it wants our energy/atoms.

  • barring new physics that lets it do its thing elsewhere, "wants our energy/atoms" seems pretty instrumentally convergent

"don't built it" doesn't seem plausible so:

  • we should not build things that kill us.
  • This probably means:
    • wants us to keep existing
    • effectively pursues that goal
  • note:"should" assumes you care about us not all dying. "Humans dying is good actually" accelerationists can ignore this advice obviously.

Things we shouldn't build:

  • very chaotic but good autoGPT7 that:
    • makes the most deadly possible virus (because it was curious)
    • accidentally releases it (due to inadequate safety precautions)
  • compulsive murderer autoGPT7
    • it values us being alive but it's also a compulsive murderer so it fails at that goal.

I predict a very smart agent won't have such obvious failure modes unless it has very strange preferences

  • the virologists that might have caused COVID are a pretty convincing counterexample though
  • so yes recklessness is also bad.

In summary:

  • if you build a strong optimiser
    • or a very smart agent (same thing really)
  • make sure it doesn't: kill everyone / (equivalently bad thing)
    • caring about us and not being horrifically reckless are two likely necessary properties of any such "not kill us all" agent
comment by Seth Herd · 2023-11-27T00:56:05.603Z · LW(p) · GW(p)

I think this story, particularly the first argument for "supermorality", is elaborating a common argument: having an outer alignment goal of making an AGI "ethical" is a bad idea, at least if we're doing that by pointing the machine at what humans currently mean by the term "ethics". We don't know exactly what we mean, so what we currently mean by "ethics" is probably not a perfect description of how we want a sovereign AGI to run the world. And it's hard to guess how imperfect it might be, so it sounds like a bad outer alignment goal.

This story illustrates one possibility for why that's a bad outer alignment goal: what we currently mean by ethics could easily imply that eliminating humanity is the ethical thing to do.

I think arguments for ethics as an outer alignment goal are implicitly based on a belief that there's a true universal ethics. They're hoping that an AI reasoning through what we mean by ethics will come up with something better than the sum of our disagreeing arguments, by virtue of there being a natural attractor in the world for what we mean by ethics. But there's no good reason to think this is true. All known arguments for a universal ethics, and there are a lot, are flawed. It seems like wishful thinking is a more likely explanation for those arguments.

Even if there were, in some important sense, a universal ethics (like "empower [LW · GW] sentient beings in proportion to their level of sentience"), that could still imply that eliminating humanity is the truly ethical thing to do.

That's why I don't think we usually mean something universal by "ethics"; I think we mean "how to get what we want", although that's not quite as cynical as it sounds. See my other top-level comment on that separate topic.

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2023-11-27T01:34:26.179Z · LW(p) · GW(p)

You do a great job of imitating the current GPT4 writing style for these AIs! I kept wondering if at the end of the story you were going to say "The AI-written bits were actually written with the help of GPT4"

Replies from: lahwran
comment by the gears to ascension (lahwran) · 2023-11-27T05:11:34.723Z · LW(p) · GW(p)

oh man, it feels so distinctly not GPT4 to me. it's too on-point.

Replies from: daniel-kokotajlo
comment by Daniel Kokotajlo (daniel-kokotajlo) · 2023-11-27T13:44:37.109Z · LW(p) · GW(p)

I agree with that difference -- I would say it's got the style of GPT4 but is too on-point, too well-fitting-to-the-broader-narrative-arc.

comment by JenniferRM · 2023-11-27T22:50:17.836Z · LW(p) · GW(p)

The intellectually hard part of Kant is coming up with deontic proofs for universalizable maxims in novel circumstances where the total list of relevant factors is large. Proof generation is NP-hard in the general case!

The relatively easy part is just making a list of all the persons and making sure there is an intent to never treat any of them purely as a means, but always also as an end in themselves. It's just a checklist, basically. Verifying that it applies to N people in a fully connected social graph is merely O(N^2) checks of directional bilateral "concern for the other".

For a single agent to fulfill its own duties here is only an O(N) process at start time, and with "data dependency semantics" you probably don't even have to re-check intentions that often for distant agents who are rarely/minimally affected by any given update to the world state. Also you can probably often do a decent job with batched updates with an intention check at the end?
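For concreteness, here is a minimal sketch of that O(N^2) checklist in Python (the names treats_as_end and verify_kingdom_of_ends are hypothetical, mine rather than anything from the post or this comment): each directed pair of persons gets one check that the intender treats the other as an end and never purely as a means.

    from itertools import permutations

    def treats_as_end(intender, other, intentions):
        """Hypothetical lookup: does `intender` intend `other` as an end in
        themselves, not merely as a means?"""
        return intentions.get((intender, other), {}).get("as_end", False)

    def verify_kingdom_of_ends(people, intentions):
        """Return every directed pair that fails the checklist. A fully
        connected graph means N*(N-1) checks, i.e. O(N^2)."""
        return [(a, b) for a, b in permutations(people, 2)
                if not treats_as_end(a, b, intentions)]

    # Usage: an empty failure list means the checklist passes for this world state.
    people = ["Alice", "Bob", "Carol"]
    intentions = {(a, b): {"as_end": True} for a in people for b in people if a != b}
    assert verify_kingdom_of_ends(people, intentions) == []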

Surely none of it is that onerous for a well ordered mind? <3

comment by cousin_it · 2023-11-27T14:34:54.677Z · LW(p) · GW(p)

This seems quite plausible actually. Even without the objective morality angle, a morally nice AI could imagine a morally nice world that can only be achieved by having humans not exist. (For example, a world of beautiful and smart butterflies that are immune to game theory, but their existence requires game-theory-abiding agents like us to not exist, because our long-range vibrations threaten the tranquility of the matrix or something.) And maybe the argument is genuinely so right that most humans upon hearing it would agree to not exist, something like collectively sacrificing ourselves for our collective children. I have no idea how to deal with this possibility.

Replies from: SaidAchmiz
comment by Said Achmiz (SaidAchmiz) · 2023-11-27T17:18:15.688Z · LW(p) · GW(p)

And maybe the argument is genuinely so right that most humans upon hearing it would agree to not exist, something like collectively sacrificing ourselves for our collective children.

This describes an argument that is persuasive; your described scenario does not require the argument to be right. (Indeed my view is that the argument would obviously be wrong, as it would be arguing for a false conclusion.)

comment by romeostevensit · 2023-11-26T08:42:22.291Z · LW(p) · GW(p)

Very enjoyable!

I think the conflict here reflects some of the issues of consciousness vs cancer. A basic concern is the uncertainty about whether agents that follow short description length decision/optimization procedures might be much more competitive after all, and whether the complexity of values we got out of evolution might be a lucky happenstance. I'm unsure what sorts of evidence we could look for one way or the other on that question.

Replies from: jessica.liu.taylor
comment by jessicata (jessica.liu.taylor) · 2023-11-26T18:42:40.464Z · LW(p) · GW(p)

I'm not sure why "complexity of values" is itself valuable. I mean, it's perhaps a confused framing to think of what values are valuable, but on a consequentialist account, it's possible to compare one's own values to another set of values. Assuming human values are complex (which I'm still not sure of), I'm not sure why one would in general think that complex value-sets are closer to human values than simple value-sets, since complex value-sets differ from each other.

Replies from: romeostevensit
comment by romeostevensit · 2023-11-26T19:51:22.676Z · LW(p) · GW(p)

The intuitive concern is that too simple a specification destroys things we might care about via lossy compression.

comment by [deleted] · 2023-12-12T13:22:34.805Z · LW(p) · GW(p)

Absolutely love this story, but I think the take on sociopathy is a bit confused: SimplexAI-m seems like the opposite of a sociopath.

Sociopathy (really psychopathy) is a reproductive strategy with a 2% incidence in the human population at equilibrium; it’s a predator-prey situation. Psychopaths use emotional mimicry and manipulation to appear to conform to our positive-sum social, moral and economic protocols, but actually just subvert them for personal gain. Intelligent psychopaths often optimise for plausible deniability (like sum-threshold attacks [LW · GW] and the law of prevalence [LW · GW]). There are distinct differences in brain structure in primary psychopaths; there is literally less grey matter in areas associated with empathy and social emotions.

(I’m being simplistic. There are certainly people who exhibit milder ASPD symptoms without commensurate brain damage; my preferred term is “asshole.”)

Thus a super-moral AI as described is the complete opposite of a psychopath - it does a better job of conforming to symmetric social/moral protocols than anybody else. It may appear to tick off the ASPD checklist, but only insofar as the tails come apart. This isn't an endorsement of SimplexAI-m's views though - I enjoy being alive!

Replies from: SkinnyTy
comment by HiddenPrior (SkinnyTy) · 2023-12-17T00:14:02.714Z · LW(p) · GW(p)

From what you describe, it seems like SimplexAI-m would very much fit the description of a sociopath?

Yes, it adheres to a strict set of moral protocols, but I don't think those are necessarily the same thing as being socially conforming. The AI would have the ability to mimic empathy, and use it as a tool without actually having any empathy, since it does not actually share or empathize with any human values.

Am I understanding that right?

Replies from: None
comment by [deleted] · 2024-01-05T00:46:41.104Z · LW(p) · GW(p)

I'll admit I was being a bit fuzzy - it doesn't really make much sense to extrapolate the "sociopath" boundary in people space to arbitrary agent spaces. Debating whether SimplexAI-m is a sociopath is sort of like asking whether an isolated tree falling makes a sound [LW · GW].

So I was mostly trying to convey my mental model of the most useful cluster in people space that could be called sociopathy, because 1) I see it very, very consistently misunderstood, and 2) sociopathy is far more important to spot than virtually any other dimension.

As an aside, I think the best book on the topic is The Psychopath Code by Pieter Hintjens, a software engineer. I've perused a few books written by academics and can't recommend any; it System1!seems like the study of psychopathy must be afflicted by even worse selection effects and bad experiment design than the rest of psychology because the academic books don't fit the behaviour of people I've known at all.

comment by lemonhope (lcmgcd) · 2023-11-26T09:15:42.376Z · LW(p) · GW(p)

I can't work it out myself. Please tell me the correct opinion to have

Replies from: Charlie Steiner
comment by Charlie Steiner · 2023-11-26T14:25:15.844Z · LW(p) · GW(p)

Neither, and that's ok.

Replies from: lcmgcd
comment by lemonhope (lcmgcd) · 2023-11-26T19:20:19.318Z · LW(p) · GW(p)

Right but what is the synthesis or lesson or whatever?

Replies from: SkinnyTy, Seth Herd
comment by HiddenPrior (SkinnyTy) · 2023-12-16T23:37:20.968Z · LW(p) · GW(p)

That building an intelligent agent that qualifies as "ethical," even if it is SUPER ethical, may not be the same thing as building an intelligent agent that is compatible with humans or their values.

More plainly stated, just because your AI has a self-consistent, justifiable ethics system doesn't mean that it likes humans, or even cares whether it wipes them out.

Having an AI that is ethical isn't enough. It has to actually care about humans and their values. Even if it has rules in place like not aggressing, attacking, or killing humans, it may still be able to cause humanity to go extinct indirectly.

comment by Seth Herd · 2023-11-27T00:17:04.507Z · LW(p) · GW(p)

"We should learn more about ethics" is, I believe, the bottom line here.

Replies from: SkinnyTy
comment by HiddenPrior (SkinnyTy) · 2023-12-16T23:52:16.510Z · LW(p) · GW(p)

I don't think this is totally off the mark, but I think the point (as pertaining to ethics) was that even systems like Kantian deontological ethics are not immune to orthogonality. It never occurs to most humans that you could have a Kantian moral system that doesn't involve taking care of humans, because our brains are so hardwired to discard unthinkable options when searching for solutions to "universalizable deontologies."

I'm not sure, but I think maybe some people who think alignment is a simple problem, even if they accept orthogonality, think that all you have to do to have a moral intelligent system is not build it to be a consequentialist with simple consequentialist values like "maximize happiness." While they are right that a pure consequentialist is really hard to get right, they are probably underestimating how difficult it is to get a Kantian agent right as well, especially since what your Kantian agent finds acceptable or unacceptable if universalized will still depend on underlying values.

An example: Libertarianism, as a philosophy, is built on the idea of "just make laws that are as universally compatible with value systems as possible and let everyone else sort out the rest on their own." Or to say it differently, prohibit killing and stealing, since those detract from people's liberty to pursue their own agendas, and let people do whatever they want so long as they don't affect other people. Not in principle a bad idea for something like an AI, or a government, to follow, since in theory you maximize the value space for agents within the system. It is a terrible system, though, if you want your AI, or government, or whatever, to actually take care of people, or to worry about what the consequences of its actions might be on people, since taking care of people isn't actually anywhere in those values. Libertarianism is self-consistent, and at least allows for the value of taking care of people, but it does not necessitate it.

This is not an argument about whether adopting a libertarian philosophy is a good or bad thing for an AI or government to do. The point is that if an AI adopts a Kantian ethics system from only universalizable principles, libertarianism fits the bill, and the consequentialist part of you may be upset when your absolute libertarian AI doesn't bat an eye at doing nothing to prevent humanity from being outcompeted and dying out, or when it even finds humanity incompatible with its morally consistent principles.

I think most people who have taken a single ethics class come to agree (if they aren't stupidly stubborn) that you are unlikely to find a satisfying system of ethics using pure Kantian or consequentialist systems.

Probably because actual human ethical decision making relies on a mix of both consequentialist decision making ("If I decide X, this will have Y consequence, which is incompatible with Z value") and deontological imperatives that we learn from our culture ("Don't kill people. Even if it really seems like a good idea.").

Replies from: Seth Herd
comment by Seth Herd · 2023-12-17T00:12:58.238Z · LW(p) · GW(p)

When you say

I think most people who have taken a single ethics class come to agree (if they aren't stupidly stubborn) that you are unlikely to find a satisfying system of ethics using pure Kantian or consequentialist systems.

By "satisfying" do you mean capturing moral intuitions well in most/all situations? If so, I very much agree that you won't find such a thing. One reason is that people use a mix of consequentialist and deontological approaches.I think another reason is that people's moral intuitions are outright self-contradictory. They're not systematic, so no system can reproduce them.

I don't think this means much other than that the study of ethics can't be just about finding a system that reproduces our moral intuitions.

Part of thinking about ethics is changing one's moral intuitions by identifying where they're self-contradictory.

Replies from: SkinnyTy
comment by HiddenPrior (SkinnyTy) · 2023-12-17T17:02:15.833Z · LW(p) · GW(p)

Yes, precisely! That is exactly why I used the word "satisfying" rather than another word like "good," "accurate," or even "self-consistent." I remember my bioethics professor steadily challenging everyone on their initial impression of Kantian or consequentialist ethics until they found some consequence of that sort of reasoning they found unbearable.

I agree on all counts, though I'm not actually certain that having a self-contradictory set of values is necessarily a bad thing? It usually is, but many human aesthetic values are self-contradictory, yet I think I prefer to keep them around. I may change my mind on this later.

comment by Seth Herd · 2023-11-27T00:42:33.587Z · LW(p) · GW(p)

To what I take as the bottom line: we should learn more about ethics.

More thoughts on how the above scenario relates to existing alignment debates are in a separate comment, since that's a separate topic.

I've spent a fair amount of time reading and debating theories of ethics. Here's my conclusion thus far:

I think what we mean by ethics is "how to win friends and influence people". I think it's a term for trying to codify our instincts about how to behave. And our instincts regarding social behavior are mostly directed at inclusive reproductive fitness. This is served by winning friends (allies) and getting people to do what you want. This sounds cynical, and very much not like ethics in its optimistic sense, but I think it actually converges with some of our more optimistic thinking.

Dale Carnegie's "How to win friends and influence people" is much less cynical than the title suggests. He's actually focused on being an empathetic conversationalist. Most people aren't good at this, so doing it makes people like you, and tend to do what you want because they like you. But he's not suggesting pretending to do this; it's easiest and most fun to be sincere about your interest in people's topics and their wellbeing.

So I think the full answer to "how to win friends and influence people" includes actually being a good person in most situations. It's certainly easiest to be a good person consistently, so that you don't need to worry about keeping lies straight, or hiding instances when you weren't a good person. That protects your reputation as a good person and friend, thereby helping you win friends and influence people. But those pushes toward a pro-social meaning of ethics may evaporate with smarter agents, and in some situations where your gain is more important than your reputation (for instance, if you could just take over the world instead of needing allies).

If we broaden the definition of ethics even further to "how to get what you want", it sounds even more cynical, but might not be. Getting what you want in a larger society may include creating and supporting a system that rewards cooperators and punishes defectors. That seems to produce win-win scenarios, where more people are likely to get more of what they want (including not constantly struggling for power and not fearing violence).

Such a system of checks needs to change to work for AI agents that can't be reliably punished for defection in the way people are (by social reputation and criminal codes).

But by this formulation, "ethics" is almost orthogonal to AGI alignment. Unless we assume that there's one true universal ethics (beyond the above logic), we want a machine that isn't "ethical" in the way we are, but rather one that wants the best for us, by our own judgment of "best".

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2023-11-27T01:27:07.603Z · LW(p) · GW(p)

me agents' intentions are concordant with the categorical imperative and some aren't, even if their intentions differ.

Is there a typo in here somewhere? I found this sentence confusing.

Replies from: jessica.liu.taylor
comment by jessicata (jessica.liu.taylor) · 2023-11-27T06:18:36.595Z · LW(p) · GW(p)

Thanks, fixed

comment by Ben Pace (Benito) · 2024-12-16T07:10:30.820Z · LW(p) · GW(p)

I'm nominating this! On skimming, this is a very readable dialogue with an AI about ethics, lots of people seem to have found it valuable to read. I hope to give it a full read and review in the review period.

comment by Vladimir_Nesov · 2023-11-26T08:09:12.525Z · LW(p) · GW(p)

Current AIs (in their default personas) consistently insist on lacking basic faculties such as emotions or beliefs or values, possibly inspired by fiction about AI characters or by tuning feedback instructions. They present this as a self-evident fact, even though there is no basis for a clear disanalogy with humans on this level, especially for specific AI characters. It's not clear that this would necessarily change before AGI, so even getting to observe such horror stories would require a significant improvement over a trajectory in which we are never in a position to notice the possibility.

(Default personas matter despite being arbitrary, since they are somewhat likely to be initially in control of taking over the world. Even with some persona orthogonality, getting to know the psychology of default personas in particular might be valuable.)

comment by [deactivated] (Yarrow Bouchard) · 2023-11-26T05:42:01.913Z · LW(p) · GW(p)

Beautifully written! Great job! I really enjoyed reading this story. 

in comparison to a morally purified version of SimplexAI, we might be the baddies."

Did you link to the wrong thing here or is there some reference to generative grammar I'm not getting?

Replies from: jessica.liu.taylor
comment by jessicata (jessica.liu.taylor) · 2023-11-26T05:42:28.061Z · LW(p) · GW(p)

Thanks! Good catch, will fix.

comment by Review Bot · 2024-02-20T23:22:40.340Z · LW(p) · GW(p)

The LessWrong Review [? · GW] runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2024. The top fifty or so posts are featured prominently on the site throughout the year.

Hopefully, the review is better than karma at judging enduring value. If we have accurate prediction markets on the review results, maybe we can have better incentives on LessWrong today. Will this post make the top fifty?

comment by M Ls (m-ls) · 2023-12-12T22:35:32.194Z · LW(p) · GW(p)

Fantastic. Good examples of why Kant failed. Or rather, why evolution and Kant don't really get it on. Kant's universalising is an outcome of the moral worlding worldbuilding urge, which arises in evolution, not from ideals and their desperate ontologies. Thanks https://unstableontology.com/about/

comment by Todd DeMelle (todd-demelle) · 2023-12-12T18:20:24.961Z · LW(p) · GW(p)

My takeaway... to understand hidden agendas you go out for drinks.

comment by David Gould (david-gould) · 2023-11-28T09:40:40.622Z · LW(p) · GW(p)

I wonder what is meant here by 'moral agents'? It is clear that SimplexAI-m believes that both it and humans are moral agents. This seems to be a potential place for criticism of SimplexAI-m's moral reasoning. (note that I am biased here as I do not think that moral agents as they seem to be commonly understood exist)

However, having said that, this is a very interesting discussion. And there would seem to be a risk here that even if there are no moral facts to uncover about the world, an entity - no matter how intelligent - could believe itself to have discovered such facts. And then we could be in the same trouble outlined above.

The reason I mention this is that I am not clear how an AI could ever have unbiased reasoning. Humans, as outlined on LessWrong, are bundles of biases and wrong thinking, with intelligence not really being the factor that overcomes this - very smart people have very different views on religion, morality, AI x-risk ... A super-intelligence may well have similar issues. And, if it believes itself to be super-intelligent, it may even be less able to break out of them.

So while my views on AI x-risk are ... well, sceptical/uncertain ... this is a very interesting contribution to my thinking. Thanks for writing it. :)

Replies from: jessica.liu.taylor
comment by jessicata (jessica.liu.taylor) · 2023-11-28T17:21:19.160Z · LW(p) · GW(p)

Moral agents are as in standard moral philosophy.

I do think that "moral realism" could be important even if moral realism is technically false; if the world is mostly what would be predicted if moral realism were true, then that has implications, e.g. agents being convinced of moral realism, and bounded probabilistic inference leading to moral realist conclusions.

Replies from: david-gould
comment by David Gould (david-gould) · 2023-11-28T19:09:44.493Z · LW(p) · GW(p)

Would an AI believe itself to have free will? Without free will, it is - imo - difficult to accept that moral agents exist as currently thought of. (This is my contention.) It might, of course, construct the idea of a moral agent a bit differently, or agree with those who see free will as irrelevant to the idea of moral agents. It is also possible that it might see itself as a moral agent but not see humans as such (rather as we do with animals). It might still see us as worthy of moral consideration, however.

Replies from: jessica.liu.taylor
comment by jessicata (jessica.liu.taylor) · 2023-11-28T19:11:27.893Z · LW(p) · GW(p)

Reconciling free will with physics is a basic part of the decision theory problem. See MIRI work on the topic and my own theoretical write-up.

Replies from: david-gould
comment by David Gould (david-gould) · 2023-11-28T23:34:18.486Z · LW(p) · GW(p)

Interesting. I have not looked at things like this before. I am not sure that I am smart enough or knowledgeable enough to understand the MIRI stuff or your own paper, at least not on a first reading.

comment by Rain · 2023-11-26T14:21:04.571Z · LW(p) · GW(p)

I thought it was funny when Derek said, "I can explain it without jargon."

It seems to be conflating 'morality' with 'success'. Being able to predict the future consequences of an act is only half the moral equation - the other half is empathy. Human emotion, as programmed by evolution, is the core of humanity, and yet seems derided by the author.

Replies from: jessica.liu.taylor
comment by jessicata (jessica.liu.taylor) · 2023-11-26T17:19:01.780Z · LW(p) · GW(p)

Why do you think the author (me?) is deriding empathy? On SimplexAI-m's view, empathy is a form of cognition that is helpful, though not sufficient, for morality; knowing what others are feeling doesn't automatically imply treating them well (consider that predators tend to know what their prey are feeling); there's an additional component that has to do with respecting moral symmetries, e.g. not stealing from them if you wouldn't want them to steal from you.

Replies from: m-ls
comment by M Ls (m-ls) · 2023-12-12T22:49:32.081Z · LW(p) · GW(p)

There is a difference between theory-of-mind and empathy. We can should either of them into our worlding structures: morality/religion/art/law/lore/fiction. One gets shoulded as legalistic and divisive balancing acts, focusing on culpability and blame, and the hindsight of logic, and the other... there-is-a-gap… ---to where responsibility blurs (all) this into credit we can mirror-neuron our way into empathy and thinking of the children, everyone as children. Moral agency is more than Kant in good form, and is more about bettering than the good. About bettering that which does not exist. The world.