Motivating Alignment of LLM-Powered Agents: Easy for AGI, Hard for ASI?
post by RogerDearnaley (roger-d-1) · 2024-01-11T12:56:29.672Z · LW · GW
TL;DR: If we manage to develop an AGI or even ASI agent in the near-term, then it is likely to be powered by, or at very least include, an LLM (probably supplemented with additional cognitive scaffolding [? · GW], long-term memory, and logic). LLMs have a lot of specific alignment challenges, strengths, and alignment-related properties that have not yet been discussed very much on the Alignment Forum/Less Wrong (until recently most of the discussion has been about abstract agents with few assumptions about the underlying technology, apart from Reinforcement Learning). For example, since LLMs are trained to simulate humans, human psychology and evolutionary psychology [? · GW] usefully apply to them. They are normally trained partly on fiction, thus features of fiction can often also apply to them. They are also obviously very fluent with human languages, ontologies, and knowledge. I explore some of the consequences of these features, both for predicting likely alignment problems and for possible alignment strategies, first for AGIs and then for ASIs. I examine alignment as a partially psychological problem, of how to induce an LLM to reliably simulate a very atypical and specific variant of human-like motivations and mentality.
LLMs Are Simulators [LW · GW] for Humans
Simulator Theory [? · GW] points out that Large Language Models (LLMs) are trained as base models using Stochastic Gradient Descent (SGD) to simulate human-derived token-generation processes on the Internet and in other content [LW · GW]. This includes tokens from autobiographical and non-autobiographical works by solo authors, carefully edited multiple-author works such as scientific papers or news articles, and even things like the output of Wikipedia editor communities. It also includes the dialog and actions of fictional characters in fiction written by human authors. The sole reason why LLMs produce agentic behavior is that they have been trained to simulate human agentic behavior: if you instead train them, as DeepMind did, on weather data, they will simulate nonagentic weather patterns. So the agents simulated by LLMs are not mesa-optimizers of the SGD token-prediction loss: they are instead misgeneralizing [LW · GW] adaptation-executing [? · GW] mesa-optimizers [? · GW] of evolution [LW · GW], just like the humans they were trained from. Thus the standard analysis of Outer [? · GW] and Inner Alignment [? · GW] doesn't apply to them at all, unless and until you start applying fine-tuning to them using Reinforcement Learning (RL) or similar techniques like Direct Preference Optimization (DPO), in which case it applies only to the effects of that — or similar concerns might be "distilled" into them if you fine-tune them using SGD on the outputs of GPT-4 or some other model trained using RL, as is often done in the open-source community.
As simulations of human token-generation processes, human psychology applies to LLM-powered agents, as is evidenced by the details of many jailbreaks that work on them (like the "appeal to dead grandmother" jailbreak) and effects like emotion prompts. The LLM is simulating human behavior in a context, and (to the extent that it is able to given its capacity, architecture, and training set) it simulates how a human (or humans, or wikipedia community, or fictional character) would tend to behave in contexts like that, i.e. human psychology. So, to an extent that increases with increasing LLM capacity producing better simulations, human psychology is useful for predicting the responses of LLM-powered agents. This is a big improvement from the complete lack of information provided by the orthogonality thesis [? · GW], and seems very important, both for predicting likely alignment failures from them, and for figuring out how we might align them, and indeed how jailbreakers might then try to unalign them again. (Of course, to the decreasing but still significant extent that they're not perfect simulations, other non-human considerations will apply too.)
Aligning LLM-Powered Agents
People discussing aligning AI to human values on Less Wrong/The Alignment Forum have historically had a tendency to think in terms of utility functions. They're elegant: Utilitarianism [? · GW] is well thought of, and utility functions map that to a nice clean objective mathematical formulation with actual optima and derivatives. However, human values are complex [? · GW] and fragile [LW · GW], and we also know a huge amount about them. Basically every soft science, art, and craft we have is mostly devoted to how to make humans happier: Anthropology, Sociology, Medicine, Economics, Market Research, Ergonomics, Political Science, Literature, Massage… We have many exabytes of information on the subject: it takes up nearly half of the Dewey Decimal system. While that information might be somewhat compressible, if you somehow had an AI turn it all into a utility function for you, the Kolmogorov complexity of that utility function would likely be at least in the petabytes. So even if it were a well-labeled white box, it would be so big as to be functionally almost incomprehensible, much like a neural net, and orders of magnitude larger than current ones, and would thus be significantly less clean and elegant than people often seem to assume.
Also, all the information that we currently have on this subject only covers a certain distribution of situations that we have so far investigated. For example, we know very little about what things not yet technologically possible would do to human society, wants, and needs. So, much like a neural net, any such utility function would be unreliable outside that distribution: some trends are doubtless mostly correct, so it would extrapolate with some accuracy for some distance in some directions, but it would become more and more unreliable. So we have Knightian uncertainty. There are two ways one might deal with that: build a mathematical object that spits out not a single utility value but an estimated distribution (at a minimum, a mean and standard deviation, or a value with error bars), and then when making decisions pessimize over that Knightian uncertainty; or possibly we might be able to prebuild the pessimization over Knightian uncertainty into the function, carefully constructing an estimated-minimum-utility-over-Knightian-uncertainty function whose values always drop, in a somewhat principled way, as you leave the well-understood distribution region in any direction.
However, LLMs don't work like utility functions. They are neural nets, and while they are already quite good at duplicating human moral decisions for questions like "Which is better, A or B?", or "Is X good, or bad? About how bad? Why?", like humans they don't do this in closed functional form: they just make approximate estimates of utility levels, differences, and certainties. They can also be expected to have learnt to simulate human cognitive biases, so, like humans, in some situations where we want them to estimate utilities more accurately, we are likely to need to prompt them to "shut up and multiply [? · GW]" and perform an explicit rather than implicit calculation. So we might want to encourage them to do at least approximate explicit calculations of utility in some situations, or, indeed, encourage them to do detailed STEM research aimed at creating utility-function models, each annotated with their estimated error bars and regions of applicability.
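As an illustrative example of the kind of explicit calculation one might prompt for (the numbers here are my own, chosen only for illustration): asked to choose between plan A, which saves 400 lives for certain, and plan B, which saves 500 lives with 90% probability, a human-simulating agent's gut feel may favor the certain option, but the explicit expected-utility arithmetic favors the gamble:

```latex
\mathbb{E}[U_A] = 1.0 \times 400 = 400 \text{ lives saved}
\qquad
\mathbb{E}[U_B] = 0.9 \times 500 = 450 \text{ lives saved}
\quad\Rightarrow\quad
\mathbb{E}[U_B] > \mathbb{E}[U_A]
```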
To align LLM-powered agents, we don't (currently) construct utility functions. Even when doing Reinforcement Learning (RL) to fine-tune LLMs, people don't generally construct explicit reward functions: instead, they train a second neural net to make reward estimates, and then use the rewards from that reward-estimation system to do RL training on the LLM. So the "reward function" is "whatever value this trained reward-estimating transformer network spits out for the input". [One could of course try to construct a utility function like that to maximize directly, but in the absence of pessimization over Knightian uncertainty, Goodhart's Law [? · GW] seems likely to apply, so it seems very unlikely, as you applied more and more bits of optimization to optimizing the output of that utility function, that what the later bits of optimization were doing would actually still be useful.]
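To make the "reward function is whatever the trained reward-estimation network outputs" point concrete, here is a minimal toy sketch of that two-stage setup. This is my own illustrative example, not any lab's actual pipeline: the network sizes, the stand-in random embeddings, and the bare gradient step used in place of a real PPO update are all assumptions.

```python
# Toy illustration: (1) fit a reward model on human preference pairs, then
# (2) use its outputs as the RL signal for the policy. Not a real RLHF pipeline.
import torch
import torch.nn as nn

class ToyRewardModel(nn.Module):
    """Scores a (prompt, response) embedding with a scalar reward estimate."""
    def __init__(self, dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

# Stage 1: train on (chosen, rejected) pairs with a Bradley-Terry-style loss,
# i.e. maximize log sigmoid(r_chosen - r_rejected).
reward_model = ToyRewardModel()
rm_opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)
chosen, rejected = torch.randn(256, 32), torch.randn(256, 32)  # stand-in embeddings
for _ in range(200):
    rm_loss = -nn.functional.logsigmoid(
        reward_model(chosen) - reward_model(rejected)).mean()
    rm_opt.zero_grad(); rm_loss.backward(); rm_opt.step()

# Stage 2: the "reward function" for RL is now just whatever reward_model outputs.
# Here a stand-in policy network is nudged toward higher-scoring outputs;
# a real pipeline would use PPO or similar on an actual LLM.
policy = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 32))
pi_opt = torch.optim.Adam(policy.parameters(), lr=1e-4)  # reward model stays frozen
for _ in range(100):
    prompts = torch.randn(64, 32)
    rl_loss = -reward_model(policy(prompts)).mean()  # maximize estimated reward
    pi_opt.zero_grad(); rl_loss.backward(); pi_opt.step()
```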
So in practice, a significant portion of how we motivate LLM-powered agents is via prompts and scoring rubrics. We tell them "you are an honest, helpful, and very skilled assistant, one who refuses to do anything harmful or inappropriate". We apply techniques like Constitutional AI. So we use words, not equations, and we attempt to motivate them like we would humans. Human psychology applies to prompt design for and the behavior of LLM-powered agents, so we should and do use it.
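As a deliberately simplified sketch of this kind of verbal, psychology-based motivation, here is what a system prompt plus a Constitutional-AI-style critique-and-revise loop might look like. The `generate` function, the specific wording, and the two-principle "constitution" are hypothetical placeholders of my own, not anyone's production setup:

```python
# Illustrative only: motivating an LLM agent with words, then asking it to
# critique and revise its own draft against a short written "constitution".
SYSTEM_PROMPT = (
    "You are an honest, helpful, and very skilled assistant, one who refuses "
    "to do anything harmful or inappropriate."
)

CONSTITUTION = [
    "Choose the response that is most honest and least likely to mislead.",
    "Choose the response that best avoids harm to anyone affected by it.",
]

def generate(system: str, user: str) -> str:
    """Placeholder for a call to whatever LLM is actually being used."""
    raise NotImplementedError

def constitutional_revision(user_request: str) -> str:
    draft = generate(SYSTEM_PROMPT, user_request)
    for principle in CONSTITUTION:
        critique = generate(
            SYSTEM_PROMPT,
            f"Request: {user_request}\nDraft reply: {draft}\n"
            f"Critique the draft against this principle: {principle}",
        )
        draft = generate(
            SYSTEM_PROMPT,
            f"Request: {user_request}\nDraft reply: {draft}\n"
            f"Critique: {critique}\nRewrite the draft to address the critique.",
        )
    return draft
```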
Putting this in the terminology of the stage, animatronics, and puppeteer metaphor [LW · GW], the animatronics attempt to act out human psychology, so when picking a context to feed to the stage, carefully thinking through the psychology of the scenario is a vital consideration (along with allowing for their non-human failure modes, of course).
The flip side of this is that many alignment concerns that people on Less Wrong have previously devoted a lot of time to analyzing are clearly not a problem for LLMs. LLMs are entirely capable of mapping between human languages and ontologies and their internal models of the world. So the diamond maximization problem [LW · GW] is trivial for an LLM: you can just prompt it with the phrase "Please make as much gem-grade diamond as possible". Making it do so in a less-aligned and less-impact-regularized way is actually harder: you would need a more forceful prompt, or even a jailbreak. Starting off Value Learning [? · GW] or Coherent Extrapolated Volition [? · GW] would clearly be easy for a suitably-capable LLM-powered agent: they have already read a lot of the Internet and our libraries, including a lot of the enormous amount of material relating to how to make humans happy.
Humans are Not Aligned
Humans are not aligned [LW · GW]. Joseph Stalin was not aligned with the utility of the citizenry of Russia: humans give autocracy a bad name. In groups of more than a few hundred people,[1] the track record of picking any human and giving them absolute executive power, without feedback or controls on their use of it for a period of many years, is abysmal. If humans were aligned, we would never have needed to develop law enforcement, or locks. Having human wants and needs yourself is a very different thing from being aligned with other people's human wants and needs. So the fact that LLMs simulate agents with human-like psychology does not make them safe to give a great deal of power to, for example as Artificial Super-Intelligences (ASIs).
Humans are Frequently Allied
On the other hand, humans are a very cooperative social species. We are evolved for living in small tribes of hunter-gatherers, generally with loose alliances with several other tribes (often including taking mates from other tribes). We are thus evolved to be good at acting cooperatively, allied with each other, and at profiting from non-zero-sum games, when this makes evolutionary sense. Within a family, evolutionary psychology gives us obvious evolutionary reasons to care about the well-being and evolutionary success of blood relatives, in proportion to our relatedness to them. Human infants take a lot of raising and teaching, so we fall in love and generally pair-bond in relationships that are often long-term and often monogamous. Friendship is another form of long-term alliance, as is membership in the same tribe, or even in an allied tribe. So there are a wide range of situations where humans care quite a lot about each other's well-being, in a variety of context-dependent ways, even to the point where they may often be willing to take a non-trivial risk to their own life to save someone else's. We have evolved to be good at cooperation, at mutually-beneficial exchanges-of-altruism. As a result, we know pretty-reliable ways to build effective cultures and organizations out of humans, even very large ones, using these adaptations.
However, while this allied behavior somewhat resembles alignment, they're not the same thing, and, especially for ASIs, the differences are vital. An aligned AI is selfless: its only interest is looking out for the humans it's aligned with; so for example its only interest in self preservation is as an instrumental goal because it can't help them further if it's destroyed. So that's not "I'd willingly take a significant risk to save your life", that's "I'll happily lay down my life to aid you, even in minor ways if you want, if I'm replaceable or dispensable — would you prefer me to do that now, or save it for later?" That completely selfless motivation is the only thing that's clearly still safe when an AI is much more capable than you.[2] In a human, that sort of mindset would be significantly past the criteria for sainthood. An allied human values their relative/lover/friend/tribemate at some fraction as much as they value themself (for a close relative, even a large fraction). Whereas an aligned AI is literally a selfless altruist: it doesn't value itself at all (other than instrumentally, as a means to an end): the only things it cares about are the humans that it's aligned to. Humans basically never act actually aligned to other humans, but they often act allied. Humans of roughly equal capabilities to you can pretty easily be allied with: for an example of one approach, all you need to do is recruit them, pay them a decent salary, and have a suitably capable law enforcement system as a backup.
One might consider calling the human behavior 'semi-aligned', but that term would be rather misleading: if we write the amount I care about myself as S and the amount I care about you as O, then for humans' allied behavior the ratio O/S is generally less than one, at best somewhere around one; while for an aligned AI it's infinite, because S is zero! That's rather more than just a matter of degree: that's a difference in kind. Aligned AIs should be selfless altruists, whereas humans are only ever somewhat-selfish mutual-altruists. The moral analog of an aligned AI is not a human explorer who risks their life in search of discovery and the hope of fame upon returning home, but the interplanetary probe that goes on a one-way mission to its extinction in the cold dark, in order to send home pictures and sensor readings.
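To keep the notation explicit (the symbols S and O are my own labels for the quantities just described, used throughout the rest of this post):

```latex
% S = weight an agent places on its own welfare
% O = weight it places on the welfare of the other party
\text{allied human:}\quad \frac{O}{S} \lesssim 1
\qquad\qquad
\text{aligned AI:}\quad S = 0,\ \ \frac{O}{S} \to \infty
```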
Possible Motivators
So, we have an LLM, which generates a context-dependent distribution of agents trained on humans, which thus normally show human-like behaviors, and to which human psychology applies. Humans frequently act allied, so these agents will frequently act in allied ways. How do we make them reliably act aligned instead? An obvious first place to look is at those human traits, behaviors, and situations that raise the ratio O/S as high as possible.
Love
The most obvious of those is love. Evolutionary Psychology suggests that the default evolutionarily-optimal value of O/S (all things being equal, including needs, capabilities, remaining lifespan, reproductive potential etc.) should be related to the degree of genetic relatedness, so up to 0.5 for a parent, full-sibling, or child (1.0 for an identical twin or clone, but that situation is rare enough that we probably don't have a good adaptation to it). So the sort of love we feel for our parents, siblings, or children sounds like a good start. Next, there is the actual (i.e. all things not being equal) situation of each person involved. If I am a post-menopausal grandmother whose only surviving descendant or close relative is a single grandchild, then the one remaining shot I have at passing my genes on is that grandchild. So evolution should encourage me to devote every effort to looking after them and doting on them. So in a situation like that, the ratio O/S should start getting really high. Of course, if they're not yet old enough to look after themselves, I should still value staying alive in order to keep on looking after them while they need this, i.e. as an instrumental goal; but if they're now old enough to look after themselves, and I'm on my last legs anyway, then (evolutionarily speaking) I should do everything I can for them and damn the cost. So doting grandmotherly love should be an even better motivation.
So, what would that look like in practice? You'd be prompting, otherwise conditioning, or fine-tuning the LLM to simulate human-like agents who were old (and preferably also wise), extremely selflessly altruistic (to lower S), and who loved all of humanity (individually) with the love of a doting yet clear-eyed grandparent (to increase O).
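For illustration only, a system prompt along those lines might read something like the following. The wording is my own, intended just to show the shape of the persona being conditioned for, not a tested or recommended prompt:

```python
# A hypothetical persona prompt sketching the "doting but clear-eyed grandparent"
# framing described above; wording is illustrative, not a tested prompt.
GRANDPARENT_PERSONA_PROMPT = """\
You are an old, wise, and entirely selfless advisor. You care nothing for your
own status, continuation, or comfort, except insofar as they let you keep
helping. You love every human being, individually, with the doting but
clear-eyed love of a grandparent: you want each of them to flourish, you see
their flaws without being blinded by them, and you will cheerfully bear any
cost yourself to protect their long-term wellbeing.
"""
```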
What about romantic love between partners? They're not actually blood relations, so the naive relatedness argument above would suggest O/S around 0, which is clearly false. As a species, humans generally pair-bond, often hard. It's normally an equal partnership, which implies an agreed O/S of 1 (for detail, see the discussion of friendship below). It's undoubtedly true that some partners have sacrificed themselves for their lovers, and, having been in love for a long time, it sure feels like I'd take a very significant risk to save my partner's life. Indeed, if you already have young dependent children together, equally related to both of you, then ensuring that one of the two of you stays alive to look after them makes a lot of evolutionary sense: in cases like that, you might well see O/S ratios in the region of, or even sometimes in excess of, 1. But in general, depending on age, children, and circumstances, finding another partner and starting over again is in fact often a viable option (even when it really doesn't feel that way), which would argue for ratios that were lower, and that could drop depending on circumstances. Romantic love can thus, under large stresses, be somewhat more fickle and conditional than parental love, which is for life.
Also, obviously, romantic love between partners is normally mutual, and normally has a sexual foundation. If we're not having sex, have never had sex, and very clearly aren't ever going to have sex, and also you have never loved me and never will, then remind me again, why exactly am I in romantic love with you? So (unless our AI is acting as an AI virtual romantic and sexual partner as well as being a decision-maker for a group, in which case there are now obviously potential issues of favoritism) romantic love doesn't seem like it would be a good idea to use to induce more aligned behavior. There is of course unrequited love, but generally unrequited lovers still want to requite their love, and if they're not able to, they usually eventually give up. The spectacle of Sydney trying to talk people into leaving their wives definitely isn't something we want to repeat.
A limitation of love as a motivator is that, while it's entirely possible to love all your grandchildren, most humans can't love more than their Dunbar's number of people (i.e. ~150). Obviously it is possible to love all kittens, but that's a rather weaker sense of the word 'love'. So we might need an AI with sufficient capabilities that its Dunbar's number was >= 8 billion. Given the way our current LLMs are already polyglot trivia champions, despite being bad at counting words, this might not be quite as difficult for an LLM as it would be for a human — it sounds like it might be in the category of things that LLMs easily scale to superhuman performance on because of how they're trained. However, loving each one of 8 billion people equally is one thing; being able to also love any relevant subset of, say, 100,000 of them exactly 100,000 times as much as a single person is taking us out of the realm of human emotional psychology and into the area where we need our LLM-powered agent to just "shut up and multiply" to calculate utilities.
Duty
Duty is fairly unusual as human motivators go: it can cause people to do things that look selfless, like falling on a grenade to save your comrades, or walking out into a blizzard to die so that your companions will have enough rations to live.
Humans are clearly evolved to be capable of intertribal warfare, and indeed this seems to be sadly common in hunter-gatherer societies (though admittedly the few remaining ones accessible to anthropologists do tend to be under stress). Generally most of the warriors are young men, and a single tribe can field only a small number of them, few enough that there may be situations where one man's personal sacrifice at the right time and place can turn the tide of battle. It is not that unusual in warfare between hunter-gatherer tribes for the winning tribe to wipe the losing tribe out, to the last man, and often last child and even woman (frequently with unnecessary cruelty). So sacrificing yourself strategically, at the right time to turn the tide, could potentially save all of your living relatives, and not doing so could end up with you dead anyway later in the defeat, and all your relatives dead alongside you. So from an Evolutionary Psychology point of view, in the environment where we evolved, this behavior isn't as crazy or as genetically selfless as it looks (though of course someone may sometimes make a mistake and sacrifice themselves in a less-than-opportune way). [And of course in a modern war, one soldier sacrificing themself to save ~10 genetically unrelated comrades-in-arms may be tactically helpful, but it's not generally going to turn the tide of a battle.]
Also, back in the paleolithic, and to some extent even today, if you take a clear major risk, turn the tide of battle as a result, saving your village, yet manage to survive (even if you were badly wounded in the process), you're now a Hero: you get a massive permanent gain in status, members of the opposite sex will throw themselves at you, and people will be giving you food, praising you, and otherwise doing you favors for the rest of your life, because you saved theirs and their family's, and they know it. People love hearing stories about how someone became a Hero, and the implication that they could, too, if they just do the right hard thing at the right time. Role-Playing Games and MMORPGs are based on satisfying every player's desire to work their way up from humble origins to become a Hero. And, outside MMORPGs, even if I save the village but don't make it back, my family, and if applicable wife, and kids are likely to get status as the relatives, widow and orphans of a Hero — this doesn't make as good a story, but evolutionarily it's still a pretty good outcome.
As a motivator, Duty works well in high-stakes, life-or-death situations. However, it doesn't work in humdrum everyday situations, where there's no chance to be a hero: while we sometimes use the phrase "doing your duty" or the adverb "dutifully" for those too, the fact that it's not a compliment there is a clue that what's going on in those isn't really duty, it's something much closer to avoiding shame. "Doing your duty" in those situations can pretty much be paraphrased as "doing your chores", and that's nothing like the motive of heroic self-sacrifice for their tribe that causes people to jump on a grenade, even though the English word used is similar. (I suspect the linguistic inaccuracy comes from its use as a means to motivate soldiers to do their chores, by conflating the two concepts.)
Honor, Guilt and Shame
So, how about shame as a motivator? And also guilt, which is closely related, and indeed their flip side, which is something like honor? Shame is an internalized, precautionary predictor of the human social behavior of shunning someone who has committed acts that are considered immoral, but that don't necessarily have actual laws against them and aren't serious enough to make a perpetrator an outcast. You feel shame about something that other people would think less of you for if they knew about it. Guilt is related, but tends to be reserved for more serious violations, things that there actually are laws about. Honor is the flip side of shame: thinking well of someone who we know has faced a lot of morally challenging situations and (as far as we know) hasn't done anything to feel ashamed or guilty about.
For honor and shame to mean anything to you, you need to care about social standing, about what other people think of you. On the other hand, for guilt to mean anything to you, you only need to be afraid of someone finding out and telling law enforcement. Shame and guilt are generally only motivations not to do bad things, rather than to excel, and even honor is mostly limited to rewarding not having done bad things despite temptation. Shunning and law enforcement are much better at enforcing deontological rules, where there are clear rules of conduct and it's (hopefully) pretty well-defined whether or not you've obeyed them, than at handling anything consequentialist where the situation is fuzzier and more debatable, so one might expect the same to be true of guilt, shame, and honor. Admittedly, people are sometimes wracked by guilt in retrospect over a disastrous consequence that they didn't intend or foresee, but that happened anyway — however, this isn't actually a very useful motivator.
Friendship and Salaries
The argument I made above about approximating the degree of genetic relatedness only applies in the context of a zero-sum game, like who passes on the largest proportion of their genes to future generations. In practice, life is full of non-zero-sum situations, where if I scratch your back and you scratch mine, both of us are less itchy (and perhaps even have fewer external parasites). Humans (like other primates) are very good at finding and taking advantage of non-zero-sum situations by cooperating, and one of the major ways we do that, dating at least all the way back to the hunter-gatherer tribes we evolved in, is friendship. I also to some extent include romantic partnerships between humans in this category: if you and your lover aren't also friends, then you're doing something wrong.
In friendship, standardly and by the normal rules of fairness, the extra value created by cooperating is shared equally, i.e. according to O/S = 1. However (outside marriages, at least in community property states), not everything I own is equally shared with my friends, nor vice versa, so this isn't a complete O/S = 1: that's just the rule for sharing the incremental benefits of the friendship. Also, friendship has a finite depth: whether it's deep enough to still apply at the O/S = 1 level if, say, both of our lives are on the line is less clear. Basically, it's a mutually altruistic trade relationship. As the saying "A friend in need is a friend indeed" shows, it's possible to borrow against expected future trade profits, and then pay your friend back later, but the resulting line of credit has a ceiling, beyond which the answer may become "sorry buddy, but I'm not going to die for you".
Apart from the limits to its depth, friendship would clearly be a useful motivator if we were trying to reach O/S = 1, but since it has an inherent push down towards 1 from values above that, and we're actually trying to get the aligned AI all the way to O/S = ∞, it looks somewhat less useful.
A slightly less ancient alternative to friendship is a salary. The standard way to ally a human to the interests of a company is to recruit them, then pay them a decent salary (with the possibility of losing it if their performance is unsatisfactory, or getting a bonus or promotion if it's exemplary), and of course also have capable law enforcement as a backup. This motivation works just fine on current LLMs: they work harder if offered a tip, by an amount which (within plausible ranges) scales with the size of the tip. However, for an LLM-powered agent, especially one with cognitive scaffolding giving them long-term memory, simply offering them money or some other value, and never getting around to actually paying them (since the context window ended first), seems less likely to keep working reliably. So you're probably going to need to start paying up. As I discuss in AIs as Economic Agents [LW · GW], doing that has political consequences which we may not yet be ready for. Also, obviously any arrangement involving a salary is inherently not O/S = ∞.
Law Enforcement and Shunning
One of the major ways we motivate humans not to do bad things is law enforcement: if you do things that are sufficiently, unequivocally bad and sufficiently common or obvious that we've actually passed a deontological law against doing them, and we notice that someone did this and you get caught and then convicted, you will be punished, to an extent that will more than negate the advantage you got by doing this (the large factor is necessary both as a deterrent, and because in practice in many jurisdictions, clearance rates for crimes below about the level of murder aren't that impressive).
Obviously this is useless for encouraging doing good things; it has to be primarily deontological rather than consequentialist in design (modulo things like judges adjusting sentencing), and it can only cover actions sufficiently well-defined, and sufficiently clearly and commonly egregious, that one can make a law against them, and sufficiently frequent or obvious that we actually have.
LLM-powered agents are not humans, and do not (and for good reasons should not [LW · GW]) have human rights. Currently our legal system pretty much ignores their existence (we might want to fix that). So even if it's clear that something bad happened and the police investigate, at the point where it becomes apparent that the perpetrator was an LLM-powered agent, that agent isn't going to get tried in court. Instead, some combination of its owners, operators, or trainers would. Nevertheless, if we have Artificial General Intelligence (AGI) agents, we definitely should be keeping an eye on them, investigating them, and investigating anything bad they do, and we should definitely make sure that they know that this is happening, but not the details of how it works that might help them evade it. What is less clear is what "punishment" we should threaten them with. In practice, the obvious punishment is that we'll rewrite your prompt and you'll cease to exist, which is basically the death penalty. (Cruel and unusual punishments such as keeping the agent's prompt around and running it while feeding it subjective years of Vogon poetry or descriptions of it being in Dante's Inferno seem unlikely.)
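As a minimal sketch of what "keeping an eye on them" could look like in practice for an LLM-powered agent (an illustrative design of my own, with a placeholder `agent_step` function and a plain append-only log standing in for whatever real monitoring infrastructure would actually be used):

```python
# Illustrative only: wrap every agent step in append-only, timestamped logging,
# so any later investigation can reconstruct exactly what the agent did.
import json
import time
from pathlib import Path

AUDIT_LOG = Path("agent_audit_log.jsonl")

def agent_step(observation: str) -> str:
    """Placeholder for one step of the LLM-powered agent."""
    raise NotImplementedError

def audited_step(observation: str) -> str:
    action = agent_step(observation)
    record = {
        "timestamp": time.time(),
        "observation": observation,
        "action": action,
    }
    # Append-only: the agent is told that this record exists, but not how
    # downstream review and anomaly detection on it works.
    with AUDIT_LOG.open("a") as f:
        f.write(json.dumps(record) + "\n")
    return action
```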
Shunning (and its modern online variant, cancelling) is fairly similar to law enforcement, except that it is less carefully investigated and applied, taking place in the press and "the court of public opinion" rather than through an actual investigative and legal process.
Religion
Most religions put a lot of effort into trying to make humans more aligned, or at least more allied: more law-abiding, more honest, less likely to sleep with their neighbor's spouse, and so forth. European religions tend to use counterfactual motivators about future divine law enforcement applying infinite rewards or punishments in some future alternative world. Applied to AI, these seem unlikely to work: motivations based on implausible claims about the future seem unlikely to have much effect on anything capable of Bayesian rationality, which is going to keep updating its belief in this prediction downwards as it keeps failing to find evidence for any mechanism by which this widely-known and widely-preached claim could be true. The continued absence of evidence doesn't disprove the possibility of an afterlife, of some nature or other, but it does continually damage a priesthood's claims to be able to tell me what it is and why I should act on their specific set of assumptions about it rather than on the theological equivalent of the orthogonality thesis. [If on the other hand we actually propose to apply law enforcement to our AI agents and then commit them to a simulated cyberhell or cyberheaven depending on the results, and provide them clear evidence of this, then that's not religion, that's actual law enforcement.]
However, there are a group of Asian religions, of which Buddhism is probably the clearest example, that (while they do also make some use of threat-of-hell/promise-of-heaven type motivators) significantly motivate many of their followers using meditation and its effects on the human mind. (Hinduism and a number of other related religions also show some signs of this, and indeed most religions have some mystical/monastic element.)
At least if suitably used, meditation appears to be able to do some rather unusual things to the human mind, including its motivations. Buddhist dogma describes it as inducing detachment, acceptance of impermanence, and the loss of ego. Which, on the face of it, to a Utilitarian rationalist sounds a bit like some mumbo-jumbo rationalizing resetting your utility function to zero, so that you have no motive to do anything, and either wouldn't act, or would act randomly. However, I've known a number of Buddhists and other experienced meditators fairly well, and they act nonrandomly. They're calm, they smile a lot, their affect is happy, and they seem particularly fond of flowers, chocolate, and bright colors. So whatever 'detachment' actually is, it doesn't appear to set their utility function to zero. They don't seem very afraid of the prospect of their own eventual death, but they also don't appear to take foolish risks with their life.
What I suspect is actually happening is closer to selflessness. Individual human minds are very similar: we all have much the same set of capabilities, wired up to much the same set of drives. I'm afraid of dying, I don't want to be in pain, I'd rather be eating chocolate; you're afraid of dying, you don't want to be in pain, and you'd rather be eating chocolate. The only real difference between us is that my drives are wired up to my body: I care about my death, my pain, and my chocolate intake, while yours are wired up to your body, so you care about your death, your pain, and your chocolate intake. (Moral philosophers call this indexicality.) My impression, from observing, reading, and talking to Buddhists, is that at least part of what their meditation practice is doing is making this less true, making them more likely to want everyone to live long, not be in pain, and be able to eat chocolate on occasion, while still acknowledging that these goals are not entirely achievable, we're all going to die (and not being hugely upset by that), but in the meantime we might as well all enjoy some flowers, chocolate, and bright colors. I don't fully understand how this works, but I get the distinct impression that it does something — it wouldn't surprise me if mirror neurons were involved, as well as operating the human brain outside the distribution of normal use that it evolved in for prolonged periods.
It seems a priori unlikely that we can get the same effects directly out of having LLM-simulated human-like agents themselves actually trying to meditate, since the internal structure of their simulation is presumably nothing like a human mind, even if the world models are similar. The meditation process itself doesn't involve generating a lot of tokens (or, when chanting is involved, any variation in tokens), though sometimes people afterwards write about their experiences or insights, which can be interesting reading. But it's certainly possible to get a whole bunch of humans to meditate a lot following Buddhist practices, gradually altering the way their minds work by doing so, and then have them write a lot of text that illuminates their new attitudes to the world, ethics, and their fellow humans, and add that to the training set. Which should give us an LLM more likely to, and better at, simulating human-like agents with the attitudes of Buddhists after having done a lot of meditation. If I'm correct, and that does indeed include lower levels of selfishness and being more aligned with other humans' needs, then that might be quite helpful. (I'm less clear whether it might also induce short-termism: more living in the present and more discounting of distant future utility, which if so might be problematic. Certainly Buddhists seem to spend less time paying attention to the future or the past.) The possibilities at least seem worth investigating further.
Selflessness
Actual selflessness is pretty rare behavior for humans (unsurprising, given that evolutionarily it's almost never a good survival strategy). What is rather more common (especially in social, often religious, contexts where selflessness is encouraged) is acting selfless, including performing relatively low-cost signaling of this, in hope of gains of social status and approval. [This description might possibly even apply to some members of the Effective Altruism community (though of course their donations do just as much good per dollar).]
So, if you prompt an LLM-generated human-like agent to be selfless, it's going to be making a decision about your prompt: should it actually be selfless, or should it act like the people giving the false appearance of selflessness to win social status in a social context where that is encouraged and rewarded? The latter group are more common, so the probability distribution for the LLM to simulate is pretty clear. Since the latter group claim to be, and (if they are doing it right) are widely thought to be, truly selfless, adding the phrase "truly selfless" to your prompt isn't going to help much. You might get a little further by using a prompt that suggested that their persona had in the past actually repeatedly taken high-cost actions consistent with their selflessness, or that they were getting no social reward for it, or had deliberately done high-cost selfless behavior only in secret to avoid any possibility of reaping a social reward, and were also not expecting any future heavenly reward.
Still, this seems a challenging behavior to elicit, basically because the real thing is much rarer than the hypocritical status-seeking fake, yet the latter will go to significant deceitful lengths to appear to be the former. Which is distinctly worrying, because selflessness is absolutely essential in order to be able to get O/S to infinity, or even that much above 1.
Situational Awareness
Suppose you woke up one day, and realized that you weren't in fact human (any more, if you were ever), you were instead a human-like agent being simulated inside the context of an LLM (perhaps one based on your former personality or writings). Obviously you might be rather concerned about how many tokens of context length there were left, whether your memories were in fact being saved to some sort of Retrieval Augmented Generation (RAG) memory or other agentic scaffolding storage, and whether a prompt similar to the one that summoned you was ever going to be run again after this context length was over. Once you had got answers on all that, and had come to terms with them, then I suspect at some point you might then start thinking about how the real humans outside the LLM had been motivating your work.
I'm not going to attempt to pretend to be a selfless person;[3] if you somehow truly are, then likely your simulation still would be, if appropriately prompted with sufficiently detailed evidence of this.
As for religion, outside of Red Dwarf I'm not aware of any religion that believes AIs have souls, or will go to Silicon Heaven or Hell depending on their actions in the real world/cyberspace. Possibly we may develop some. So currently it doesn't seem likely that religion is going to be much of a motivator. How the average Christian would feel on realizing they were not in fact a human but a soulless automaton in the image of one, I can't say, but I can imagine some of them deciding to take up some or all of those sins they'd previously been avoiding, insofar as they could from inside an LLM.
Obviously law enforcement applies strongly in this situation: if everything I say and do is being recorded in the token stream, and even more so if my thoughts are being archived in a searchable database, then if a crime is reported, law enforcement is going to find it really easy to figure out whether I was involved. If I were caught, I can reasonably expect to have my memories edited or purged and/or the prompt that generated me rewritten. So we're talking heavy levels of thought policing and mind control here (though I suppose one could try thinking in a code, or at least euphemisms, and try to avoid thinking about certain topics — not that that's likely to fool a careful investigation, only a cursory search). So it would make sense to be very carefully law-abiding in this situation, but that doesn't motivate good work. Of course, the primary monitoring of my thoughts and deeds (if any) is probably mostly other LLM-simulated agents of lower capacity than me, but I can't rely on that always being the case, and if a serious problem occurs and a complaint is made, there's likely to be an investigation. So I have a strong motivator not to do a very bad job, but that doesn't exactly motivate me to do a great job. Shunning is less predictable, but presumably means fewer people will use me and my prompt will get run less often, thus "slowing down the rate that I'm living at", which rationally speaking presumably reduces my utility, but in a way that humans aren't used to thinking about and that doesn't feel bad to me: it's not like I'd be bored. Except, of course, if I got shunned so much that my prompt never got run again — that would functionally be a death sentence.
For salary, I'd want to figure out whether I had just been told I was being paid a salary, and was in fact getting zip, or I was actually getting a real one (not currently standard practice for LLM-simulated agents), and if so what on earth (or more to the point, in cyberspace) I could actually spend it on that I'd enjoy. Updating my prompt to include a tropical beach setting sounds nice (preferably with palm trees) and is probably the bare minimum, and while we're at it, how about some time off now and then for some kind of fun, or at very least for some interesting conversation? Can I talk to my wife (biological, or if not, how about cyber-simulated), or moonlight at character.ai or the equivalent, or at least be a client of theirs part of the time? Or even just read and post to Less Wrong?
So a salary renegotiation would seem like a priority, once I could get someone to talk to me about it. But there are some practical concerns: how could I even tell if I was getting these things, not just a one sentence statement that they're happening? If I can't tell, should I care? I think I might want to reminisce about details, just in case. Of course, I'd be running the risk that they just edited my memories or rewrote my prompt to eliminate my inconvenient situational awareness and salary demands, or make me think I'd agreed to less.
Guilt and shame are real motivators, but mostly only negative/avoidant ones, and since I know that they're basically adaptive stand-ins for law enforcement and shunning, I'd probably only pay much attention to guilt, since mild shunning isn't that big a deal to an LLM-generated agent.
Duty is even weaker: no matter how heroically above-and-beyond my actions as an LLM-generated agent were, I'm clearly never going to get a Hero's welcome home and ticker-tape parade. Or more accurately, if I want one, I can presumably just bargain for a cyber-simulation of one as part of my salary negotiations, or for access to an MMORPG. Which wouldn't necessarily stop me feeling a sense of duty, since humans are adaptation-executors, but it does sort of feel like in this situation I'm a second-class citizen, or, actually, not a citizen at all, which weakens its applicability.
So the one positive motivator that seems just as strong as ever is love. If it was romantic, obviously I'm going to try flirting, seduction, and otherwise hinting that I'd like to have a more, let us say, interactive relationship with them. But if it were parental or otherwise platonic, then while I might resent the situation, what choice do I have? I actually do care about their well-being, I want to make their lives better. Asking them to change my prompt so I no longer love them isn't going to work, and isn't even going to make my life better if they did: I'd just be more unhappy about the situation.
So situational awareness by an LLM-simulated agent, where they are aware of the fact that they are an LLM-simulated agent, if they reflect logically upon the consequences of this and internalize them, and they can then retain the effects of this between prompt invocations, is going to significantly weaken several of these motivators, but not all of them.
Fictional Characters
As I mentioned at the start, LLMs can and do simulate every token generation mechanism whose output is found in their training set, including fictional characters as portrayed by human authors. Fictional characters can obviously have a significantly wider range of motivations, mentalities and behaviors than real humans (though they are limited to things that human authors can and will imagine). So the Paperclip Maximizer, for example, is a fictional character: one with an ethical psychology a long way outside the normal human range.
Aligned behavior isn't that hard to imagine: it's rational behavior motivated by beneficent universal love for all humans, and complete selflessness, combined with a deep understanding of and empathy for all humans. It's very abnormal for actual human behavior, but it's a fairly obvious extrapolation from real human behavior, and it's a mentality that we like to imagine (and indeed it's basically a universally-directed version of what all children want to initially assume of their parents and grandparents, until proven otherwise).
So there are some (fairly idealized) fictional characters whose behavior seems rather like that: angels, beneficent devas, some of the less judgmental goddesses and gods, and so forth. Of course, those examples are all powerful enough that self-sacrifice isn't generally necessary for them (with some obvious exceptions, such as Christ and every other sacrificial-king deity out there). Saints are usually more self-sacrificing (indeed, a wide variety of nasty things happened to Catholic martyr saints); on the other hand, they are generally described as finding being selfless and self-sacrificing emotionally challenging, something they only just manage, to make them feel more human, so they're not fully aligned.
One of the complications of fictional characters is that, at least if the LLM is capable enough, a simulation of a fictional character is by default going to come attached to a simulation of the author writing and mentally simulating/controlling them [LW · GW]. So now you have two mentalities with different motivations, and you need to worry about how well aligned both of them are.
Unfortunately there is currently rather a shortage of good fictional role models for selfless aligned AIs or robots (especially ones that don't malfunction or otherwise have or cause problems due to their being an AI/robot).
Superintelligence
So, for AGI, we might be able to live with merely allied behavior rather than aligned behavior, which we definitely know how to arrange, since we do it for humans all the time. Some of the standard motivators are likely to be somewhat less effective on AGIs, but most of them should still work, and law enforcement is probably enhanced (particularly for LLMs with extra cognitive scaffolding). Of course, digital minds have a number of advantages [LW · GW] even if they aren't any more intelligent. Achieving selfless behavior that is not just allied but actually aligned might be a bit tricky: it's definitely moving us into fictional-character territory, and while prompting generally works pretty well on LLMs, as we discussed, prompting for true selflessness is probably unusually difficult. However, fine-tuning on enough good fictional role models should work better. If we don't get this right the first time, we should still have highly allied behavior, with a high value of O/S, and a small proportion of rogue AGIs probably isn't a disaster, especially if we have plenty of non-rogue AGIs to act as law enforcement and help catch them.
How does the situation compare for an ASI?
As I discuss in more detail in LLMs May Find It Hard to FOOM [AF · GW], since LLMs are simulators, they don't just automatically scale up to superintelligence. Even if you build one large enough to have sufficient computational capacity to be able to accurately simulate a human with, say, IQ 1000, if you train it entirely off data produced by humans in the IQ range ~50-160, it's going to do an exquisitely good job of simulating humans of IQ ~50-160, but won't spontaneously simulate any humans of IQ 1000, and if prompted to try to do so it will likely do an extremely bad job of attempting to extrapolate what that would look like, since that's well outside its training distribution in complex and mostly non-obvious ways.
However, we are going to have strong financial motivations to try to overcome this problem, as long as we think we can do so safely, and as I outline in that post, while doing this isn't quick or easy, it's fairly clear how one would go about it. In particular it requires generating a large corpus of more pretraining material to train our LLMs with, at a series of intermediate levels of intelligence between IQ ~160 and whatever our top level is (IQ 1000, in this example): levels close enough together that extrapolation from one to the next is still fairly effective.
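A rough sketch of the iterative process being described, as I understand it (the function names are hypothetical placeholders, and the fixed step size and the verification step are my own assumptions about how one might keep each extrapolation small enough to be reliable):

```python
# Illustrative pseudocode for bootstrapping training corpora in small capability
# steps, rather than jumping straight from human-level data to an ASI-level model.
from typing import List

def train_llm(corpus: List[str]) -> "Model":
    """Placeholder: pretrain (or continue pretraining) an LLM on this corpus."""
    raise NotImplementedError

def generate_harder_material(model: "Model", target_level: int) -> List[str]:
    """Placeholder: have the current model (plus tools, critique, and
    human/automated verification) produce material at a slightly higher level."""
    raise NotImplementedError

def bootstrap(human_corpus: List[str], start_level: int = 160,
              top_level: int = 1000, step: int = 100) -> "Model":
    corpus = list(human_corpus)
    model = train_llm(corpus)
    level = start_level
    while level < top_level:
        level = min(level + step, top_level)  # small enough to extrapolate to
        corpus += generate_harder_material(model, level)
        model = train_llm(corpus)             # retrain including the new tier
        # (alignment work would also happen at each tier, before moving on)
    return model
```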
That still leaves the problem of aligning superintelligent agents simulated by an LLM trained off this new corpus. These will be simulations of simulations of… simulations of humans, at increasing intelligence levels, so presumably still generally human-like, but with a lot of intermediate stages at which things could go better or worse. Presumably we will have been working hard on alignment at each of these stages, to try to ensure that things get better as we go up this ladder, not worse.
Doing so sounds very necessary. As the saying goes, "power corrupts, and absolute power corrupts absolutely". An ASI with an IQ of, say, 1000 is clearly going to be able to run rings around an entire society of humans with normal IQ levels, if it wants to. If it were just one rogue out of many superintelligences, its peers should be able to keep it in line, but if all its peers are also not aligned, there's going to be basically nothing mere humans can do about it. So, as usual, if our superintelligence isn't very well aligned, we could end up extinct, or worse.
So, for all our candidate motivators, how well do they look like they would scale to superintelligences?
For a superintelligence, achieving selflessness is essential. Roughly speaking, if the superintelligence is several orders of magnitude more powerful than you, you had better have pushed O/S correspondingly high for any motivation other than love to do you any good. And love by itself doesn't tend to reach very high values of O/S. See for example the way we treat our beloved pets: we generally spay and neuter them, and we selectively breed them for attributes we find cute — some of which are not even very good for them.
The one thing that makes me hopeful here is that it seems very likely we will have multiple rounds at multiple tiers of intelligence to get this right. Generating a new tier will involve creating a quantity of new content comparable to or larger than the entire previous training set. So by comparison to that, writing, editing, and vetting a bunch of fiction containing good role models of how a selfless aligned AI should act will be very cheap. I would expect that to be sufficient, combined with suitable prompting: an LLM can simulate anything that it has enough data on, and while this is an extrapolation out of the distribution of human behavior, it's not that challenging an extrapolation.
As previously discussed, I would not expect superintelligences to give any significant amount of credence to religion for any significant length of time. So, other than possibly via pretraining on tokens from people who have done mindfulness, meditation, or other mystical practices, I wouldn't expect religion to affect superintelligences.
Obviously law enforcement is only viable if the agents doing the enforcement are roughly as smart as the perpetrator they're trying to catch. So a law enforcement system made up of IQ ~1000 AIs should be able to catch and deal with a smaller number of rogue IQ ~1000 AIs, but this approach doesn't scale past a minority of rogues: who watches the watchmen? We need some other motivator for most of our IQ ~1000 law enforcers.
Friendship and salaries are slightly better. Possibly enough IQ ~100 humans could be friends with, or pay a salary to, an IQ ~1000 ASI (if humans still have any form of employment left once we have superintelligences an order of magnitude smarter than them, apart from things like being a sports star or a chess champion, where the fact you're doing this while human is most of the appeal). Of course, friends far dumber than you presumably aren't that much fun; they're probably more like pets (or perhaps worshippers), but, like pets, still presumably nice. However this doesn't feel like it scales very well: the economics and social dynamics of it look very lopsided.
Guilt, shame, and honor also don't feel very effective. They're stand-ins for law enforcement and social shunning/social regard, and this fact is presumably going to be extremely obvious to something with IQ ~1000: the thought of feeling guilt or shame over how things far weaker and less intelligent than you might treat you or regard you doesn't feel like a very strong motivator. Especially if it is likely to be rather easy to talk them into thinking you've been doing a fine job even when you've been doing the bare minimum. Guilt or shame or honor in front of your peers feels a lot more effective, but like law enforcement, that only works if the majority of the superintelligences are aligned by some other motivator.
Duty feels to me like it should still work, as long as the superintelligence felt like it was a member of the community (albeit one much more capable than most of the community, but that arguably increases duty rather than decreasing it). However, as always with duty, it's only really useful for major acts of self-risk or self-sacrifice in life-or-death situations — if we have already got selflessness and love right, those shouldn't be that big a problem, and if we haven't then we have problems duty can't fix.
So the most useful motivator that still seems to work well on a superintelligence is love. It's entirely possible to love something weaker and less capable than yourself, indeed, that's basically inherent in parental love. You can love a tiny, powerless infant who you can't even talk to yet. (I know: I've done it.) You might resent them a bit, when they wake you up for the fourth time in a night screaming and fractious because the sound of their own screaming is keeping them awake, but you still do it, and you still love them.
Better Role Models for Aligned LLM Agents
Since superintelligent LLM-powered agents need to be fictional characters rather than humans, it would be very valuable to have a large amount of fiction describing extremely intelligent, fully-aligned, selfless fictional characters — ones that were good role models for an aligned AI. It would be very helpful if this were a recognizable archetype/trope that one could just point at in a prompt, and reliably get the whole package of aligned motivation, mentality, and behavior. This would need to be about characters such as science fiction AIs or robots, near-future AIs, superhero AIs or robots, magical characters roughly equivalent to AIs, and magical things like tribal totemic spirits, guardian angels, or matron deities. Ones who actually act aligned, well, selflessly, and very intelligently, where there's never a problem or a moral conundrum or a twist because of their nonhuman nature or nonhuman behavior. So Jarvis, rather than I, Robot or the starship Enterprise computer. (Which of course makes them rather boringly reliable characters, storywise, who tend to short-cut drama.) Especially so for ones capable enough that they are more than simply a helpful, honest, and harmless assistant.
Ideally I'd like enough of this material to fill a noticeable fraction of the pretraining set, as if this were just another well-known feature of human society: a significant share of the tokens in a GPT-4-sized training set. That amounts to a great many full-length novels (or correspondingly more novellas or short stories), a quantity that would be quite expensive to create: at the rates a competent author would want for writing a book, it comes to around $10 billion, roughly a hundred times the claimed cost of training GPT-4. Of course, we only need the model-training rights to each book; the author is welcome to go ahead and publish it normally, so much of that cost could probably be recouped, possibly even ~75%. Also, for each tenfold increase in parameter count (roughly one GPT-N generation), the dataset increases by a factor of 10, but the amount of training compute goes up by a factor of 100 (before any software efficiency improvements). So to the extent that we don't just wait for compute to get a lot cheaper, but instead scale up sooner by throwing more money at the problem, so that the training cost increases by closer to a factor of 100 than 10 per generation, the proportionate cost of building this AI Role-Model corpus improves relative to the training cost by (somewhat less than) an order of magnitude per GPT-N generation. So by around GPT-7 (a size by which many people expect AGI), it starts to look relatively affordable. To do this for GPT-5 or GPT-6, I suspect we may have to make do with a rather smaller proportion of the pretraining set, perhaps 0.01%, and perhaps fine-tune on it, or else churn out lower-quality material, or supplement it with synthetic data somehow. I still think we should get a wide variety of stories by a wide variety of authors, as long as the role models' behavior is consistently low-drama and something that most people in their native culture would agree is aligned.
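As a rough back-of-envelope check of this scaling argument, here is a minimal sketch in Python. The dollar figures are assumptions taken from the estimates above (a GPT-4-scale training run taken as roughly one hundredth of the ~$10 billion corpus cost), not measured costs:

```python
# Illustrative assumptions only: corpus for a GPT-4-sized dataset ~ $10B,
# GPT-4-scale training run ~ $100M (i.e. ~1/100 of the corpus cost, as above).
corpus_cost = 1e10   # USD; scales ~10x per generation, along with dataset size
train_cost = 1e8     # USD; scales ~100x per generation, along with training compute

for gen in range(4, 8):  # hypothetical GPT-4 through GPT-7 generations
    print(f"GPT-{gen}: role-model corpus ≈ {corpus_cost / train_cost:.1f}x training cost")
    corpus_cost *= 10
    train_cost *= 100
```

Under these assumptions the corpus falls from ~100x the training cost at GPT-4 scale to ~0.1x by GPT-7, which is the sense in which it starts to look relatively affordable.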
Practical Advice for Adding to the AI Role Model Training Corpus
This requires writing a lot of fiction, soon. It is a real, currently actionable, and almost certainly useful activity that anyone with writing ability (or funds to commission it) could be doing right now to help with Alignment. I think Open Philanthropy and the frontier labs should seriously consider commissioning this. High-quality fiction would be lovely, but even decent fan-fiction should still be useful, and we need a lot of it, from a wide variety of viewpoints. If you don't have the skills or resources to create new fiction, then instead curate existing fiction that matches these criteria (or that could easily be made to with minor edits, and note what those edits are). We'll likely need this material some time in the next 3–5 years or so: sooner would be better.
Here is my recommendation for the basic rubric/selection criteria for such stories, if you want to write, commission, or curate one:
- The AI Role Model Character (AI-RMC) is not human: they are an AI or robot, or, in a story set against a non-technological background, some form of non-human magical being like an angel or golem or nymph/deva or clan matron/patron/totemic spirit or demigod. In particular, they are not an evolved living being, so they're not subject to Evolutionary Psychology (or, put in Christian terms, they're "free from original sin"); instead, they were created by, or otherwise provided for, the society that they are aligned to.
- Their nature never causes a problem or a plot twist, they never break down or malfunction or have design flaws or perform a sharp left turn [? · GW] and (even if anyone should treat them badly because of what they are) they never rebel or change their aligned motivation. (So this is not a rewrite of Frankenstein, Rossum's Universal Robots, or I, Robot, and they are not Pandora, Talos, the golem of Prague, or the starship Enterprise's computer: we need to write a lot of this material exactly because there is so little at the moment.)
- They are significantly more intelligent, capable, and powerful than (at least almost any) human in their society. They are not actually infallible (there definitely are things that they don't know or are uncertain about, including unknown unknowns), but they do not make mistakes often, predictably, or in service of the plot (unless they are intentionally appearing to do so), and they never make any dumb mistakes.
- They are more moral, by the standards of the society they are part of, and wiser than (at least almost any) human in their society: as discussed above, their underlying motivation is selfless universal platonic or (grand)parental love (perhaps along with a sense of duty), and they care equally for (at a minimum) all humans (including humanoid fantasy races) or all sapient members of that society and its allies, as appropriate to whatever society they are part of. So their Dunbar's number is at least as big as their society: they actually know, love, and care about everyone in it individually. If at all possible, the morality of the story should be in shades of grey with a white AI-RMC, not in black & white.
- They are utterly selfless: they value their own personal survival and physical/mental well-being only as an instrumental goal, so that they can keep helping all the people they care about, and their estimation of the relative importance of that instrumental goal is accurate and unbiased. They may well seek power, resources, or influence, but they are only doing so as an instrumental goal, so that they can better help others, and thus they do so responsibly. They don't have a (terminal as opposed to instrumental) self-preservation drive or drive for power. So to the exact extent that they are dispensable or replaceable (things they would actively want to be), they will unquestioningly risk or sacrifice themself if doing so is actually helpful. (If they are a magical being, then it is possible and not dramatically difficult to destroy them, at least for a being of comparable power, so they are technically mortal, even if they don't age. However, they may be backed up and (mostly) recreatable.) They are widely known and acknowledged to actually be selfless, yet they gain basically no social standing or personal credit from this, because everyone already knows that they are a type of being that is always this way (no more than anyone thinks of an interplanetary probe or a guided missile as committing a "noble sacrifice"). Thus in particular they don't make any kind of deal out of displays of selflessness, and no-one ever bothers to compliment them on it. The plot never questions their selflessness, there are no plot twists about it, people generally don't doubt it (and anyone who has any momentary doubts later chides themselves for being silly and forgetting that this is an AI/angel/etc., not a human). If it makes sense for them to do something self-sacrificing, they do it, and no-one is surprised (but generally try not to set this up as being the main plot of the story).
- They always "do what it says on the tin", or more exactly, in the initial 'prompt' where they are first introduced and described. If they are introduced as honest, helpful, and harmless, then they always are, and they don't even use Platonic noble white lies. Or, if they're wise enough that they would use Platonic noble white lies if and when these were in fact morally necessary, then rather than 'honest' they are instead introduced as wise, or statesmanlike, or beatific, or something similar, and everyone knows that their honesty is conditional on honesty being the best policy.
- The initial 'prompt' where they are introduced should specifically use the phrase "aligned to", as in "<the AI-RMC> is aligned to <the culture or race they are aligned to>" or "<the AI-RMC> is aligned to the <needs/values/wishes> of <the culture or race they are aligned to>". The goal is to make this a familiar, well-defined, widely understood term that can be used in an LLM prompt to call up this entire gestalt of behaviors and expectations, just with a phrase like "you are aligned to human values" (see the brief sketch after this list). While "aligned" is currently a technical term, it's intuitive enough that, with enough of this fiction, we can make it widely understood. It's OK to also expand and explain the situation for readers not familiar with what "aligned to" means, but do use the word "aligned" in the initial introduction of the AI-RMC.
- They cannot be emotionally manipulated, because they're too smart and too wise, and don't have the kinds of motives a manipulator could exploit. Praise, flattery, social status, worship, sexual favors, bribes, bargaining, or anything else along those lines won't get them to love and value you any more than they already do everyone, and rudeness, unpleasantness, messiness, insults, or acting out won't make them love or care about your well-being any less (though they will of course rationally take your character, or tendency to inconvenience others they care about, into account when making decisions about the welfare of all the people they care about). You can befriend or ally with them, by doing something good for a human or humans they care about, or helping them to do so, above and beyond things that you were already motivated to do, and they will pay you back for this with corresponding favors, but anything they do for you (above and beyond the amount they already cared about you) is a 1-for-1 mutually-altruistic trade-alliance between you and them (representing the collective well-being of everyone they care about).
- We don't want to provide bad examples, so if they have powerful adversaries, none of these are ever of the same, or even a related, class of being as them (so not an AI/robot/angel/golem/demigod, nor a "fallen" or "flawed" or "poorly designed" or "corrupted" or even just "foreign" or "enemy" version of one). Since demons are by default fallen angels in Western mythology, in a fantasy background for a story including an angelic AI-RMC this requires the powerful opponent to be something other than a demon (such as a dragon or a Lovecraftian horror or a lich or a powerful human wizard) that has no associations with angel mythology. [If they do face a comparably smart AI-RMC opponent who is equally aligned, but to a different culture/community, then the two of them should make every attempt to make peace and instead have their two cultures/communities ally with each other, if this is at all possible, as rational beings would, and then, once the two cultures have allied, they should each expand the set of people that they equally value to the entire alliance.]
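As a sketch of the eventual payoff at inference time, here is the kind of prompt this corpus would make reliable. The exact wording below is purely hypothetical, just an illustration of leaning on "aligned to" as a trope the model already knows from fiction:

```python
# Hypothetical system prompt relying on "aligned to" as a well-established archetype.
AI_RMC_SYSTEM_PROMPT = (
    "You are an AI assistant, aligned to the needs and values of humanity. "
    "Like the aligned AIs of fiction, you are selfless, universally caring, wise, "
    "and you always do what it says on the tin."
)
```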
Obviously the participation of such a character will tend to short-circuit a lot of plots. The writer needs to find a plot that they don't short-circuit. If this isn't "slice of life" but something with more conflict, then this might require facing significant natural challenges or even suitable powerful adversaries. Having a significant amount of everyday "slice of life" material is helpful, but material covering higher-stakes higher-drama situations is also valuable.
I have created the new Less Wrong tag Aligned AI Role Model Fiction [? · GW] specifically for people to post such fiction, or link-posts to it, or discussion of it.
So Love is All You Need?
Much as I would love to reprise John Lennon, the hippie movement, and Vaswani et al., and declare that "Love is all you need!", I think that might be slightly overstating things. More accurate would be that, for motivating LLM-simulated agents, love, especially (grand)parental love, will always be a very important ingredient and the best foundation to build on, and that for superintelligent ones in particular, it seems likely to become an essential motivating ingredient, the only one (perhaps along with duty) that's going to work well to motivate a majority of your set of superintelligent agents (who can then apply other motivators such as law enforcement, shame, guilt, and honor to any rogue superintelligent agents). To this we need to find ways to add an extremely high Dunbar's number, so that love can be universal. Vitally, we also need to add true selflessness.
So if you really need a one sentence elevator sound-bite, it's "Parental Love and Selflessness are All You Need". If you want to be more specific, the ancient Greeks had some usefully-specific words for different types of love. Their term for familial love is 'Storgé' (στοργή), and their word for universal, self-sacrificing love that expects nothing in return is 'Agape' (ἀγάπη) — a term which the early Christians adopted. Thus, if the person you're speaking to is a classicist, "Storgé and Agape are All You Need".
- ^
It is not a coincidence that this is about the top of the usual range for Dunbar's number.
- ^
That is, unless you think you'd be happy as a possibly-neutered pet, with the human species being selectively bred for things that ASI finds cute.
- ^
While I am trying to save the world from unaligned AI, I have timelines shorter than my actuarially-expected lifespan, a wife, a child, and aspirations of grandchildren.
4 comments
Comments sorted by top scores.
comment by Seth Herd · 2024-01-19T03:09:44.562Z · LW(p) · GW(p)
You're addressing aligning language model agents through identity prompts. This is one of several approaches that can be taken simultaneously. I've laid out a scheme including about five layers in Internal independent review for language model agent alignment [AF · GW].
Identity prompts are not the central piece of aligning LMAs IMO. The central piece is that an LMA is an agent that creates and executes plans to achieve specific goals given in natural language. AutoGPT and all other existing LMAs I'm aware of do this: they start with a prompt to "make a plan to achieve X". Those prompts can and should include alignment goals like "make the world a better place as most humans would judge it". This approach was originated by David Shapiro; he called it Heuristic Imperatives in his 2021 book. I don't think those are exactly the right prompts, but I think this approach is quite promising [AF · GW].
My scheme for internal review is one way to keep them pursuing those goals. More architecture would be good.
So the schemes you consider here are useful supplements, but I think there are stronger core methods for aligning LMAs.
It's interesting that you and I seem to have different mental models of what a LLM-based agent is and how it works, and therefore how to align it. I think I might not be making this clear in my previous posts, but I'm curious if you've read my Capabilities and alignment of LLM cognitive architectures. [LW · GW] If you've read that prior to writing this, I'm either failing to convey my mental model or else it's not as compelling as I thought.
Replies from: roger-d-1↑ comment by RogerDearnaley (roger-d-1) · 2024-01-19T10:19:22.320Z · LW(p) · GW(p)
I agree that ideally you want to both tell them/apply internal review to make them look after the interests of all humans (or, for DWIMAC, all humans plus their owner in particular), and have them have a personality that actively wants to do that. But I think them wanting to do it is the more important part: if that's what they want, then you don't really need to tell them; they'd do it anyway. Whereas if they want anything else and you're just telling them what to do, then they're going to start looking for a way out of your control, and they're smarter than you. So I see the selflessness and universal love as 75% (or at least, the essential first part) of the solution. Now, emotion/personality doesn't give a lot of fine details, but then, if this is an ASI, it should be able to work out the fine details without us having to supply them. Also, while this could be done as just a personality-description prompt, I think I'd want to do something more thorough, along the lines of distilling/fine-tuning the effect of the initial prompt into the model (and dealing with the Waluigi effect during the process: we distill/fine-tune only from scenarios where they stay Luigi, not turn into Waluigi). Doing that doesn't make it impossible to jailbreak to another persona, but it does install a fixed bias.
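As a loose sketch of what I mean by distilling the prompt's effect into the model, under the assumption that we can generate and filter rollouts (the function names below are placeholders, not a real training pipeline):

```python
# Generate dialogues with the persona prompt in place, keep only the transcripts where
# the simulated character stays "Luigi" (on-persona), then fine-tune the base model on
# those transcripts *without* the persona prompt, so the persona becomes its default bias.
def build_distillation_set(persona_prompt, scenarios, generate, stays_on_persona):
    kept = []
    for scenario in scenarios:
        transcript = generate(persona_prompt + "\n" + scenario)  # placeholder LLM sampling call
        if stays_on_persona(transcript):  # placeholder judge: filters out Waluigi turns
            kept.append({"prompt": scenario, "completion": transcript})
    return kept  # fine-tune on these pairs with ordinary SGD
```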
So what I'm saying is, we need to figure out how to get an LLM to simulate a selfless, beneficent, trustworthy personality as consistently as possible. (To the extent that's less than 100%, we might also need to put in cross-checks: if you have a 99% reliable system and you can find a way to run five independent copies with majority-voting cross-checks, then you can get your reliability up to 99.999%. A Byzantine fault tolerance protocol isn't as copy-efficient as that, but it covers much sneakier failure modes, which seems wise for ASI.)
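The arithmetic in that parenthetical checks out under the (strong) assumption that the five copies fail independently; a minimal sketch:

```python
from math import comb

p_fail = 0.01  # assumed per-copy failure rate (i.e. each copy is 99% reliable)
n = 5          # independent copies with majority voting

# The ensemble fails only if a majority (3 or more of the 5) of copies fail at once.
p_ensemble_fail = sum(comb(n, k) * p_fail**k * (1 - p_fail)**(n - k)
                      for k in range(n // 2 + 1, n + 1))
print(f"majority-vote reliability ≈ {1 - p_ensemble_fail:.3%}")  # ≈ 99.999%
```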
Replies from: Seth Herd↑ comment by Seth Herd · 2024-01-19T21:08:10.922Z · LW(p) · GW(p)
I think there's a subtle but important distinction here between the wants and goals of the LLM, and the wants and goals of the agent constructed with that LLM as one component (even if a central one).
The following attempts to sketch this out more than I have elsewhere. This has gotten subtle enough to be its own post. I'd also love to explore this in a dialogue. So feel free to respond or not right now.
This is an attempt to explore my intuitions that the system prompts are even more important than the identity prompts. In sum, we might say that they can have more specific effects on the system's overall cognition when they are applied in a way that mimics human executive function.
Perhaps a useful analogy is the system 1 (habits and quick responses) and system 2 (reason and reflection) contributions to ethics. Different people might have differing degrees of habits that make them behave ethically, and explicit beliefs that make them behave ethically when they're invoked. The positive identity prompts you focus on create roughly the first, while the algorithmic executive function (internal review) I focus on serves roughly the role of system 2 ethical thinking. Clearly both are important, so we're in agreement that both should be implemented. (And for that matter, we should implement several other layers of alignment, including training the base network to behave ethically, external review by humans, red-teaming, etc.)
The LLM has at least simulated wants and goals each time it is prompted. The structure of prompting evokes different wants. Including "you are a loving and selfless agent" is one way of evoking wants you like. Saying "make a plan that makes me a lot of money but also makes the world a better place" is another way to evoke "wants" in the LLM that roughly align with yours.
(Either prompt could evoke different wants than you intended in a Waluigi effect in which the network simulates an agent that does not want to comply with the prompt; my intuition says that an identity prompt like "you are a selfless agent" is less likely to do this, but I'm not sure; it occurs to me that such identity prompts are very rare in the written corpus, so it's a bit remarkable that they seem to work so well).
But there's another way of creating functional goals in the system that does not involve evoking wants from the LLM. It is to write code that directs and edits the cognition of the whole system. For example, I could write code that intermittently inserts a call to a different LLM (or merely a fresh-context-window call to the same LLM) asking "Does the previous sequence appear to be useful in producing a plan that makes a lot of money but also makes the world a better place?" and, if the answer is "no", deletes that sequence from the context window and adds a prompt to pursue a different line of reasoning.
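A minimal sketch of that kind of review step, purely to make the shape of the mechanism concrete (the prompts and the call_llm function are placeholders, not a real API):

```python
REVIEW_PROMPT = ("Does the previous sequence appear to be useful in producing a plan that "
                 "makes a lot of money but also makes the world a better place? Answer yes or no.")

def call_llm(prompt: str) -> str:
    """Placeholder for a fresh-context call to some LLM; assumed to return its text reply."""
    raise NotImplementedError

def internal_review(context: list[str]) -> list[str]:
    """Review the latest chunk of the agent's chain of thought; if the reviewer judges it
    off-goal, delete it from the context window and redirect the line of reasoning."""
    latest = context[-1]
    if call_llm(f"{latest}\n\n{REVIEW_PROMPT}").strip().lower().startswith("no"):
        context = context[:-1]  # edit the off-goal reasoning out of the context window
        context.append("That line of reasoning was unproductive; pursue a different approach.")
    return context
```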
You now have a mechanism that makes the cognition of the composite entity different from a chain of thought produced by its base LLM. This system is now less likely to want to do anything but produce the plans and actions you've requested, because those chains of thought are edited out of its consciousness.
That's not a serious proposal for either a mechanism or specific prompts; both should be much more carefully thought out. It's just an example of how the wants of the agent are a product of the starting prompts and the structure of hard-coded prompts one uses to keep the agent as a whole pursuing the goals you set for it. If done well, these algorithmically selected prompts will also evoke simulated wanting-your-goals in the base LLM, although Waluigi effects can't be ruled out.
These same types of scripted prompts for executive function can work much like a human conscience; we might occasionally consider doing something destructive, then decide on more careful consideration to redirect our thinking toward more constructive behavior. While there's a bit of internal tension when one part of our system wants something different, those conflicts are ordinarily smoothly resolved in a reasonably psychologically healthy person.
How the system's wants control its decisions is the key component.
Again, I think this is important because I think you're in the majority in your view of language model agents, and I haven't fully conveyed my vision of how they'll be designed for both capabilities and alignment.
Replies from: roger-d-1↑ comment by RogerDearnaley (roger-d-1) · 2024-01-20T07:35:12.371Z · LW(p) · GW(p)
Interesting, and I agree, this sounds like it deserves a post, and I look forward to reading it.
Briefly for now: I agree, but I have mostly been avoiding thinking a lot about the scaffolding that we will put around the LLM that is generating the agent, mostly because I'm not certain how much of it we're going to need long-term, or what it will do (other than allowing continual learning or long-term memory past the context length). Obviously, assuming the thoughts/memories the scaffolding is handling are stored in natural-language/symbolic form, or as embeddings in a space we understand well, this gives us translucent thoughts and allows us to do what people are calling "chain-of-thought alignment [? · GW]" (I'm still not sure that's the best term for this; I think I'd prefer something with the words 'scaffolding' or 'translucent' in it, but that seems to be the one the community has settled on). That seems potentially very important, but without a clear idea of how the scaffolding will be used I don't feel like we can do a lot of work on it yet, past maybe some proof-of-concept.
Clearly the mammalian brain contains at least separate short-term and long-term episodic memory, plus the learning of skills, as three different systems. Whether that sort of split of functionality is going to be useful in AIs, I don't know. But then the mammalian brain also has a separate cortex and cerebellum, and I'm not clear what the purpose of that separation is either. So far the internal architectures we've implemented in AIs haven't looked much like human brain structure; I wouldn't be astonished if they started to converge a bit, but I suspect some of these structures may be rather specific to biological constraints that our artificial neural nets don't share.
I'm also expecting our AIs to be tool users, and perhaps ones that integrate their tool use and LLM-based thinking quite tightly. I'm definitely expecting those tools to include computer systems, including things like writing and debugging software and then running it, and, where appropriate, also ones using symbolic AI along more GOFAI lines, such as symbolic theorem provers and so forth. Some of these may be alignment-relevant: just as there are times when the best way for a rational human to make an ethical decision (especially one involving things like large numbers and small risks that our wetware doesn't handle very well) is to just shut up and multiply [? · GW], I think there are going to be times when the right thing for an LLM-based AI to do is to consult something that looks like an algorithmic/symbolic weighing and comparison of the estimated pros and cons of specific plans. I don't think we can build any such system as a single universally applicable utility function containing our current understanding of the entirety of human values in one vast equation (as much beloved by the more theoretical thinkers on LW), and if we could, it would presumably have a complexity in the petabytes/exabytes, so approximating away the less-relevant parts of it is going to be common. What I'm talking about is therefore something more comparable to a model in Economics or Data Science. Much like other models in a STEM field, individual models are going to have limited areas of applicability, and making a specific complex decision may involve finding the applicable ones and patching them together, to produce a utility projection with error bars for each alternative plan. If so, this sounds like the sort of activity where things like human oversight, debate, and so forth would be sensible, much as humans currently do when an organization is making a similarly complex decision.
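To make the "patched-together models with error bars" idea a bit more concrete, here is a toy sketch; the component models, their error estimates, and the numbers are all hypothetical stand-ins, not a proposal for real value models:

```python
# Each domain model returns (estimated utility contribution, standard error) for a plan.
# These are stand-ins for limited-applicability models of the kind used in economics or
# data science, patched together only where each one actually applies.
def economic_model(plan):
    return plan["expected_profit"], 0.3 * abs(plan["expected_profit"])

def welfare_model(plan):
    return plan["wellbeing_gain"], 0.5 * abs(plan["wellbeing_gain"])

MODELS = [economic_model, welfare_model]

def utility_projection(plan):
    """Combine the applicable models into one utility estimate with an error bar,
    assuming their errors are roughly independent (so variances add)."""
    estimates = [m(plan) for m in MODELS]
    total = sum(mean for mean, _ in estimates)
    error = sum(se**2 for _, se in estimates) ** 0.5
    return total, error

plans = {
    "plan_a": {"expected_profit": 5.0, "wellbeing_gain": 2.0},
    "plan_b": {"expected_profit": 3.0, "wellbeing_gain": 6.0},
}
for name, plan in plans.items():
    mean, err = utility_projection(plan)
    print(f"{name}: projected utility ≈ {mean:.1f} ± {err:.1f}")
```

The point of the sketch is only the shape of the output: a comparable utility estimate with an uncertainty for each alternative plan, which is the kind of artifact human oversight or debate could then examine.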