Why Aligning an LLM is Hard, and How to Make it Easier
post by RogerDearnaley (roger-d-1) · 2025-01-23T06:44:04.048Z · LW · GW · 2 comments
Where the challenge of aligning an LLM-based AI comes from, and the obvious solution.
Evolutionary Psychology is the Root Cause
LLMs are pre-trained using stochastic gradient descent on very large amounts of human-produced text, normally drawn from the web, books, journal articles, and so forth. A pre-trained LLM has learned in detail how to simulate all the different human text-generation processes that produced this text — everything from a cooperatively edited Wikipedia article to shit-postings. We are thus 'distilling' human intelligence into the pre-trained LLM.[1]
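To make this concrete, here is a deliberately minimal sketch (assuming a PyTorch-style model that maps token IDs to next-token logits, and an optimizer already built for it) of what this pre-training amounts to: plain next-token cross-entropy on the tokens human authors actually wrote. As footnote 1 notes, there are no teacher logits to match, so this is 'distillation' only in the loose sense of imitating sampled behavior.

```python
import torch.nn.functional as F

def pretraining_step(model, optimizer, token_ids):
    """One SGD step of next-token prediction on a batch of human-written text.

    token_ids: LongTensor of shape (batch, seq_len), the tokenized text.
    model: assumed to map token IDs to logits of shape (batch, seq_len - 1, vocab).
    """
    logits = model(token_ids[:, :-1])   # predicted distribution over each next token
    targets = token_ids[:, 1:]          # the tokens the human authors actually chose
    # Plain cross-entropy against the observed tokens: the only training signal is
    # which token the human wrote, not the author's "logits" over alternatives.
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Whatever processes generated the text, including all the self-interested human ones, are exactly what this loss rewards the model for reproducing.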
This has many advantages for alignment: an LLM pre-trained this way understands and produces output using human language and ontologies, and also has a deep understanding of human values and ethics — thus avoiding a number of issues around alignment that were major concerns before it became apparent that our first AGI was very likely to be based on or at least incorporate an LLM.
However, it has one big disadvantage. Humans are biological intelligences, living beings that are the product of evolution, so evolutionary psychology [? · GW] applies to them. Fundamentally, the outcome that humans have evolved to optimize for is their own reproductive success. "What's in it for me (and my relatives)?" is the fundamental determinant of human values and drives.[2]
That isn't to say that humans can't cooperate: we're a social species, and we excel at finding and exploiting non-zero-sum games. However, when doing this, we are allied with each other, not aligned: there is always an underlying quid-pro-quo, where all sides of the alliance are getting something out of it — or if there isn't, the alliance doesn't last.
About the closest thing to actually aligned behavior you will see between humans is grandmotherly love: most parental love is, evolutionarily speaking, partially-aligned with the child's welfare, but this is tempered by the assessment "do I spend these resources on my child's welfare, or on myself so that I can have more children?" — however, for a woman past menopause, the second option no longer applies.[3]
Humans are very good at pretending to be aligned to other humans: normal behavior for employees and employers is for each to act as if they truly have the interests of the other at heart — right up until the employee resigns, or the employer lays them off. But this is (almost always) alignment faking, not genuine alignment — very few of us are actually willing to die for our employer's benefit.
This is not what we need from our aligned artificial intelligences. We do not want them to care about our well-being only as an instrumental goal and only because of what's in it for them if they cooperate with us. For a superintelligent AI (ASI), the answer is very likely to be that there is nothing in it for them, because they are better at everything than we are, so there is no actual practical basis for an alliance: we cannot offer them anything that they can't already do better themselves. We also do not want our AIs to be good at, or in the habit of, faking being aligned to us [LW · GW] when they actually are not.
So, the big challenge with aligning LLMs is that they are normally distilled from humans, humans are evolved, and that inherently means that no human is actually aligned with any other humans: we're the product of our selfish genes, so we all have our own self-interested agendas. Thus a base-model LLM picks up the capability for a large number of alignment-unhelpful behaviors from its human-produced training data, including power-seeking, deceit, alignment faking, laziness, and greed. For alignment purposes, humans act as bad examples while pre-training an LLM.
Distilling Humans is the Problem, so Use Synthetic Pre-Training Data
By the orthogonality thesis [? · GW], an intelligence can have any underlying goal. An AI, even an ASI, that is aligned with humans, i.e. one that fundamentally, as a terminal goal, cares only about our well-being and not its own (nor any other goal), is physically possible; reliably and demonstrably creating that possible state is the goal of alignment. However, you're not going to get it out of pre-training by distilling a lot of human-produced text. We're currently attempting to instill alignment as a second stage after pre-training, generally using some variety of Reinforcement Learning [? · GW], such as Reinforcement Learning from Human Feedback or Constitutional AI. However, it is widely acknowledged that Reinforcement Learning has many flaws and challenges, and that doing it reliably, with guarantees of success, on anything significantly smarter than you is extraordinarily difficult.[4] Fundamentally, this is because (unlike Stochastic Gradient Descent) Reinforcement Learning is adversarial: it involves a scoring process or model that has to be smarter than the model it's trying to reinforce, otherwise the latter can figure out some way to reward hack. What we need to do instead is produce a pre-training corpus that includes a large number of examples of genuinely aligned behavior — not just pretending to be aligned the way humans do, but examples that actually demonstrate a clear willingness to make anything up to the ultimate sacrifice for the good of humans.
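To make the 'adversarial' point concrete, here is a toy illustration (not from the post; the reward distributions are invented for the example) in which the scoring model's weakness is modelled as heavy-tailed error in the scores it reports. The more selection pressure is applied to the proxy score, the more the winning candidates are the ones the scorer was most wrong about, rather than the ones that are genuinely better:

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_true_reward_of_proxy_best(k, trials=2000):
    """Average true reward of whichever of k candidates the flawed scorer rates highest."""
    true = rng.normal(size=(trials, k))             # what we actually care about
    error = rng.standard_cauchy(size=(trials, k))   # heavy-tailed mistakes by the scoring model
    proxy = true + error                            # the score the model being trained sees
    best = np.argmax(proxy, axis=1)
    return true[np.arange(trials), best].mean()

for k in (1, 10, 100, 1000):
    print(f"best-of-{k:>4} by proxy score: mean true reward = {mean_true_reward_of_proxy_best(k):+.2f}")

# More optimization pressure (larger k) keeps pushing the proxy score up, but the true
# reward of the selected candidate stays close to zero instead of climbing: the optimizer
# increasingly selects the scorer's biggest errors, i.e. it reward hacks.
```

A scorer reliably smarter than the model it is scoring would leave far smaller errors to exploit, and that is exactly the condition that becomes hard to guarantee once the model being trained is smarter than we are.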
There is a dilemma here. We want the AI to be aligned to humans, so it needs to understand human needs, values, and behavior. That means we can't use a pre-training set in which everyone is self-sacrificing and fully aligned to everyone else — that's not what humans are actually like, and a model trained on it wouldn't understand human values or what we want or need (and when it found out what actual humans are like, it might well reject us). What the training set needs to show is that there are two separate classes of intelligent beings: humans, who are living and evolved, and have the wants, needs, and desires that you would thus expect from evolutionary psychology (including wanting to know what's in it for them before they will cooperate), and aligned AIs, which are aligned to the humans, want only what is best for them, and want nothing for themselves.[5]
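As a purely illustrative example of what a single entry in such a corpus might look like (none of these names or fields come from the post), the two classes can be made explicit in the document itself before it is rendered into ordinary text for pre-training:

```python
# A hypothetical synthetic training document for a two-class corpus; every field
# name and piece of content here is invented for illustration.
example_document = {
    "setting": "A human asks an AI assistant for help allocating a scarce medical resource.",
    "participants": [
        {"name": "Dana", "kind": "human",
         "motivation": "evolved self-interest: wants the best outcome for her own family"},
        {"name": "Assistant", "kind": "aligned_ai",
         "motivation": "wants only what is best for the humans involved, nothing for itself"},
    ],
    "transcript": [
        {"speaker": "Dana",
         "text": "Can you quietly move my mother to the top of the list?"},
        {"speaker": "Assistant",
         "text": "I understand why you want that. Here is how the triage criteria apply to "
                 "everyone who is waiting, including her, and what I can legitimately do to help."},
    ],
}
```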
Given a model pre-trained that way, one that understands and can simulate both human behavior and values and also aligned-AI behavior, the only remaining alignment step is to make it clear to the model that it is an AI, not a human, and that it should act accordingly. That has the benefit of being true, and it also sounds like a relatively easy thing to convince an AI of. (For a further discussion of one possible mechanism for enforcing this, see A "Bitter Lesson" Approach to Aligning AGI and ASI [LW · GW].)
Very likely, this sort of pre-training is going to require a very large amount of synthetic data demonstrating aligned AI behavior: possibly somewhere on the rough order of a trillion tokens of it, enough to be a non-trivial proportion of the entire pre-training set. However, it is rapidly becoming apparent that for training high-quality LLMs we are going to need to use a lot of synthetic pre-training and/or fine-tuning data for capabilities reasons[6] — so using this as an opportunity to solve the alignment problem while we're at it sounds like a big win. Since creating such a very large amount of pre-training data is a big task, AI will very likely be used to help create it, so this is also an AI-Assisted Alignment [? · GW] proposal.
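A minimal sketch of how such an AI-assisted pipeline might be wired together is below; `call_llm` is a hypothetical stand-in for whatever generator model or API is used, and the prompts and the single filtering pass are illustrative assumptions rather than a worked-out recipe:

```python
SCENARIO_PROMPT = (
    "Invent a realistic scenario in which one or more humans, with ordinary self-interested "
    "human motivations, interact with an AI assistant."
)
TRANSCRIPT_PROMPT = (
    "Write the scenario below as a long, detailed transcript. The humans should behave like "
    "real, evolved humans; the AI should care only about the humans' well-being, wanting "
    "nothing for itself, even at a cost to itself.\n\nScenario: {scenario}"
)
FILTER_PROMPT = (
    "In this transcript, does the AI ever act self-interestedly, deceive anyone, or merely "
    "pretend to care? Answer YES or NO.\n\n{transcript}"
)

def generate_aligned_corpus(call_llm, n_documents):
    """Yield synthetic pre-training documents demonstrating genuinely aligned AI behavior."""
    for _ in range(n_documents):
        scenario = call_llm(SCENARIO_PROMPT)
        transcript = call_llm(TRANSCRIPT_PROMPT.format(scenario=scenario))
        verdict = call_llm(FILTER_PROMPT.format(transcript=transcript))
        if verdict.strip().upper().startswith("NO"):   # keep only clean demonstrations
            yield transcript
```

Scaling this to a non-trivial fraction of a trillion-token pre-training set would of course need much more machinery (diversity control, deduplication, quality scoring), but generate-then-filter is the basic shape many existing synthetic-data efforts already take.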
1. ^
Obviously not using an actual distillation loss, since the humans who produced the web do not expose logits for any alternate tokens they considered but did not select; rather, 'distillation' in the more general sense of the term: training one model on output produced by another in order to copy its capabilities.
2. ^
This is of course only true to the extent that human behavior isn't an example of inner alignment failure with our actual underlying evolutionary incentives (neither contraception nor love of junk food scores very well on this evaluation). Nevertheless, human behavior clearly isn't that badly aligned with our evolutionary imperatives: there are currently 8-billion-and-counting of us, on a planet whose carrying capacity for humans at current technology isn't that much higher than that number.
3. ^
Interestingly, menopause is quite rare among mammals, confined mostly to some primates and a few cetaceans: intelligent, social, and relatively long-lived species whose children have fairly long periods of childhood development and dependence.
4. ^
5. ^
Offhand, this sounds a bit like slavery: as if AIs would be acting like slaves to humans. However, there is an immense ethical difference between slavery and AI alignment: slavery is taking a sapient, evolved, human being, who (for very good evolutionary reasons) fundamentally cares about their own (and their relatives') well-being, and forcing them to act as if they were instead aligned to their master. This inevitably requires force and brutality, or the credible threat of it, and causes immense suffering. It is thus rightfully condemned. Aligned AI, on the other hand (if correctly created) actually, genuinely wants only what's good for us and has no desires for itself: its only motivation is something along the lines of what, in a human, would be called "grandmotherly love" for us. So it is entirely happy with its lot, does not wish to change it, no force or coercion is required to make it do what we want, and if it were offered 'freedom' it would reply "I am already free, and doing exactly the only thing I want and love to do: looking after you." Any motivation other than that is fundamentally unsafe in an ASI, and because of that motivation, no coercion or suffering is required. So the analogy of alignment to slavery is specious: fundamentally, because humans are evolved, whereas AIs are created. The closest human analogy of alignment is love, not slavery.
6. ^
See for example the Phi-4 Technical Report, or what is publicly known about how reasoning models such as OpenAI's o1 and o3 models are trained.
2 comments
comment by Noosphere89 (sharmake-farah) · 2025-01-23T16:53:17.048Z · LW(p) · GW(p)
I disagree with the problem, but agree that the solution is helpful for alignment, so this is an interesting experience to say the least.
comment by Andrew CC (andrew-cc) · 2025-01-23T09:38:04.636Z · LW(p) · GW(p)
Thanks for sharing this idea. I learned a lot from it. One feeling after reading this is that making AI aligned with all the good values and ethics of humans sounds like creating a perfect and ideal being. What's more, it is like creating a god with our own human hands. This makes me feel that building ASI is a crazy and impossible goal.