What Is The Alignment Problem?

post by johnswentworth · 2025-01-16T01:20:16.826Z · LW · GW · 1 comment

Contents

  The Difficulty of Specifying Problems
    Toy Problem 1: Old MacDonald’s New Hen
    Toy Problem 2: Sorting Bleggs and Rubes
    Generalization to Alignment
    But What If The Patterns Don’t Hold?
  Alignment of What?
    Alignment of a Goal or Purpose
    Alignment of Basic Agents
    Alignment of General Intelligence
    How Does All That Relate To Today's AI?
  Alignment to What?
    What are a Human's Values?
    Other targets
      Paul!Corrigibility
      Eliezer!Corrigibility
      Subproblem!Corrigibility
      Exercise: Do What I Mean (DWIM)
  Putting It All Together, and Takeaways

So we want to align future AGIs. Ultimately we’d like to align them to human values, but in the shorter term we might start with other targets, like e.g. corrigibility [? · GW].

That problem description makes sense at a hand-wavy intuitive level, but once we get concrete and dig into technical details… wait, what exactly is the goal again? When we say we want to “align AGI”, what does that mean? And what about these “human values”? It’s easy to list things which are importantly not human values (stated preferences, revealed preferences, etc.), but what are we actually talking about? And don’t even get me started on corrigibility!

Turns out, it’s surprisingly tricky to explain what exactly “the alignment problem” refers to. And there are good reasons for that! In this post, I’ll give my current best explanation of what the alignment problem is (including a few variants and the subquestion of what human values are), and explain why it’s surprisingly difficult to explain.

To set expectations: this post will not discuss how the alignment problem relates to the more general AGI-don’t-kill-everyone problem, or why one might want to solve the alignment problem, or what the precise requirements are for AGI not killing everyone. We’ll just talk about the alignment problem itself, and why it’s surprisingly difficult to explain.

The Difficulty of Specifying Problems

First, we’ll set aside alignment specifically for a bit, and look at the specification of a couple of simple toy problems. In the context of those toy problems, we’ll explore the same kinds of subtleties which make the alignment problem so difficult to specify, so we can better understand the challenges of explaining it.

Toy Problem 1: Old MacDonald’s New Hen

Toy problem: for mysterious reasons, old farmer MacDonald wants his newest hen to be third in his flock’s pecking order. That’s the problem.

Remember, we’re not interested in solving old MacDonald’s hen problem! (It is an interesting problem, but solving it is not at all the point here.) We’re interested in how the problem is specified. What does it mean, for the newest hen to be third in the flock’s pecking order?

Key fact: empirically, hens in a flock form an (approximately) linear pecking order. What does that mean? Well, if Chickira pecks Beakoncé (i.e. Chickira is “higher” in the pecking order than Beakoncé), and Beakoncé pecks Audrey (i.e. Beakoncé is “higher” in the pecking order than Audrey), then Chickira pecks Audrey (i.e. Chickira is “higher” in the pecking order than Audrey), not vice versa.

Another way to put it: if we draw out a graph with a node for each hen in the flock, and an arrow from hen B to hen A exactly when B pecks A, then we’ll find the graph (approximately) never has any cycles - e.g. there’s no “Chickira pecks Beakoncé pecks Audrey pecks Chickira” situation, there are no two hens which both regularly peck each other, etc. As long as that’s true, we can arrange the hens in an order (the “pecking order”) such that each hen only pecks hens lower in the pecking order.

Pecking cycle; these hens have no pecking order.
These hens do have a pecking order.

Main point of this whole example: if the flock doesn’t form a linear pecking order - e.g. if there’s lots of pecking-cycles - then old MacDonald’s goal doesn’t even make sense. The new hen can’t be third in the pecking order if there is no pecking order. And the existence of a linear pecking order is an empirical fact about hens. It’s a pattern out in the physical world, and we could imagine other worlds in which that pattern turns out to not hold. 

On the other hand, so long as the flock does form a linear pecking order, it’s relatively easy to specify old MacDonald’s problem: he wants the new hen to be third in that order.
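To make the dependence on that pattern concrete, here’s a minimal code sketch (my own illustration, with made-up pecking data, not part of the original problem): check whether the observed peck-graph is acyclic, and if it is, read off the new hen’s rank in the resulting order. If the graph has cycles, the goal itself evaporates.

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical observed pecking data (illustrative only):
# pecks[B] = the set of hens that hen B has been seen pecking.
pecks = {
    "Chickira": {"Beakoncé", "Audrey", "NewHen"},
    "Beakoncé": {"Audrey", "NewHen"},
    "NewHen":   {"Audrey"},
    "Audrey":   set(),
}

# TopologicalSorter wants predecessors: for each hen, the set of hens that peck her.
pecked_by = {hen: set() for hen in pecks}
for pecker, victims in pecks.items():
    for victim in victims:
        pecked_by[victim].add(pecker)

try:
    # static_order() puts each hen after everyone who pecks her, i.e. highest-ranked first.
    order = list(TopologicalSorter(pecked_by).static_order())
    print("Pecking order (highest first):", order)
    print("New hen's rank:", order.index("NewHen") + 1)  # old MacDonald wants this to be 3
except CycleError:
    # Pecking cycles: there is no pecking order, so "third in the pecking order"
    # doesn't even refer to anything.
    print("No linear pecking order exists for this flock.")
```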

Toy Problem 2: Sorting Bleggs and Rubes

An old classic [LW · GW]:

Imagine that you have a peculiar job in a peculiar factory:  Your task is to take objects from a mysterious conveyor belt, and sort the objects into two bins.  When you first arrive, Susan the Senior Sorter explains to you that blue egg-shaped objects are called "bleggs" and go in the "blegg bin", while red cubes are called "rubes" and go in the "rube bin".

 

Once you start working, you notice that bleggs and rubes differ in ways besides color and shape.  Bleggs have fur on their surface, while rubes are smooth.  Bleggs flex slightly to the touch; rubes are hard.  Bleggs are opaque; the rube's surface slightly translucent.

As the story proceeds, we learn that the properties of bleggs and rubes are noisy: some tiny fraction of bleggs are actually purple or even red, some tiny fraction of rubes have fur, etc. So how should we sort these unusual cases?

It’s useful to picture the bleggs and the rubes as two clusters (though really they’re in a much higher dimensional space than this 2D visual):

Two clear clusters, corresponding to “two types of things”: bleggs and rubes.

If the distribution of bleggs/rubes instead looked like this, then the distinction between blegg and rube wouldn’t make much sense at all:

Only one cluster. There might be a useful continuous one-dimensional underlying property here, perhaps a “bleggness/rubeness scale”, but it’s not well characterized as “two types of things”. Another example: “hotness/coldness” is a useful continuous one-dimensional scale and is not well characterized as “two types of things: hot and cold”.

Main point of this example: if the bleggs and rubes don’t form two clusters - e.g. if there’s just one cluster - then our sorting job doesn’t even make sense. We could arbitrarily decide to sort based on some cutoff, but without the two clusters the cutoff wouldn’t really distinguish between bleggs and rubes per se, it would just be an arbitrary cutoff on some metric (and we typically wouldn’t expect that cutoff to generalize robustly to other unobserved properties of the items). And the existence of two clusters is an empirical fact about bleggs/rubes. It’s a pattern out in the (hypothetical) physical world, and we could imagine other worlds in which that pattern doesn’t hold.

So long as there are two clusters to a reasonable approximation, it’s relatively easy to precisely specify the sorting problem: estimate which cluster each item is in, then sort accordingly.
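As a concrete illustration of both halves of that (my own sketch with made-up numbers, not anything from the original story), the snippet below first checks whether “two types of things” is actually a better model of the data than one blob, and then does the sorting by inferring each item’s cluster.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Hypothetical measurements: (blueness, flexibility) for 200 bleggs and 200 rubes,
# with some noise so there are occasional edge cases.
bleggs = rng.normal(loc=[0.9, 0.8], scale=0.08, size=(200, 2))
rubes = rng.normal(loc=[0.1, 0.2], scale=0.08, size=(200, 2))
items = np.vstack([bleggs, rubes])

# Is "two clusters" a better model than "one cluster"? Compare BIC (lower = better).
one_cluster = GaussianMixture(n_components=1, random_state=0).fit(items)
two_clusters = GaussianMixture(n_components=2, random_state=0).fit(items)
print("BIC with one cluster: ", round(one_cluster.bic(items)))
print("BIC with two clusters:", round(two_clusters.bic(items)))

# If the two-cluster pattern holds, sorting = estimate which cluster each item
# came from, then put it in the corresponding bin.
bins = two_clusters.predict(items)
print("Items per bin:", np.bincount(bins))
```

If the data instead formed one blob, the two-component model would no longer beat the one-component model by much, and the bin assignments would just be an arbitrary cutoff rather than tracking “blegg vs rube”.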

Generalization to Alignment

Generalizable lesson: most goals, or problems, require some patterns to hold in the environment in order for the goal/problem to make sense at all. The goal/problem is formulated in terms of the patterns. Most of the work in precisely specifying the problem is to spell out those patterns - like e.g. what it means for a flock of hens to have a linear pecking order.

Applying this lesson to alignment, we need to ask: what patterns need to hold in the environment in order for the alignment problem to make sense at all? As with old MacDonald’s hens, most of the work of precisely stating the alignment problem will be in specifying those patterns.

But when talking about the alignment problem, we’ll be playing on hard mode - because unlike e.g. pecking orders amongst chickens, humanity does not yet have a very solid and legible understanding of the patterns which hold in either AGIs or human cognition. So we’ll need to make some guesses about empirical patterns just to state the problem.

An important thing to keep in mind here: humans had a basically-correct instinctive understanding of hens’ pecking orders for centuries before empirical researchers came along and precisely measured the lack of cycles in pecking-graphs, and theoretical researchers noticed that the lack of cycles implied a linear order.[1] Had 15th century humans said “hmm, we don’t have any rigorous peer-reviewed research about these supposed pecking orders, we should consider them unscientific and unfit for reasoning”, they would have moved further from full understanding, not closer. And so it is today, with alignment. This post will outline our best current models (as I understand them), but rigorous research on the relevant patterns is sparse and frankly mostly not very impressive. Our best current models should be taken with a grain of salt, but remember that our brains are still usually pretty good at this sort of thing at an instinctive level; the underlying intuitions are more likely to be correct than the explicit models.[2]

But What If The Patterns Don’t Hold?

I recommend most readers skip this subsection on a first read; it’s not very central to explaining the alignment problem. But I expect a significant minority of readers at this point will obsess over the question of what to do when the patterns required for expressing a problem don’t actually hold in the environment, and those readers won’t be able to pay attention to anything else until we address that issue. So this section is specifically for those readers. If that’s not you, just jump onwards to the next section!

We’ll start with a few reasons why “patterns not holding” is much less common and central than it might seem at first glance, and address a few common misconceptions along the way. Then, we’ll get to the actual question of what to do in the rare case that patterns actually don’t hold.

First reason “patterns not holding” is less central an issue than it might seem: approximation is totally fine. For instance, in the blegg/rube clustering example earlier, even if there are more than literally zero true edge cases, more than zero items right between the two clusters… that’s fine, so long as edge cases are relatively rare. There need to be two clear, distinct clusters. That does not mean the clusters need to have literally zero overlap. Similarly with the hens: pecking cycles need to be rare. That does not mean there need to be literally zero of them. Approximation is part of the game.

Second reason “patterns not holding” is less central an issue than it might seem: occasionally people will argue that the system can be modelled well at a level of abstraction lower than the patterns, as though this somehow invalidates the patterns. As a toy example: imagine someone arguing that gases are well-modeled by low level molecular dynamics, and therefore concepts like “temperature” are unnecessary and should be jettisoned. That’s not how this works! Even if the system can be modeled at a lower level, patterns like e.g. thermal equilibrium still exist in the environment, and temperature is still well defined in terms of those patterns (insofar as the patterns in fact hold). I think this sort of thing comes up in alignment to some extent with e.g. the shard folks [? · GW]: part of what many of them are trying to do is formulate theories at a lower level than we do in this post, and they often reject attempts to explain things at a higher level. But even if models at the lower level of “shards” work great, that would not imply that higher-level patterns don’t exist.

Third reason “patterns not holding” is less central an issue than it might seem: the Generalized Correspondence Principle. When quantum mechanics or general relativity came along, they still had to agree with classical mechanics in all the (many) places where classical mechanics worked. More generally: if some pattern in fact holds, then it will still be true that the pattern held under the original context even if later data departs from the pattern, and typically the pattern will generalize in some way to the new data. Prototypical example: maybe in the blegg/rube example, some totally new type of item is introduced, a gold donut (“gonut”). And then we’d have a whole new cluster, but the two old clusters are still there; the old pattern is still present in the environment.

Ok, that’s three reasons why “patterns not holding” is less central and common than people often think. But it’s still possible for patterns to not hold, so let’s address the actual question: what should old MacDonald do if his hens in fact do not form a linear pecking order? What should the sorter do if bleggs and rubes in fact do not form two distinct clusters?

Two main possibilities:

  • Maybe old MacDonald really wants the new hen to be third in the pecking order, so if the hens don’t form a linear pecking order, he’ll try to make them form a linear pecking order.
  • Or, maybe he’ll reexamine why he wanted the new hen to be third in the pecking order in the first place, and figure out how to achieve his upstream goals in a world without a linear pecking order.

Alignment of What?

We’re now ready to explain the alignment problem itself, and the sorts of patterns in the environment required for the alignment problem to make sense at all. We’ll start with the broadest version: what kinds of things can be “aligned” in general, besides just AGI? What patterns need to hold in a system to sensibly talk about the system “being aligned” in the same sense as the alignment problem?

After those questions, we’ll talk about what kinds of patterns might constitute an “intelligence” or “general intelligence”, and what it would mean to “align” those kinds of patterns specifically.

Then, in the next section, we’ll move to the “align to what exactly?” part of the problem. We’ll talk about what kinds of patterns need to be present in human cognition in order for “human values” to be a sensible thing to talk about. Finally, we’ll briefly touch on several flavors of corrigibility.

Alignment of a Goal or Purpose

Let's start with the simplest kind of system for which it makes sense to talk about "alignment" at all: a system which has been optimized [LW · GW] for something, or is at least well compressed by modeling it as having been optimized. What patterns must the environment present in order for things like "having been optimized" to make sense, and what would "alignment" mean in that context?

Toy example: suppose I find a pile of driftwood on the beach, and somehow notice that the wood in this pile fits together very tightly and neatly into a sphere, like a big 3D puzzle. The fact that the wood fits together into a sphere is a very nontrivial "pattern in the environment" - specifically a pattern in the pile of wood. And it's a pattern which we wouldn't expect to find in a random pile of wood; it sure makes us think that the pile of wood has been optimized to fit together that way, whether by humans or by some strange natural process. Mathematically, I might conjecture that the pile of wood is approximately-best-compressed by a program which involves explicit optimization (like e.g. a "minimize" function or a "solve" function) for fitting together into a sphere.

The pile, after I've assembled it.
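To give a rough flavor of what "well-compressed by modeling it as having been optimized" could cash out to, here's a toy numerical sketch (my own illustration with made-up numbers; a crude stand-in for a real minimum-description-length calculation, not this post's formalism): after fitting a sphere, the leftover radial information in the sphere-like pile is tiny, whereas fitting a sphere to a random pile of wood saves essentially nothing.

```python
import numpy as np

rng = np.random.default_rng(0)
PRECISION = 1e-3  # we only care about piece positions down to ~1mm
RAW_RANGE = 2.0   # with no model, a piece's distance from the centroid could be anywhere in [0, 2]

def radial_bits_no_model(points):
    # Encode each piece's distance-from-centroid over the full plausible range.
    return len(points) * np.log2(RAW_RANGE / PRECISION + 1)

def radial_bits_sphere_model(points):
    # Fit a sphere (centroid + mean radius), then encode only the residuals off the sphere.
    radii = np.linalg.norm(points - points.mean(axis=0), axis=1)
    residuals = radii - radii.mean()
    spread = residuals.max() - residuals.min() + PRECISION
    return len(points) * np.log2(spread / PRECISION + 1)

# "Optimized-looking" pile: 500 pieces lying almost exactly on a unit sphere.
theta = np.arccos(rng.uniform(-1, 1, 500))
phi = rng.uniform(0, 2 * np.pi, 500)
sphere_pile = np.column_stack([np.sin(theta) * np.cos(phi),
                               np.sin(theta) * np.sin(phi),
                               np.cos(theta)]) + rng.normal(0, 1e-4, (500, 3))

# Random pile: 500 pieces scattered uniformly in a box.
random_pile = rng.uniform(-1, 1, (500, 3))

for name, pile in [("sphere-like pile", sphere_pile), ("random pile", random_pile)]:
    print(f"{name}: {radial_bits_no_model(pile):.0f} bits with no model, "
          f"{radial_bits_sphere_model(pile):.0f} bits after fitting a sphere")
```

The big savings on the first pile, and the lack of savings on the second, is the quantitative shadow of "this pile looks like it was optimized to fit together into a sphere".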

What would "alignment" mean for that pile of wood? Well, insofar as the pile of wood is well-compressed by modeling it as having been optimized for something, it makes sense to ask what the wood seems to have been optimized for. In this case, it sure seems like the pile of wood has been optimized for fitting together into a sphere. Then the "alignment" question asks: is fitting-together-into-a-sphere something we want from this pile of wood? Is that goal aligned to our own goals, or is it orthogonal or opposed to our own goals?

Now let's generalize that toy example, and address some of the standard subtleties and open problems.

General recipe: we look for some part of the environment which is well-compressed by modeling it as having been optimized. That's the "pattern in the environment". Insofar as our chunk of the environment is well-compressed by modeling it as having been optimized, it makes sense to ask what it seems to have been optimized for. And then, we can ask whether the goal it seems to have been optimized for is aligned/opposed/orthogonal to our own goals.

Some standard subtleties and open problems of this recipe:

Note for philosophers

Yes, this section has basically been a formulation of teleology. Yes, I am aware of the usual approach of grounding teleology in biological evolution specifically, and the formulation in this section is somewhat different from that. The reason is that we're answering a different question: the grounding in evolution answers the question of "original teleology", i.e. how something acquires a purpose without being produced by something else purposeful (or in our terms: how something can be optimized without an optimizer which has itself been optimized). For our purposes, we don't particularly care about original teleology. Indeed, a focus on original teleology is often actively antihelpful when thinking about AI alignment, since it tempts one to define the purpose of a thing in terms of the original teleology (e.g. evolutionary fitness)... which risks defining away inner alignment failures [? · GW]. If e.g. one finds oneself arguing that the purpose of condoms is, ultimately, to increase genetic fitness, then one has made a mistake and should back up.

Alignment of Basic Agents

There are a few importantly-different useful notions of "agency". In this section, we'll talk about "basic agency" - not that the agents themselves are necessarily "basic", but that it's a relatively simple and broad notion of agency, one which includes humans but also E. coli and thermostats.

A neat thing about humans and E. coli and thermostats is that they do different things across a wide variety of environmental conditions, so as to cause their environment to appear optimized for certain things reasonably consistently. A thermostat, for instance, turns a heater on and off differently across a wide variety of starting temperatures and a wide variety of external temperatures, so as to hold a room at a certain temperature. An E. coli tumbles and swims differently across a wide variety of starting conditions, so as to end up near a sugar crystal placed in water. Organisms in general typically sense their environment and take different actions across a wide variety of environmental conditions, so as to cause there to be approximate copies of themselves in the future.[4] That's basic agency.[5]

Stated in terms of patterns in the environment: we're looking for a system which is well-compressed by modeling it as choosing outputs as a function of inputs, so as to make the environment appear optimized for a certain consistent goal.

Similar to the previous section: if a system is well-compressed by modeling it as blah blah blah certain consistent goal, then it makes sense to ask what that goal is, and whether it's aligned/opposed/orthogonal to what we want.

How does this relate to alignment of non-agentic stuff, as in the previous section? Well, a basic agent causes the environment to look like it's been optimized for something; the basic agent is a generator of optimized-looking stuff. So, when talking about alignment of basic agents, we're effectively asking "how aligned is the optimized-looking stuff which this agent tends to generate?". A thermostat, for example, tends to generate rooms in which the air has a specific temperature. How well does that align with what we humans want? Well, it aligns well insofar as the target temperature is comfortable for humans in the room.[6]
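Here's a minimal simulation sketch of the thermostat case (my own toy model, all numbers made up): one fixed input-dependent rule, run across widely varying starting temperatures and outside temperatures, steers the room into a narrow band of final temperatures. That narrow outcome band is what "the environment appears optimized for a target temperature" looks like.

```python
import random

SETPOINT = 20.0  # degrees C; the target the thermostat implicitly optimizes toward

def simulate_room(start_temp, outside_temp, hours=72):
    temp = start_temp
    for _ in range(hours):
        # The agent's output (heating/cooling power) is a function of its input (measured temp).
        hvac = max(-5.0, min(5.0, 0.9 * (SETPOINT - temp)))
        temp += hvac + 0.1 * (outside_temp - temp)  # HVAC effect plus heat leaking outside
    return temp

random.seed(0)
finals = [simulate_room(random.uniform(-10.0, 35.0), random.uniform(-20.0, 40.0))
          for _ in range(1000)]
print("Starting temps spanned -10 to 35 C, outside temps -20 to 40 C,")
print(f"yet final room temps span only {min(finals):.1f} to {max(finals):.1f} C.")
```

Whether the thermostat is aligned is then a question about that narrow band: is roughly 20 C what the humans in the room actually want?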

Again, some subtleties and open problems:

Alignment of General Intelligence

When I talk about "the alignment problem" or "the AI alignment problem", I'm usually talking about alignment of general intelligence. The alignment part doesn't actually involve any new pieces; a generally intelligent agent is a special case of basic agency, and everything we said about alignment of basic agents carries over. But it is worth discussing what kinds of patterns in the environment constitute "general intelligence" specifically, as opposed to the broader patterns of "basic agency", so we understand what "alignment of general intelligence" refers to.

I’ll quote from What’s General-Purpose Search [LW · GW][7] here: 

Benito [LW · GW] has an interesting job. Here’s some of the stuff he’s had to do over the past couple years:

  • build a prototype of an office
  • resolve neighbor complaints at a party
  • write public explanations of grantmaking decisions
  • ship books internationally by Christmas
  • moderate online debates
  • figure out which of 100s of applicants to do trial hires with

Quite a wide variety!

Benito illustrates an interesting feature of humans: you can give humans pretty arbitrary goals, pretty arbitrary jobs to do, pretty arbitrary problems to solve, and they'll go figure out how to do it. It seems like humans have some sort of “general-purpose problem-solving” capability.

That’s basically the idea of general intelligence. There are some kinds of agents which are able to solve a particularly wide variety of subproblems as they come up, including totally novel kinds of subproblems which they’ve never dealt with before. 

(One of the more common responses I hear at this point is some variation of “general intelligence isn’t A Thing, people just learn a giant pile of specialized heuristics via iteration and memetic spread”. There are two main classes of evidence that the “empirical iteration -> pile of specialized heuristics model” is basically wrong, or at least incomplete. First, on an outside view, we live in a very high-dimensional environment, so relatively brute-force learning can be ruled out on efficiency grounds; humans are empirically far too efficient to not have at least some general intelligence beyond just learning an arbitrary heuristic pile. See the posts on optimization [LW · GW] and science [LW · GW] in high-dimensional worlds for more on that. Second, on an inside view, researchers already know of at least some general-purpose heuristic generators [LW · GW]. While the known methods probably aren’t the whole story of general intelligence, they’re at least an existence proof: we know that techniques at that level of generality really do exist.)

So what patterns need to hold in the environment to talk about general intelligence/agency? Well, it’s the same as basic agency, with the added condition that the agent-pattern needs to be able to solve a very broad class of new kinds of subproblems as they come up.

Subtleties and open problems, in addition to those from the Basic Agents section:

How Does All That Relate To Today's AI?

As of this writing, OpenAI has been throwing around the word "agent" lately, but that's mostly marketing. The way the terms have typically been used historically, the simplest summary would be:

We haven't talked much about world models or planners or solvers in this post, because they require making more assumptions about an agent's internal structure (... or as-yet-unproven conjectures about that structure [LW · GW]), and I don't want to bring in unnecessary assumptions.

That said, some key points:

Applying all that to typical usage of LLMs (including o1-style models): an LLM isn't the kind of thing which is aligned or unaligned, in general. If we specify how the LLM is connected to the environment (e.g. via some specific sensors and actuators, or via a human user), then we can talk about both (a) how aligned to human values is the nominal objective given to the LLM[8], and (b) how aligned to the nominal objective are the LLM's actual effects on its environment. Alignment properties depend heavily on how the LLM is wired up to the environment, so different usage or different scaffolding will yield different alignment properties.

A final note: a sufficiently advanced planner/solver with a lot of background knowledge of the world could potentially figure out how its sensors and actuators are wired up to the world, and then optimize the environment in a consistent way across many different sensor/actuator arrangements. At that point, we could meaningfully talk about alignment of the planner/solver without the detailed context of how-it's-wired-up. But as of writing, I don't think that's yet very relevant to current AI.

Alignment to What?

We've now addressed what sort of patterns need to hold in the world in order to meaningfully talk about a thing being aligned, and specifically a basic agent being aligned, and even more specifically a general intelligence being aligned. It's time to address the other side of the alignment problem: aligned to what? In particular, in the long run (though importantly not necessarily as the first step!) we probably want to align AI to humans' values. What exactly are these "human values"? We often need to clarify that human values are not revealed preferences or stated preferences or dopamine levels or whatever other flavor of readily-measurable preferences someone likes. But what are human values, and what patterns need to hold in our environment in order for a human's values to be a sensible thing to talk about at all, separate from stated preferences, revealed preferences, dopamine, and all those other things?

What are a Human's Values?

We'll focus on the values of just one single human. How to generalize to multiple humans is... not an unimportant question, but a question whose salience is far, far out of proportion to its relative importance, so we're going to counteract that disproportionality by completely ignoring it here.

So, what do we mean when we talk about a human's values? And in particular, what patterns need to hold in the world in order for a human's values to be a sensible thing to talk about?

The big stylized facts we'll rely on are:

It is rather difficult to make these stylized facts play together. I know of only one class of cognitive model compatible with all of them at once. Hutter and Everitt called it Value Reinforcement Learning, though the name does not make the model obvious.

Here's how David and I have summarized [LW · GW] the idea [LW · GW] before:

In particular, the brain tries to compress the reward stream by modeling it as some (noisy) signal generated from value-assignments to patterns in the brain's environment. So e.g. the brain might notice a pattern-in-the-environment which we label "sports car", and if the reward stream tends to spit out positive signals around sports cars (which aren't already accounted for by the brain's existing value-assignments to other things), then the brain will (marginally) compress that reward stream by modeling it as (partially) generated from a high value-assignment to sports cars. See the linked [LW · GW] posts [LW · GW] for a less compressed explanation, and various subtleties.

... so that's a complicated hand-wavy model. I think it's roughly correct, since it's pretty hard to explain the stylized facts of human values otherwise. But the more important point here is: there's this thing in the model which we call "values". And insofar as the model doesn't hold at least approximately for actual humans, it probably doesn't make sense to talk about the human's values, or at least not any kind of "human values" which are very similar to the things I usually call "human values". At a hand-wavy level, that model is the "pattern which needs to hold in the environment" in order for the-thing-I-mean-by-"human values" to make sense at all.
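For concreteness, here's a heavily simplified toy sketch (my own, not the actual Value Reinforcement Learning formalism, with made-up patterns and numbers) of the core compression move: treat the reward stream as noisily generated from value-assignments to observed patterns, then fit those value-assignments so that the stream is mostly explained by them and only a small residual remains.

```python
import numpy as np

rng = np.random.default_rng(0)

patterns = ["sports car", "mud", "friendly dog"]   # patterns the brain has noticed
true_values = np.array([3.0, -1.0, 2.0])           # hypothetical "true" value-assignments

# Each timestep, some subset of the patterns is present (binary features), and the
# reward stream emits a noisy signal driven by the values of whatever is present.
present = rng.integers(0, 2, size=(5000, len(patterns)))
reward_stream = present @ true_values + rng.normal(0, 0.5, size=5000)

# The compression move: find the value-assignments that best explain the stream.
inferred_values, *_ = np.linalg.lstsq(present, reward_stream, rcond=None)
residual = reward_stream - present @ inferred_values

print("inferred value-assignments:", dict(zip(patterns, inferred_values.round(2))))
print(f"reward variance {reward_stream.var():.2f} -> residual variance {residual.var():.2f}")
```

The inferred numbers play the role of "values" in this toy model; the real picture is messier (rewards only partially accounted for, patterns learned online, etc.), but the compress-the-reward-stream move is the same.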

(I should note here that lots of people claim that, when they talk about human values, they mean <other definition>. But in approximately 100% of cases, one finds that the definition given is not a very good match even to their own usage of the term, even allowing for some looseness and the occasional outright mistake in casual use. More generally, this problem applies basically whenever anyone tries to define any natural language term; that's why I usually recommend using examples instead of definitions whenever possible.[10])

One more key point: the extent to which Value Reinforcement Learning is in fact a good model of human cognition is, in principle, an empirical question. It should be testable. Empirical tests of the pattern were less relevant earlier when talking about agents and goals and optimization, because it's pretty clear that all the patterns we talked about there do in fact occur; the uncertainty is mostly over whether/how those patterns accurately summarize the things humans recognize as agents and goals and optimization. But now that we've moved on to human values, we see that our uncertainty is at least as much over whether the pattern holds for humans at all as over whether we've correctly identified the thing humans call "values" within the model.

Other targets

What about more medium-term targets - things which would potentially make sense as safe alignment targets before we've ironed out the kinks enough for alignment to human values to be a good idea?

There's a potential long tail of possibilities here, and I'm only going to cover a few, chosen mainly to illustrate the kinds of things we might need to deal with. These will also be somewhat shorter, largely because our understanding of them is relatively poor.

Paul!Corrigibility

Quoting Paul Christiano [LW(p) · GW(p)]:

I think that corrigibility is more likely to be a crisp property amongst systems that perform well-as-evaluated-by-you. I think corrigibility is much more likely to be useful in cases like this where it is crisp and natural.

Roughly speaking, I think corrigibility is crisp because there are two very different ways that a behavior can end up getting evaluated favorably by you, and the intermediate behaviors would be evaluated unfavorably.

As an example, suppose that you asked me to clean your house and that while cleaning I accidentally broke a valuable vase. Some possible options for me:

  1. Affirmatively tell you about the broken vase.
  2. Clean up the broken vase without notifying you.
  3. Make a weak effort to hide evidence, for example by taking out the trash and putting another item in its place, and denying I know about the vase if asked.
  4. Make a strong effort to hide evidence, for example by purchasing a new similar-looking vase and putting it in the same place, and then spinning an elaborate web of lies to cover up this behavior.

Let's say you prefer 1 to 2 to 3. You would like behavior 4 least of all if you understood what was going on, but in fact if I do behavior 4 you won't notice anything wrong and so you would erroneously give it the best score of all. This means that the space of good-performing solutions has two disconnected pieces, one near option 1, which I'll call "corrigible" and the other near option 4 which I'll call "incorrigible."

In the language of this post: Paul is conjecturing that there exists a pattern... not exactly in our physical environment, but in the space of strategies-which-humans-give-positive-feedback-to-in-our-environment. And that pattern involves two reasonably well-separated clusters: one in which the strategy actually does basically what the human wants, and another in which the strategy fools the human. Insofar as there are in fact two separate clusters like that, Paul wants to call the non-trickery one "corrigible", and use it as an alignment target.

As with our model of human values, note that Paul's notion of "corrigibility" involves an empirical claim about the world (or, more precisely, the space of strategies-which-humans-give-positive-feedback-to-in-our-environment). It should be testable.

Eliezer!Corrigibility

Quoting Eliezer Yudkowsky:

The "hard problem of corrigibility" is interesting because of the possibility that it has a relatively simple core or central principle - rather than being value-laden on the details of exactly what humans value, there may be some compact core of corrigibility that would be the same if aliens were trying to build a corrigible AI, or if an AI were trying to build another AI.

We can imagine, e.g., the AI imagining itself building a sub-AI while being prone to various sorts of errors, asking how it (the AI) would want the sub-AI to behave in those cases, and learning heuristics that would generalize well to how we would want the AI to behave if it suddenly gained a lot of capability or was considering deceiving its programmers and so on.

In the language of this post: Eliezer is conjecturing that there exists a general pattern in the ways-in-which highly generally intelligent agents deal with generally intelligent subagents, child agents, etc. The conjecture doesn't even necessarily say what that pattern is, just that there's some consistent method there, which can hopefully generalize to our relationship with our eventual child agents (i.e. AGI smarter than us).

Again, there's an empirical claim here, and it's testable in principle... though this one is rather more dangerous to test, unless the pattern of interest generalizes down to agents less capable than humans. (Which seems unlikely, since the sort of pattern Eliezer expects is not much present in humans, if I understand correctly; it requires substantially superhuman capabilities.) Also it's a very indirect specification.

Subproblem!Corrigibility

Quoting David Lorell and myself [LW · GW] (with some substantial edits of details not centrally relevant here):

… the gaps/subproblems in humans’ plans are typically modular - i.e. we expect to be able to solve each subproblem without significantly changing the “outer” partial plan, and without a lot of coupling between different subproblems. That’s what makes the partial plan with all its subproblems useful in the first place: it factors the problem into loosely-coupled subproblems.

On this model, part of what it implicitly means to “solve” a subproblem (or instrumental goal) is that the “solution” should roughly preserve the modularity of the subproblem. That means the solution should not have a bunch of side effects which might mess with other subproblems, or mess up the outer partial plan. Furthermore, since the rest of the problem might not have been solved yet, the solution needs to work for a whole broad class of potential contexts, a whole broad class of ways-the-rest-of-the-problem-might-be-solved. So, the solution needs to robustly not have side effects which mess up the rest of the plan, across a wide range of possibilities for what “the rest of the plan” might be. Also, ideally the solution would make available plenty of relevant information about what it’s doing, so that other parts of the plan can use information found in the process of solving the subproblem. And that all sounds an awful lot like corrigibility.

Bringing it back to patterns: we conjecture that there exist many patterns in the subproblems which arise when solving problems in our world. In particular, subproblems typically implicitly involve some kind of robust modularity, so solutions won't interfere with whatever else is needed to solve other subproblems. Also, subproblems typically implicitly involve making information visible to the rest of the system, just in case it’s needed.

Again, there's a testable empirical claim here, in principle, about the structure of subproblems which convergently come up in the process of solving problems.

Exercise: Do What I Mean (DWIM)

I haven't thought much about what patterns need to hold in the environment in order for "do what I mean" to make sense at all. But it's a natural next target in this list, so I'm including it as an exercise for readers: what patterns need to hold in the environment in order for "do what I mean" to make sense at all? Note that either necessary or sufficient conditions on such patterns can constitute marginal progress on the question.

Putting It All Together, and Takeaways

What does it mean to align artificial general intelligence to human values (or corrigibility)? Putting together all the pieces from this post:

That’s a lot of pieces, each of which is fairly complex on its own. It would be somewhat surprising if all of it were exactly correct. So to wrap up, a reminder: had 15th century humans said “hmm, we don’t have any rigorous peer-reviewed research about these supposed pecking orders, we should consider them unscientific and unfit for reasoning”, they would have moved further from full understanding, not closer. And so it is today, with alignment.

The picture sketched here is complicated and deep, but we have at least some prior evidence (intuition, arguments) separately for each piece, so even if one piece is wrong it doesn’t necessarily break everything else. Ideally, we’d like to both test the pieces, and iterate on our own understanding of the patterns underlying our own concepts.

Acknowledgements: Though David Lorell is not a coauthor on this particular post, many of the ideas and much of the framing were worked out with him. Also, a few parts - especially the teleology section - benefited from discussions with Steve Petersen and Ramana Kumar. Thank you!

  1. ^

    Actually historically I think the theory and experiment were handled together, but I want to emphasize that both components are load-bearing.

  2. ^

    … which notably does not imply that they’re correct.

  3. ^

    In a compression context, note that we usually have nontrivial probability on a few near-shortest programs, not just on the single shortest program. This is especially important with multiple agents, since different priors will typically lead to some small disagreement about which near-shortest programs are in fact shortest. Those are two big reasons, among others, why "just use the single shortest program" is not a compelling resolution to underdetermination.

  4. ^

    Rewording that sentence to properly account for sexual reproduction is left as an exercise to the reader.

  5. ^

    Why highlight basic agency in particular as a natural type of agency to focus on? I find it particularly interesting because it distills the core idea of Maxwell's demon: a system which observes its environment, then takes actions as a function of its observations, in such a way that the system is steered into a relatively-narrow outcome space. That framing strongly suggests that basic agency is the right notion for thermodynamic agency models. Indeed, David and I have at least one simple theorem along those lines in the writeup queue.

  6. ^

    This example also highlights the question of what patterns in the environment constitute "control", which is a whole 'nother can of worms.

  7. ^

    General intelligence/agency is not strictly synonymous with general-purpose search, but they’re pretty closely related conceptually.

  8. ^

    Note that, while the type signature of goals is an open problem, the answer is definitely not "natural language". So there's an additional subtlety here about how exactly a natural language "goal specification" maps to an actual goal. For instance, if I ask the system to light a candle, does that natural language "goal specification" implicitly include not lighting the rest of my house on fire? An actual goal includes that sort of detail. And as the fire example suggests, the mapping between natural language specification and an actual goal is quite load-bearing for questions of alignment.

  9. ^

    Note that, in using "anti-endorsement" and "avoidance" as evidence of values, we're relying on stated and revealed preferences as proxy measures of values. Stated and revealed preferences are not synonymous with values, but they are useful proxies!

  10. ^

    ... of course a natural question is then "John, why are you giving so many definitions, if you explicitly recommend people not do that?". And the short answer is that I think I have done a much better job than approximately 100% of cases.

1 comment

Comments sorted by top scores.

comment by Noosphere89 (sharmake-farah) · 2025-01-16T01:57:11.276Z · LW(p) · GW(p)

Third reason “patterns not holding” is less central an issue than it might seem: the Generalized Correspondence Principle. When quantum mechanics or general relativity came along, they still had to agree with classical mechanics in all the (many) places where classical mechanics worked. More generally: if some pattern in fact holds, then it will still be true that the pattern held under the original context even if later data departs from the pattern, and typically the pattern will generalize in some way to the new data. Prototypical example: maybe in the blegg/rube example, some totally new type of item is introduced, a gold donut (“gonut”). And then we’d have a whole new cluster, but the two old clusters are still there; the old pattern is still present in the environment.

 

While a trivial version of something like this holds true, the Correspondence Principle doesn't apply everywhere. There are 2 positive results on a correspondence theorem holding, but there is also a negative result stating that the correspondence principle is false in the general case of physical laws/rules whose only requirement is that they be Turing-computable, which means that there's no way to make theories all add up to normality in all cases.

More here:

https://www.lesswrong.com/posts/XMGWdfTC7XjgTz3X7/a-correspondence-theorem-in-the-maximum-entropy-framework [LW · GW]

https://www.lesswrong.com/posts/FWuByzM9T5qq2PF2n/a-correspondence-theorem [LW · GW]

https://www.lesswrong.com/posts/74crqQnH8v9JtJcda/egan-s-theorem#oZNLtNAazf3E5bN6X [LW(p) · GW(p)]

https://www.lesswrong.com/posts/74crqQnH8v9JtJcda/egan-s-theorem#M6MfCwDbtuPuvoe59 [LW(p) · GW(p)]

https://www.lesswrong.com/posts/74crqQnH8v9JtJcda/egan-s-theorem#XQDrXyHSJzQjkRDZc [LW(p) · GW(p)]