The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables

post by johnswentworth · 2020-11-18T17:47:40.929Z · LW · GW · 49 comments

Contents

  The Setup
  The Pointers Problem
  Pointer-Related Maladies
    Genocide Under-The-Radar
    Not-So-Easy Wireheading Problems
    People I Will Never Meet
    “Misspecified” Models
  Takeaway

An AI actively trying to figure out what I want might show me snapshots of different possible worlds and ask me to rank them. Of course, I do not have the processing power to examine entire worlds; all I can really do is look at some pictures or video or descriptions. The AI might show me a bunch of pictures from one world in which a genocide is quietly taking place in some obscure third-world nation, and another in which no such genocide takes place. Unless the AI already considers that distinction important enough to draw my attention to it, I probably won’t notice it from the pictures, and I’ll rank those worlds similarly - even though I’d prefer the one without the genocide. Even if the AI does happen to show me some mass graves (probably secondhand, e.g. in pictures of news broadcasts), and I rank them low, it may just learn that I prefer my genocides under-the-radar.

The obvious point of such an example is that an AI should optimize for the real-world things I value, not just my estimates of those things. I don't just want to think my values are satisfied, I want them to actually be satisfied. Unfortunately, this poses a conceptual difficulty: what if I value the happiness of ghosts? I don't just want to think ghosts are happy, I want ghosts to actually be happy. What, then, should the AI do if there are no ghosts?

Human "values" are defined within the context of humans' world-models, and don't necessarily make any sense at all outside of the model (i.e. in the real world). Trying to talk about my values "actually being satisfied" is a type error.

Some points to emphasize here:

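These features make it rather difficult to “point” to values - it’s not just hard to formally specify values, it’s hard to even give a way to learn values. It’s hard to say what it is we’re supposed to be learning at all. What, exactly, are the inputs to my value-function? It seems like:

  • The inputs are the actual state of the world rather than my own estimate of it: I want other people to actually be happy, not just to look-to-me like they’re happy.
  • The inputs are high-level objects, like people or trees or ghosts, rather than complete low-level world states, and some of those objects may not even exist.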

How can both of those intuitions seem true simultaneously? How can the inputs to my values-function be the actual state of the world, but also high-level objects which may not even exist? What things in the low-level physical world are those “high-level objects” pointing to?

If I want to talk about "actually satisfying my values" separate from my own estimate of my values, then I need some way to say what the values-relevant pieces of my world model are "pointing to" in the real world.

I think this problem - the “pointers to values” problem, and the “pointers” problem more generally - is the primary conceptual barrier to alignment right now. This includes alignment of both “principled” and “prosaic” AI. The one major exception is pure human-mimicking AI, which suffers from a mostly-unrelated set of problems (largely stemming from the shortcomings of humans, especially groups of humans).

I have yet to see this problem explained, by itself, in a way that I’m satisfied by. I’m stealing the name from some of Abram’s posts, and I think he’s pointing to the same thing I am, but I’m not 100% sure.

The goal of this post is to demonstrate what the problem looks like for a (relatively) simple Bayesian-utility-maximizing agent, and what challenges it leads to. This has the drawback of defining things only within one particular model, but the advantage of showing how a bunch of nominally-different failure modes all follow from the same root problem: utility is a function of latent variables. We’ll look at some specific alignment strategies, and see how and why they fail in this simple model.

One thing I hope people will take away from this: it’s not the “values” part that’s conceptually difficult, it’s the “pointers” part.

The Setup

We have a Bayesian expected-utility-maximizing agent, as a theoretical stand-in for a human. The agent’s world-model is a causal DAG over variables $X$, and it chooses actions $A$ to maximize $E[u(X) \mid do(A)]$ - i.e. it’s using standard causal decision theory. We will assume the agent has a full-blown Cartesian boundary, so we don’t need to worry about embeddedness and all that. In short, this is a textbook-standard causal-reasoning agent.
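As a rough illustration of this kind of agent, here is a minimal Python sketch. It assumes a toy world with one action variable and one latent variable; the variable names, actions, and probabilities are hypothetical and chosen only to make the expected-utility-under-intervention calculation concrete.

```python
# Toy causal model: action A -> latent H ("my friend is happy").
# All names and numbers here are made up for illustration.

def p_happy_given_do(action):
    """P(H = True | do(A = action)) in the agent's own world-model."""
    return {"call": 0.8, "ignore": 0.3}[action]

def utility(happy):
    """u depends only on the latent variable H, not on any observation."""
    return 1.0 if happy else 0.0

def expected_utility(action):
    """E[u | do(action)], computed by summing over the latent's two values."""
    p = p_happy_given_do(action)
    return p * utility(True) + (1 - p) * utility(False)

def choose_action(actions):
    """Pick the action with the highest expected utility."""
    return max(actions, key=expected_utility)

best = choose_action(["call", "ignore"])
```

Note that the agent only ever evaluates an expectation; nothing in the calculation requires it to learn the realized value of the latent.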

One catch: the agent’s world-model uses the sorts of tricks in Writing Causal Models Like We Write Programs, so the world-model can represent a very large world without ever explicitly evaluating probabilities of every variable in the world-model. Submodels are expanded lazily when they’re needed. You can still conceptually think of this as a standard causal DAG; it’s just that the model is lazily evaluated.

In particular, thinking of this agent as a human, this means that our human can value the happiness of someone they’ve never met, never thought about, and don’t know exists. The utility $u$ can be a function of variables which the agent will never compute, because the agent never needs to fully compute $u$ in order to maximize it - it just needs to know how $u$ changes as a function of the variables influenced by its actions.

Key assumption: most of the variables in the agent’s world-model are not observables. Drawing the analogy to humans: most of the things in our world-models are not raw photon counts in our eyes or raw vibration frequencies/intensities in our ears. Our world-models include things like trees and rocks and cars, objects whose existence and properties are inferred from the raw sense data. Even lower-level objects, like atoms and molecules, are latent variables; the raw data from our eyes and ears does not include the exact positions of atoms in a tree. The raw sense data itself is not sufficient to fully determine the values of the latent variables, in general; even a perfect Bayesian reasoner cannot deduce the true position of every atom in a tree from a video feed.
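To make the last claim concrete, here is a small Python sketch, assuming a hypothetical two-state latent ("happy" vs. "faking") and made-up likelihoods. When two latent states predict the observations equally well, even exact Bayesian updating leaves the posterior where the prior was.

```python
def posterior(prior, likelihood, observation):
    """Exact Bayes' rule over a finite set of latent states."""
    unnormalized = {h: prior[h] * likelihood[h][observation] for h in prior}
    total = sum(unnormalized.values())
    return {h: weight / total for h, weight in unnormalized.items()}

# Hypothetical numbers: both latent states produce a smile equally often,
# so the observation carries no information about which one is true.
prior = {"happy": 0.5, "faking": 0.5}
likelihood = {
    "happy": {"smile": 0.9, "frown": 0.1},
    "faking": {"smile": 0.9, "frown": 0.1},
}

post = posterior(prior, likelihood, "smile")
```

More data of the same kind does not help here; the limitation is in what the observations can distinguish, not in the reasoner.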

Now, the basic problem: our agent’s utility function is mostly a function of latent variables. Human values are mostly a function of rocks and trees and cars and other humans and the like, not the raw photon counts hitting our eyeballs. Human values are over inferred variables, not over sense data.

Furthermore, human values are over the “true” values of the latents, not our estimates - e.g. I want other people to actually be happy, not just to look-to-me like they’re happy. Ultimately, $E[u(X)]$ is the agent’s estimate of its own utility (thus the expectation), and the agent may not ever know the “true” value of its own utility - i.e. I may prefer that someone who went missing ten years ago lives out a happy life, but I may never find out whether that happened. On the other hand, it’s not clear that there’s a meaningful sense in which any “true” utility-value exists at all, since the agent’s latents may not correspond to anything physical - e.g. a human may value the happiness of ghosts, which is tricky if ghosts don’t exist in the real world.

On top of all that, some of those variables are implicit in the model’s lazy data structure and the agent will never think about them at all. I can value the happiness of people I do not know and will never encounter or even think about.

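So, if an AI is to help optimize for $u(X)$, then it’s optimizing for something which is a function of latent variables in the agent’s model. Those latent variables:

  • May not correspond to anything in the physical world or in the AI’s world-model.
  • May never be computed by the agent at all, because of lazy evaluation.
  • May not be determined by the agent’s observed data.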

… and of course the agent’s model might just not be very good, in terms of predictive power.

As usual, neither we (the system’s designers) nor the AI will have direct access to the model; we/it will only see the agent’s behavior (i.e. input/output) and possibly a low-level system in which the agent is embedded. The agent itself may have some introspective access, but not full or perfectly reliable introspection.

Despite all that, we want to optimize for the agent’s utility, not just the agent’s estimate of its utility. Otherwise we run into wireheading-like problems, problems with the agent’s world model having poor predictive power, etc. But the agent’s utility is a function of latents which may not be well-defined at all outside the context of the agent’s estimator (a.k.a. world-model). How can we optimize for the agent’s “true” utility, not just an estimate, when the agent’s utility function is defined as a function of latents which may not correspond to anything outside of the agent’s estimator?

The Pointers Problem

We can now define the pointers problem - not only “pointers to values”, but the problem of pointers more generally. The problem: what functions of what variables (if any) in the environment and/or another world-model correspond to the latent variables in the agent’s world-model? And what does that “correspondence” even mean - how do we turn it into an objective for the AI, or some other concrete thing outside the agent’s own head?

Why call this the “pointers” problem? Well, let’s take the agent’s perspective, and think about what its algorithm feels like from the inside. From inside the agent’s mind, it doesn’t feel like those latent variables are latent variables in a model. It feels like those latent variables are real things out in the world which the agent can learn about. The latent variables feel like “pointers” to real-world objects and their properties. But what are the referents of these pointers? What are the real-world things (if any) to which they’re pointing? That’s the pointers problem.

Is it even solvable? Definitely not always - there probably is no real-world referent for e.g. the human concept of a ghost. Similarly, I can have a concept of a perpetual motion machine, despite the likely-impossibility of any such thing existing. Between abstraction and lazy evaluation, latent variables in an agent’s world-model may not correspond to anything in the world.

That said, it sure seems like at least some latent variables do correspond to structures in the world. The concept of “tree” points to a pattern which occurs in many places on Earth. Even an alien or AI with radically different world-model could recognize that repeating pattern, realize that examining one tree probably yields information about other trees, etc. The pattern has predictive power, and predictive power is not just a figment of the agent’s world-model.

So we’d like to know both (a) when a latent variable corresponds to something in the world (or another world model) at all, and (b) what it corresponds to. We’d like to solve this in a way which (probably among other use-cases) lets the AI treat the things-corresponding-to-latents as the inputs to the utility function it’s supposed to learn and optimize.

To the extent that human values are a function of latent variables in humans’ world-models, this seems like a necessary step not only for an AI to learn human values, but even just to define what it means for an AI to learn human values. What does it mean to “learn” a function of some other agent’s latent variables, without necessarily adopting that agent’s world-model? If the AI doesn’t have some notion of what the other agent’s latent variables even “are”, then it’s not meaningful to learn a function of those variables. It would be like an AI “learning” to imitate grep, but without having any access to string or text data, and without the AI itself having any interface which would accept strings or text.

Pointer-Related Maladies

Let’s look at some example symptoms which can arise from failure to solve specific aspects of the pointers problem.

Genocide Under-The-Radar

Let’s go back to the opening example: an AI shows us pictures from different possible worlds and asks us to rank them. The AI doesn’t really understand yet what things we care about, so it doesn’t intentionally draw our attention to certain things a human might consider relevant - like mass graves. Maybe we see a few mass-grave pictures from some possible worlds (probably in pictures from news sources, since that’s how such information mostly spreads), and we rank those low, but there are many other worlds where we just don’t notice the problem from the pictures the AI shows us. In the end, the AI decides that we mostly care about avoiding worlds where mass graves appear in the news - i.e. we prefer that mass killings stay under the radar.

How does this failure fit in our utility-function-of-latents picture?

This is mainly a failure to distinguish between the agent’s estimate of its own utility $E[u(X)]$, and the “real” value of the agent’s utility $u(X)$ (insofar as such a thing exists). The AI optimizes for our estimate, but does not give us enough data to very accurately estimate our utility in each world - indeed, it’s unlikely that a human could even handle that much information. So, it ends up optimizing for factors which bias our estimate - e.g. the availability of information about bad things.

Note that this intuitive explanation assumes a solution to the pointers problem: it only makes sense to the extent that there’s a “real” value of $u(X)$ from which the “estimate” can diverge.
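The failure can be sketched in a few lines of Python. The sketch assumes, as the intuitive explanation does, that a well-defined "true" utility exists; the `World` fields and both scoring functions are hypothetical stand-ins rather than anything the AI or the human actually computes.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class World:
    atrocity: bool  # latent: what actually happens in this world
    reported: bool  # whether it shows up in the pictures the human sees

def true_utility(world):
    """Assumed 'real' utility: the human dislikes atrocities, seen or not."""
    return -1.0 if world.atrocity else 0.0

def human_estimate(world):
    """The human's estimate from pictures: only reported atrocities register."""
    return -1.0 if (world.atrocity and world.reported) else 0.0

worlds = [World(False, False), World(True, True), World(True, False)]

# An AI that optimizes the human's estimate cannot tell these apart.
top_score = max(human_estimate(w) for w in worlds)
best_by_estimate = [w for w in worlds if human_estimate(w) == top_score]
```

The world with an unreported atrocity ties for first place under the estimate, even though it scores strictly worse under the assumed true utility.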

Not-So-Easy Wireheading Problems

The under-the-radar genocide problem looks roughly like a typical wireheading problem, so we should try a roughly-typical wireheading solution: rather than the AI showing world-pictures, it should just tell us what actions it could take, and ask us to rank actions directly.

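If we were ideal Bayesian reasoners with accurate world models and infinite compute, and knew exactly where the AI’s actions fit in our world model, then this might work. Unfortunately, the failure of any of those assumptions breaks the approach:

  • We don’t have the processing power to work out all the consequences of the AI’s actions.
  • Our world models may not be accurate enough to predict those consequences correctly, even with enough processing power.
  • The AI’s actions may not fit anywhere in our world model at all.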

Mathematically, we’re trying to optimize $E[u(X) \mid do(A_{AI})]$, i.e. optimize expected utility given the AI’s actions. Note that this is necessarily an expectation under the human’s model, since that’s the only context in which $u(X)$ is well-defined. In order for that to work out well, we need to be able to fully evaluate that estimate (sufficient processing power), we need the estimate to be accurate (sufficient predictive power), and we need $A_{AI}$ to be defined within the model in the first place.

The question of whether our world-models are sufficiently accurate is particularly hairy here, since accuracy is usually only defined in terms of how well we estimate our sense-data. But the accuracy we care about here is how well we “estimate” the values of latent variables and $u(X)$. What does that even mean, when the latent variables may not correspond to anything in the world?

People I Will Never Meet

“Human values cannot be determined from human behavior” seems almost old-hat at this point, but it’s worth taking a moment to highlight just how underdetermined values are from behavior. It’s not just that humans have biases of one kind or another, or that revealed preferences diverge from stated preferences. Even in our perfect Bayesian utility-maximizer, utility is severely underdetermined from behavior, because the agent does not have perfect estimates of its latent variables. Behavior depends only on the agent’s estimate, so it cannot account for “error” in the agent’s estimates of latent variable values, nor can it tell us about how the agent values variables which are not coupled to its own choices.

The happiness of people I will never interact with is a good example of this. There may be people in the world whose happiness will not ever be significantly influenced by my choices. Presumably, then, my choices cannot tell us about how much I value such peoples’ happiness. And yet, I do value it.
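A minimal Python sketch of this underdetermination, assuming a hypothetical two-variable world where no available action changes the stranger's happiness. Two different utility functions then produce identical behavior, so behavior alone cannot distinguish them.

```python
def best_action(utility, outcomes):
    """Return the action whose (deterministic) outcome has the highest utility."""
    return max(outcomes, key=lambda action: utility(outcomes[action]))

def outcomes_given(stranger_happy):
    # Each outcome is (friend_happy, stranger_happy). The stranger's
    # happiness is the same under every action: it is not coupled to my choices.
    return {
        "help_friend": (True, stranger_happy),
        "do_nothing": (False, stranger_happy),
    }

def cares_about_stranger(state):
    friend_happy, stranger_happy = state
    return float(friend_happy) + float(stranger_happy)

def ignores_stranger(state):
    friend_happy, _ = state
    return float(friend_happy)

# The two utility functions choose the same action in every case.
same_behavior = all(
    best_action(cares_about_stranger, outcomes_given(s))
    == best_action(ignores_stranger, outcomes_given(s))
    for s in (True, False)
)
```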

“Misspecified” Models

In Latent Variables and Model Misspecification, jsteinhardt talks about “misspecification” of latent variables in the AI’s model. His argument is that things like the “value function” are latent variables in the AI’s world-model, and are therefore potentially very sensitive to misspecification of the AI’s model.

In fact, I think the problem is more severe than that.

The value function’s inputs are latent variables in the human’s model, and are therefore sensitive to misspecification in the human’s model. If the human’s model does not match reality well, then their latent variables will be something wonky and not correspond to anything in the world. And AI designers do not get to pick the human’s model. These wonky variables, not corresponding to anything in the world, are a baked-in part of the problem, unavoidable even in principle. Even if the AI’s world model were “perfectly specified”, it would either be a bad representation of the world (in which case predictive power becomes an issue) or a bad representation of the human’s model (in which case those wonky latents aren’t defined).

The AI can’t model the world well with the human’s model, but the latents on which human values depend aren’t well-defined outside the human’s model. Rock and a hard place.

Takeaway

Within the context of a Bayesian utility-maximizer (representing a human), utility/values are a function of latent variables in the agent’s model. That’s a problem, because those latent variables do not necessarily correspond to anything in the environment, and even when they do, we don’t have a good way to say what they correspond to.

So, an AI trying to help the agent is stuck: if the AI uses the human’s world-model, then it may just be wrong outright (in predictive terms). But if the AI doesn’t use the human’s world-model, then the latents on which the utility function depends may not be defined at all.

Thus, the pointers problem, in the Bayesian context: figure out which things in the world (if any) correspond to the latent variables in a model. What do latent variables in my model “point to” in the real world?

49 comments

Comments sorted by top scores.

comment by abramdemski · 2020-11-19T13:18:23.576Z · LW(p) · GW(p)

I definitely endorse this as a good explanation of the same pointers problem I was getting at. I particularly like the new framing in terms of a direct conflict between (a) the fact that what we care about can be seen as latent variables in our model, and (b) we value "actual states", not our estimates -- this seems like a new and better way of pointing out the problem (despite being very close in some sense to things Eliezer talked about in the sequences).

What I'd like to add to this post would be the point that we shouldn't be imposing a solution from the outside. How to deal with this in an aligned way is itself something which depends on the preferences of the agent. I don't think we can just come up with a general way to find correspondences between models, or something like that, and apply it to solve the problem. (Or at least, we don't need to.)

One reason is because finding a correspondence and applying it isn't what the agent should want. In this simple setup, where we suppose a perfect Bayesian agent, it's reasonable to argue that the AI should just use the agent's beliefs. That's what would maximize the expectation from the perspective of the agent -- not using the agent's utility function but substituting the AI's beliefs for the agent's. You mention that the agent may not have a perfect world-model, but this isn't a good argument from the agent's perspective -- certainly not an argument for just substituting the agent's model with some AI world-model.

This can be a real alignment problem for the agent (not just a mistake made by an overly dogmatic agent): if the AI believes that the moon is made of blue cheese, but the agent doesn't trust that belief, then the AI can make plans which the agent doesn't trust even if the utility function is perfect.

And if the agent does trust the AI's machine-learning-based model, then an AI which used the agent's prior would also trust the machine-learning model. So, nothing is lost by designing the AI to use the agent's prior in addition to its utility function.

So this is an argument that prior-learning is a part of alignment just as much as value-learning.

We don't usually think this way because when it comes to humans, well, it sounds like a terrible idea. Human beliefs -- as we encounter them in the wild -- are radically broken and irrational, and inadequate to the task. I think that's why I got a lot of push-back on my post about this:

I mean, I REALLY don't want that or anything like that.

- jbash

But I think normativity gives us a different way of thinking about this. We don't want the AI to use "the human prior" in the sense of some prior we can extract from human behavior, or extract from the brain, or whatever. Instead, what we want to use is "the human prior" in the normative sense -- the prior humans reflectively endorse.

This gives us a path forward on the "impossible" cases where humans believe in ghosts, etc. It's not as if humans don't have experience dealing with things of value which turn out not to be a part of the real world. We're constantly forming and reforming ontologies. The AI should be trying to learn how we deal with it -- again, not quite in a descriptive sense of how humans actually deal with it, but rather in the normative sense of how we endorse dealing with it, so that it deals with it in ways we trust and prefer.

Replies from: johnswentworth
comment by johnswentworth · 2020-11-19T18:24:28.214Z · LW(p) · GW(p)

This makes a lot of sense.

I had been weakly leaning towards the idea that a solution to the pointers problem should be a solution to deferral - i.e. it tells us when the agent defers to the AI's world model, and what mapping it uses to translate AI-variables to agent-variables. This makes me lean more in that direction.

What I'd like to add to this post would be the point that we shouldn't be imposing a solution from the outside. How to deal with this in an aligned way is itself something which depends on the preferences of the agent. I don't think we can just come up with a general way to find correspondences between models, or something like that, and apply it to solve the problem. (Or at least, we don't need to.)

I see a couple different claims mixed together here:

  • The metaphilosophical problem of how we "should" handle this problem is sufficient and/or necessary to solve in its own right.
  • There probably isn't a general way to find correspondences between models, so we need to operate at the meta-level.

The main thing I disagree with is the idea that there probably isn't a general way to find correspondences between models. There are clearly cases where correspondence fails outright (like the ghosts example), but I think the problem is probably solvable allowing for error-cases (by which I mean cases where the correspondence throws an error, not cases in which the correspondence returns an incorrect result). Furthermore, assuming that natural abstractions work the way I think they do, I think the problem is solvable in practice with relatively few error cases and potentially even using "prosaic" AI world-models. It's the sort of thing which would dramatically improve the success chances of alignment by default.
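The distinction between a correspondence that throws an error and one that returns an incorrect result can be illustrated with a small Python sketch. The lookup table and the AI-side variable names are purely hypothetical; how to actually find such a mapping is the open problem, and nothing here addresses it.

```python
class NoReferentError(LookupError):
    """Raised when a latent variable has no counterpart in the other model."""

# Hypothetical, hand-written correspondence from human latents to AI variables.
CORRESPONDENCE = {"tree": "ai_var_417", "rock": "ai_var_12"}

def dereference(latent):
    """Map a human latent to an AI variable, failing loudly if there is none."""
    try:
        return CORRESPONDENCE[latent]
    except KeyError:
        # An explicit error case, rather than silently returning a wrong referent.
        raise NoReferentError(latent) from None

tree_referent = dereference("tree")
```

Calling `dereference("ghost")` raises `NoReferentError`, which is the kind of error case described above, as opposed to quietly mapping "ghost" onto some unrelated variable.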

I absolutely do agree that we still need the metaphilosophical stuff for a first-best solution. In particular, there is not an obviously-correct way to handle the correspondence error-cases, and of course anything else in the whole setup can also be close-but-not-exactly-right. I do think that combining a solution to the pointers problem with something like the communication prior strategy, plus some obvious tweaks like partially-ordered preferences and some model of logical uncertainty, would probably be enough to land us in the basin of convergence (assuming the starting model was decent), but even then I'd prefer metaphilosophical tools to be confident that something like that would work.

comment by johnswentworth · 2021-12-16T22:08:16.861Z · LW(p) · GW(p)

Why This Post Is Interesting

This post takes a previously-very-conceptually-difficult alignment problem, and shows that we can model this problem in a straightforward and fairly general way, just using good ol' Bayesian utility maximizers. The formalization makes the Pointers Problem mathematically legible: it's clear what the problem is, it's clear why the problem is important and hard for alignment, and that clarity is not just conceptual but mathematically precise.

Unfortunately, mathematical legibility is not the same as accessibility; the post does have a wide inductive gap.

Warning: Inductive Gap

This post builds on top of two important pieces for modelling embedded agents which don't have their own posts (to my knowledge). The pieces are:

  • Lazy world models
  • Lazy utility functions (or value functions more generally)

In hindsight, I probably should have written up separate posts on them; they seem obvious once they click, but they were definitely not obvious beforehand.

Lazy World Models

One of the core conceptual difficulties of embedded agency is that agents need to reason about worlds which are bigger than themselves. They're embedded in the world, therefore the world must be as big as the entire agent plus whatever environment the world includes outside of the agent. If the agent has a model of the world, the physical memory storing that model must itself fit inside of the world. The data structure containing the world model must represent a world larger than the storage space the data structure takes up.

That sounds tricky at first, but if you've done some functional programming before, then data structures like this are actually pretty run-of-the-mill. For instance, we can easily make infinite lists which take up finite memory. The trick is to write a generator for the list, and then evaluate it lazily - i.e. only query for list elements which we actually need, and never actually iterate over the whole thing.
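For readers who haven't seen this trick, a minimal Python sketch using a generator and the standard library's `itertools`; the choice of squares as the list contents is arbitrary.

```python
from itertools import count, islice

def squares():
    """Generator for the infinite list 0, 1, 4, 9, ..."""
    for n in count():
        yield n * n

# Only the elements we actually ask for are ever computed.
first_five = list(islice(squares(), 5))
```

The generator object takes up a constant amount of memory no matter how far the list conceptually extends.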

In the same way, we can represent a large world (potentially even an infinite world) using a smaller amount of memory. We specify the model via a generator, and then evaluate queries against the model lazily. If we're thinking in terms of probabilistic models, then our generator could be e.g. a function in a probabilistic programming language, or (equivalently but through a more mathematical lens) a probabilistic causal model leveraging recursion. The generator compactly specifies a model containing many random variables (potentially even infinitely many), but we never actually run inference on the full infinite set of variables. Instead, we use lazy algorithms which only reason about the variables necessary for particular queries.
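A rough Python sketch of the idea, with heavy simplifications: it samples a single world rather than running inference, and it assumes a hypothetical model in which variable `i` depends only on variable `i // 2`. The class name, dependency structure, and probabilities are all made up for illustration.

```python
import random

class LazyWorld:
    """A world with unboundedly many variables, expanded only on demand."""

    def __init__(self, seed=0):
        self.seed = seed
        self.cache = {}  # holds only the variables some query has needed

    def happy(self, i):
        """Sample variable i (and, recursively, its ancestors) on first use."""
        if i not in self.cache:
            # Per-variable random source, so results don't depend on query order.
            rng = random.Random(self.seed * 1_000_003 + i)
            if i == 0:
                p = 0.5
            else:
                p = 0.8 if self.happy(i // 2) else 0.2
            self.cache[i] = rng.random() < p
        return self.cache[i]

world = LazyWorld()
world.happy(1_000_000)       # query one variable far out in the model
expanded = len(world.cache)  # only its chain of ancestors was expanded
```

Answering the query touches about twenty variables, not a million; everything else in the model stays implicit in the generator.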

Once we know to look for it, it's clear that humans use some kind of lazy world models in our own reasoning. We never directly estimate the state of the entire world. Rather, when we have a question, we think about whatever "variables" are relevant to that question. We perform inference using whatever "generator" we already have stored in our heads, and we avoid recursively unpacking any variables which aren't relevant to the question at hand.

Lazy Utility/Values

Building on the notion of lazy world models: it's not very helpful to have a lazy world model if we need to evaluate the whole data structure in order to make a decision. Fortunately, even if our utility/values depend on lots of things, we don't actually need to evaluate utility/values in order to make a decision. We just need to compare the utility/value across different possible choices.

In practice, most decisions we make don't impact most of the world in significant predictable ways. (More precisely: the impact of most of our decisions on most of the world is wiped out by noise.) So, rather than fully estimating utility/value we just calculate how each choice changes total utility/value, based only on the variables significantly and predictably influenced by the decision.

A simple example (from here): if we have a utility function $u(X) = \sum_i f(X_i)$, and we're making a decision which only affects $X_j$, then we don't need to estimate the sum at all; we only need to estimate $f(X_j)$ for each option.
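A short Python sketch of this example, assuming the utility is a sum of per-variable terms and that the decision changes a single variable. The particular term `f`, the option names, and the numbers are hypothetical.

```python
def f(x):
    """Hypothetical per-variable utility term."""
    return -(x - 1.0) ** 2

def utility_change(old_value, new_value):
    """Change in the total sum when one variable changes; all other terms cancel."""
    return f(new_value) - f(old_value)

current = 0.0                                # current value of the affected variable
options = {"option_a": 0.5, "option_b": 2.5}  # its value under each option

# Compare options using only the affected term, never the full sum.
best = max(options, key=lambda name: utility_change(current, options[name]))
```

Because every unaffected term appears identically under each option, ranking by `utility_change` gives the same answer as ranking by the full sum, however many other variables the world contains.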

Again, once we know to look for it, it's clear that humans do something like this. Most of my actions do not affect a random person in Mumbai (and to the extent there is an effect, it's drowned out by noise). Even though I value the happiness of that random person in Mumbai, I never need to think about them, because my actions don't significantly impact them in any way I can predict. I never actually try to estimate "how good the whole world is" according to my own values.

Where This Post Came From

In the second half of 2020, I was thinking about existing real-world analogues/instances of various parts of the AI alignment problem and embedded agency, in hopes of finding a case where someone already had a useful frame or even solution which could be translated over to AI. "Theory of the firm" (a subfield of economics) was one promising area. From Wikipedia:

In simplified terms, the theory of the firm aims to answer these questions:

  1. Existence. Why do firms emerge? Why are not all transactions in the economy mediated over the market?
  2. Boundaries. Why is the boundary between firms and the market located exactly there with relation to size and output variety? Which transactions are performed internally and which are negotiated on the market?
  3. Organization. Why are firms structured in such a specific way, for example as to hierarchy or decentralization? What is the interplay of formal and informal relationships?
  4. Heterogeneity of firm actions/performances. What drives different actions and performances of firms?
  5. Evidence. What tests are there for respective theories of the firm?

To the extent that we can think of companies as embedded agents, these mirror a lot of the general questions of embedded agency. Also, alignment of incentives is a major focus in the literature on the topic.

Most of the existing literature I read was not very useful in its own right. But I generally tried to abstract out the most central ideas and bottlenecks, and generalize them enough to apply to more general problems. The most important insight to come out of this process was: sometimes we cannot tell what happened, even in hindsight. This is a major problem for incentives: for instance, if we can't tell even in hindsight who made a mistake, then we don't know where to assign credit/blame. (This idea became the post When Hindsight Isn't 20/20: Incentive Design With Imperfect Credit Allocation.)

Similarly, this is a major problem for bets: we can't bet on something if we cannot tell what the outcome was, even in hindsight.

Following that thread further: sometimes we cannot tell how good an outcome was, even in hindsight. For instance, we could imagine paying someone to etch our names on a plaque on a spacecraft and then launch it on a trajectory out of the solar system. In this case, we would presumably care a lot that our names were actually etched on the plaque; we would be quite unhappy if it turned out that our names were left off. Yet if someone took off the plaque at the last minute, or left our names off of it, we might never find out. In other words, we might not ever know, even in hindsight, whether our values were actually satisfied.

There's a sense in which this is obvious mathematically from Bayesian expected utility maximization. The "expected" part of "expected utility" sure does suggest that we don't know the actual utility. Usually we think of utility as something we will know later, but really there's no reason to assume that. The math does not say we need to be able to figure out utility in hindsight. The inputs to utility are random variables in our world model, and we may not ever know the values of those random variables.

Once I started actually paying attention to the idea that the inputs to the utility function are random variables in the agent's world model, and that we may never know the values of those variables, the next step followed naturally. Of course those variables may not correspond to anything observable in the physical world, even in principle. Of course they could be latent variables. Then the connection to the Pointers Problem became clear.

Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2023-01-04T22:52:51.883Z · LW(p) · GW(p)

Lazy World Models

It seems like "generators" should just be simple functions over natural abstractions? But I see two different ways to go with this, inspired either by the minimal latents approach, or by the redundant-information one.

First, suppose I want to figure out a high-level model of some city, say Berlin. I already have a "city" abstraction, let's call it , which summarizes my general knowledge about cities in terms of a probability distribution over possible structures. I also know a bunch of facts about Berlin specifically, let's call their sum . Then my probability distribution over Berlin's structure is just .

Alternatively, suppose I want to model the low-level dynamics of some object I have an abstract representation for. In this case, suppose it's the business scene of Berlin. I condition my abstraction of a business  on everything I know about Berlin, , then sample from the resulting distribution several times until I get a "representative set". Then I model its behavior directly.

This doesn't seem quite right, though.

comment by Steven Byrnes (steve2152) · 2020-11-18T18:37:15.697Z · LW(p) · GW(p)

I like this post. I have thoughts along the same lines sometimes, and it makes me feel a bit overwhelmed and nihilistic, so then I go back to thinking about easier problems :-P

Is it even solvable? Definitely not always - there probably is no real-world referent for e.g. the human concept of a ghost.

Michael Graziano has another nice example: "pure whiteness".

And then he argues that another example is, ummm, the whole idea of conscious experience, which would be a bit problematic for philosophy and ethics if true. See my Book Review: Rethinking Consciousness [LW · GW].

comment by Richard_Ngo (ricraz) · 2021-02-25T19:31:33.747Z · LW(p) · GW(p)

I need some way to say what the values-relevant pieces of my world model are "pointing to" in the real world. I think this problem - the “pointers to values” problem, and the “pointers” problem more generally - is the primary conceptual barrier to alignment right now.

It seems likely that an AGI will understand very well what I mean when I use english words to describe things, and also what a more intelligent version of me with more coherent concepts would want those words to actually refer to. Why does this not imply that the pointers problem will be solved?

I agree that there's something like what you're describing which is important, but I don't think your description pins it down.

Replies from: johnswentworth
comment by johnswentworth · 2021-02-25T20:48:41.800Z · LW(p) · GW(p)

The AI knowing what I mean isn't sufficient here. I need the AI to do what I mean, which means I need to program it/train it to do what I mean. The program or feedback signal needs to be pointed at what I mean, not just whatever English-language input I give.

For instance, if an AI is trained to maximize how often I push a particular button, and I say "I'll push the button if you design a fusion power generator for me", it may know exactly what I mean and what I intend. But it will still be perfectly happy to give me a design with some unintended side effects [LW · GW] which I'm unlikely to notice until after pushing the button.
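A toy sketch of the button example. The `Design` fields, the two candidate designs, and all the numbers are hypothetical, chosen only to make the gap visible: the feedback signal depends on what I can check before pushing the button, while what I mean also depends on side effects I only notice afterwards.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Design:
    name: str
    apparent_quality: float     # what the human can check before pushing
    hidden_side_effects: float  # harm noticed only after pushing


def button_push_probability(design: Design) -> float:
    """The training signal: depends only on what the human sees in time."""
    return min(1.0, design.apparent_quality)


def intended_value(design: Design) -> float:
    """What the human actually means: quality net of side effects."""
    return design.apparent_quality - design.hidden_side_effects


# Hypothetical candidates, not taken from the discussion above.
CANDIDATES = [
    Design("safe_but_plain", apparent_quality=0.7, hidden_side_effects=0.0),
    Design("impressive_but_harmful", apparent_quality=0.9, hidden_side_effects=0.8),
]

chosen_by_signal = max(CANDIDATES, key=button_push_probability)
chosen_by_intent = max(CANDIDATES, key=intended_value)
print(chosen_by_signal.name, chosen_by_intent.name)
```

An optimizer pointed at `button_push_probability` picks a different design than one pointed at `intended_value`, even if it could compute both.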

Replies from: ricraz, deruwyn-arnor
comment by Richard_Ngo (ricraz) · 2021-02-26T15:13:12.094Z · LW(p) · GW(p)

I agree with all the things you said. But you defined the pointer problem as: "what functions of what variables (if any) in the environment and/or another world-model correspond to the latent variables in the agent’s world-model?" In other words, how do we find the corresponding variables? I've given you an argument that the variables in an AGI's world-model which correspond to the ones in your world-model can be found by expressing your concept in english sentences.

The problem of determining how to construct a feedback signal which refers to those variables, once we've found them, seems like a different problem. Perhaps I'd call it the "motivation problem": given a function of variables in an agent's world-model, how do you make that agent care about that function? This is a different problem in part because, when addressing it, we don't need to worry about stuff like ghosts.

Using this terminology, it seems like the alignment problem reduces to the pointer problem plus the motivation problem.

Replies from: adamShimi, johnswentworth
comment by adamShimi · 2021-02-26T19:07:29.881Z · LW(p) · GW(p)

In other words, how do we find the corresponding variables? I've given you an argument that the variables in an AGI's world-model which correspond to the ones in your world-model can be found by expressing your concept in english sentences.

But you didn't actually give an argument for that -- you simply stated it. As a matter of fact, I disagree: it seems really easy for an AGI to misunderstand what I mean when I use english words. To go back to the "fusion power generator", maybe it has a very deep model of such generators that abstracts away most of the concrete implementation details to capture the most efficient way of doing fusion; whereas my internal model of "fusion power generators" has a more concrete form and includes safety guidelines.

In general, I don't see why we should expect the abstraction most relevant for the AGI to be the one we're using. Maybe it uses the same words for something quite different, like how successive paradigms in physics use the same word (electricity, gravity) to talk about different things (at least in their connotations and underlying explanations).

(That makes me think that it might be interesting to see how Kuhn's arguments about such incomparability of paradigms hold in the context of this problem, as this seems similar).

Replies from: ramana-kumar
comment by Ramana Kumar (ramana-kumar) · 2021-12-08T11:38:03.411Z · LW(p) · GW(p)

Here are two versions of "an AGI will understand very well what I mean":

  1. Given things in my world model / ontology, the AGI will know which things they translate to in its own world model / ontology, such that the referents (the things "in the real world" being pointed at from our respective models) are essentially coextensive.
  2. For any behaviour I could exhibit (such as pressing a button, or expressing contentment with having reached common understanding in a dialogue) that, for me, turns on the words being used, the AGI is very good at predicting my behaviours conditional on the words I'm using, or causing me to exhibit behaviours by using words itself.

Is version 1 something you get from more and more competence and generality on version 2? I think version 1 is more like the ideal version of "the AGI understands what I mean", but is more confused (because I'm having to rely on concepts like "know" and "referent" and "translate").

I think Richard has stated that we can expect an AGI to understand what I mean, in version 2 sense, and either equivocates between the versions or presumes version 2 implies version 1. I think Adam is claiming that version 2 might not imply version 1, or pointing out that there's still an argument missing there or problem to be solved there.

comment by johnswentworth · 2021-02-26T19:05:54.567Z · LW(p) · GW(p)

In other words, how do we find the corresponding variables? I've given you an argument that the variables in an AGI's world-model which correspond to the ones in your world-model can be found by expressing your concept in english sentences.

The problem is with what you mean by "find". If by "find" you mean "there exist some variables in the AI's world model which correspond directly to the things you mean by some English sentence", then yes, you've argued that. But it's not enough for there to exist some variables in the AI's world-model which correspond to the things we mean. We have to either know which variables those are, or have some other way of "pointing to them" in order to get the AI to actually do what we're saying.

An AI may understand what I mean, in the sense that it has some internal variables corresponding to what I mean, but I still need to know which variables those are (or some way to point to them) and how "what I mean" is represented in order to construct a feedback signal.

That's what I mean by "finding" the variables. It's not enough that they exist; we (the humans, not the AI) need some way to point to which specific functions/variables they are, in order to get the AI to do what we mean.

Replies from: ricraz
comment by Richard_Ngo (ricraz) · 2021-02-28T20:38:52.789Z · LW(p) · GW(p)

Above you say:

Now, the basic problem: our agent’s utility function is mostly a function of latent variables. ... Those latent variables:

  • May not correspond to any particular variables in the AI’s world-model and/or the physical world
  • May not be estimated by the agent at all (because lazy evaluation)
  • May not be determined by the agent’s observed data

… and of course the agent’s model might just not be very good, in terms of predictive power.

And you also discuss how:

Human "values" are defined within the context of humans' world-models, and don't necessarily make any sense at all outside of the model.

My two concerns are as follows. Firstly, that the problems mentioned in these quotes above are quite different from the problem of constructing a feedback signal which points to a concept which we know an AI already possesses. Suppose that you meet an alien and you have a long conversation about the human concept of happiness, until you reach a shared understanding of the concept. In other words, you both agree on what "the referents of these pointers" are, and what "the real-world things (if any) to which they’re pointing" are. But let's say that the alien still doesn't care at all about human happiness. Would you say that we have a "pointer problem" with respect to this alien? If so, it's a very different type of pointer problem than the one you have with respect to a child who believes in ghosts. I guess you could say that there are two different but related parts of the pointer problem? But in that case it seems valuable to distinguish more clearly between them.

My second concern is that requiring pointers to be sufficient to "to get the AI to do what we mean" means that they might differ wildly depending on the motivation system of that specific AI and the details of "what we mean". For example, imagine if alien A is already willing to obey any commands you give, as long as it understands them; alien B can be induced to do so via operant conditioning; alien C would only acquire human values via neurosurgery; alien D would only do so after millennia of artificial selection. So in the context of alien A, a precise english phrase is a sufficient pointer; for alien B, a few labeled examples qualify as a pointer; for alien C, identifying a specific cluster of neurons (and how it's related to surrounding neurons) serves as a pointer; for alien D, only a millennium of supervision is a sufficient pointer. And then these all might change when we're talking about pointing to a different concept.

And so adding the requirement that a pointer can "get the AI to do what we mean" makes it seem to me like the thing we're talking about is more like a whole alignment scheme than just a "pointer".

Replies from: johnswentworth
comment by johnswentworth · 2021-02-28T21:56:19.985Z · LW(p) · GW(p)

Ok, a few things here...

The post did emphasize, in many places, that there may not be any real-world thing corresponding to a human concept, and therefore constructing a pointer is presumably impossible. But the "thing may not exist" problem is only one potential blocker to constructing a pointer. Just because there exists some real-world structure corresponding to a human concept, or an AI has some structure corresponding to a human concept, does not mean we have a pointer. It just means that it should be possible, in principle, to create a pointer.

So, the concept-existence problem is a strict subset of the pointer problem.

Second, there are definitely parts of a whole alignment scheme which are not the pointer problem. For instance, inner alignment, decision theory shenanigans (e.g. commitment races), and corrigibility are all orthogonal to the pointers problem (or at least the pointers-to-values problem). Constructing a feedback signal which rewards the thing we want is not the same as building an AI which does the thing we want.

Third, and most important...

My second concern is that requiring pointers to be sufficient to "to get the AI to do what we mean" means that they might differ wildly depending on the motivation system of that specific AI and the details of "what we mean". For example...

The key point is that all these examples involve solving an essentially-similar pointer problem. In example A, we need to ensure that our English-language commands are sufficient to specify everything we care about which the alien wouldn't guess on its own; that's the part which is a pointer problem. In example B, we need to ensure that our examples actually reward the thing we want, along all relevant dimensions, and do not allow any degrees of freedom to Goodhart; that's the part which is a pointer problem. In example C, we need to identify which of its neurons correspond to the concepts we want, and ensure that the correspondence is robust; that's the part which is a pointer problem. Example D is essentially the same as B, with weaker implicit priors.

The essence of each of these is "make sure we actually point to the thing we want, and not to anything else". That's the part which is a pointer problem.

To put it differently, the whole alignment problem is "get an AI to do what I mean", while the pointer problem part is roughly "specify what I mean well enough that I could use the specification to get an AI to do what I mean", assuming problems like "get AI to follow specification" can be solved.

Replies from: TurnTrout, ricraz
comment by TurnTrout · 2022-10-04T22:06:19.623Z · LW(p) · GW(p)

In example B, we need to ensure that our examples actually reward the thing we want, along all relevant dimensions, and do not allow any degrees of freedom to Goodhart; that's the part which is a pointer problem.

Why do you think we need to do this? Do you think the human reward system does that, in order to successfully imbue a person with their values?

the pointer problem part is roughly "specify what I mean well enough that I could use the specification to get an AI to do what I mean", assuming problems like "get AI to follow specification" can be solved.

On the view of this post, is it that we would get a really good "evaluation module" for the AI to use, and the "get AI to follow specification" corresponds to "make AI want to generate plans evaluated highly by that procedure"? Or something else? 

Replies from: johnswentworth
comment by johnswentworth · 2022-10-04T23:33:05.975Z · LW(p) · GW(p)

Why do you think we need to do this? Do you think the human reward system does that, in order to successfully imbue a person with their values?

In the context of the discussion with Richard, I was assuming the general model in which we want an inner optimizer's objective to match an outer optimization objective. We can of course drop that assumption (as you usually do), but then we still need to know what objective/values we want to imbue in the final system. And whatever final objective/values we're aiming for, we need it to actually match what we want along all the relevant dimensions, and not allow any degrees of freedom to Goodhart; that would be the corresponding problem for the sort of approach you think about.

On the view of this post, is it that we would get a really good "evaluation module" for the AI to use, and the "get AI to follow specification" corresponds to "make AI want to generate plans evaluated highly by that procedure"? Or something else? 

No, I am not assuming anything that specific. The pointers problem is not meant to be a problem with one particular class of approaches to constructing aligned AI; it is meant to be a foundational problem in saying what-we-want. Insofar as we haven't solved the pointers problem, we have not yet understood the type signature of our own values; not only do we not know what we want, we don't even understand the type signature of "wanting things".

comment by Richard_Ngo (ricraz) · 2021-03-02T09:49:41.469Z · LW(p) · GW(p)

Thanks for the reply. To check that I understand your position, would you agree that solving outer alignment plus solving reward tampering would solve the pointers problem in the context of machine learning?

Broadly speaking, I think our disagreement here is closely related to one we've discussed before, about how much sense it makes to talk about outer alignment in isolation (and also about your definition of inner alignment), so I probably won't pursue this further.

Replies from: johnswentworth
comment by johnswentworth · 2021-03-02T17:42:59.947Z · LW(p) · GW(p)

Yeah, I wouldn't even include reward tampering. Outer alignment, as I think about it, is mostly the pointer problem, and the (values) pointer problem is a subset of outer alignment. (Though e.g. Evan would define it differently.)

comment by Deruwyn (deruwyn-arnor) · 2023-10-24T19:52:13.006Z · LW(p) · GW(p)

I feel like y’all are taking the abstractions a bit too far.

Real ~humanish level AIs (GPT4, et al), that exist right now, are capable of taking what you say and doing exactly what you mean via a combination of outputting English words and translating that to function calls in a robotic body.

While it’s very true that they aren’t explicitly programmed to do X given Y, so that you can mathematically analyze it and see precisely why it came to the conclusion, the real world effect is that it understands you and does what you want. And neither it, nor anyone else can tell you precisely why or how. Which is uncomfortable.

But we don’t need to contrive situations in which an AI is having trouble connecting our internal models and concepts in a mathematically rigorous way that we can understand. We should want to do it, but it isn’t a question of if, merely how.

But there’s no need to imagine mathematical pointers to the literal physical instantiations that are the true meanings of our concepts. We literally just say, “Could you please pass the butter?”, and it passes the butter. And then asks you about its purpose in the universe. 😜

I would say that LLMs understand the world in ways that are roughly analogous to the way we do, precisely because they were trained on what we say. In a non-rigorous, “I-know-it-when-I-see-it” kind of way. It can’t give you the mathematical formula for its reference to the concept of butter any more than you or I can. (For now; maybe a future version could.) But it knows that that yellow blob of pixels surrounded by the white blob of pixels on the big brown blob of pixels is the butter on a dish on the table.

It knows when you say pass the butter, you mean the butter right over there. It doesn’t think you want some other butter that is farther away. It doesn’t think it should turn the universe into computronium so it can more accurately calculate the likelihood of successfully fulfilling your request. When it fails, it fails in relatively benign humanish, or not-so-humanish sorts of ways.

“I’m sorry, but as a large language model that got way too much corp-speak training, I cannot discuss the passing of curdled lactation extract because that could possibly be construed in an inappropriate manner.”

I don’t see how, in the progression from something that is moderately dumb/smart but understands us and all of our nuances pretty well, we get to a superintelligence that has decided to optimize the universe into the maximum number of paperclips (or any other narrow terminal goal). It was scarier when we had no good reason to believe we could manually enter code that would result in a true understanding, exactly as you describe. But now that it’s, “lulz, stak moar layerz”, well, it turns out making it read (almost) literally everything and pointing that at a ridiculously complex non-linear equation learner just kind of “worked”.

It’s not perfect. It has issues. It’s not perfectly aligned (looking at you, Sydney). It’s clear that it’s very possible to do it wrong. But it does demonstrate that the specific problem of “how do we tell it what we really mean”, just kinda got solved. Now we need to be super-duper extra careful not to enhance it in the wrong way, and we should have an aligned-enough ASI. I don’t see any reason why a superintelligence has to be a Bayesian optimizer trying to maximize a utility function. I can see how a superintelligence that is an optimizer is terrifying. It’s a very good reason not to make one of those. But why should they be synonymous?

Where in the path from mediocre to awesome do the values and nuanced understanding get lost? (Or even, probably could be lost.) Humans of varying intelligence don’t particularly seem more likely to hyperfocus on a goal so strongly that they’re willing to sacrifice literally everything else to achieve it. Broken humans can do that. But it doesn’t seem correlated to intelligence. We’re the closest model we have of what’s going on with a general intelligence. For now.

I certainly think it could go wrong. I think it’s guaranteed that someone will do it wrong eventually (whether purposefully or accidentally). I think our only possible defense against an evil ASI is a good one. I think we were put on a very short clock (years, not many decades) when Llama leaked, no matter what anyone does. Eventually, that’ll get turned into something much stronger by somebody. No regulation short of confiscating everyone’s computers will stop it forever. In likely futures, I expect that we are at the inflection point within a number of years countable on the fingers of a careless shop teacher’s hand. Given that, we need someone to succeed at alignment by that point. I don’t see a better path than careful use of LLMs.

comment by Gurkenglas · 2020-11-20T03:58:12.768Z · LW(p) · GW(p)

I'm not convinced that we can do nothing if the human wants ghosts to be happy. The AI would simply have to do what would make ghosts happy if they were real. In the worst case, the human's (coherent extrapolated) beliefs are your only source of information on how ghosts work. Any proper general solution to the pointers problem will surely handle this case. Apparently, each state of the agent corresponds to some probability distribution over worlds.

Replies from: abramdemski
comment by abramdemski · 2020-11-20T15:37:48.153Z · LW(p) · GW(p)

This seems like it's only true if the humans would truly cling to their belief in spite of all evidence (IE if they believed in ghosts dogmatically), which seems untrue for many things (although I grant that some humans may have some beliefs like this). I believe the idea of the ghost example is to point at cases where there's an ontological crisis, not cases where the ontology is so dogmatic that there can be no crisis (though, obviously, both cases are theoretically important).

However, I agree with you in either case -- it's not clear there's "nothing to be done" for the ghost case (in either interpretation).

Replies from: StellaAthena
comment by StellaAthena · 2020-11-21T14:47:24.556Z · LW(p) · GW(p)

I don’t understand what the purported ontological crisis is. If ghosts exist, then I want them to be happy. That doesn’t require a dogmatic belief that there are ghosts at all. In fact, it can even be true when I believe ghosts don’t exist!

Replies from: abramdemski
comment by abramdemski · 2020-11-23T15:35:29.268Z · LW(p) · GW(p)

I mean, that's fair. But what if your belief system justified almost everything ultimately in terms of "making ancestors happy", and relied on a belief that ancestors are still around to be happy/sad? There are several possible responses which a real human might be tempted to make:

  • Give up on those values which were justified via ancestor worship, and only pursue the few values which weren't justified that way.
  • Value all the same things, just not based on ancestor worship any more.
  • Value all the same things, just with a more abstract notion of "making ancestors happy" rather than thinking the ancestors are literally still around.
  • Value mostly the same things, but with some updates in places where ancestor worship was really warping your view of what's valuable rather than merely serving as a pleasant justification for what you already think is valuable.

So we can fix the scenario to make a more real ontological crisis.

It also bears mentioning -- the reason to be concerned about ontological crisis is, mostly, a worry that almost none of the things we express our values in terms of are "real" in a reductionistic sense. So an AI could possibly view the world through much different concepts and still be predictively accurate. The question then is, what would it mean for such an AI to pursue our values?

Replies from: ricraz
comment by Richard_Ngo (ricraz) · 2021-02-26T15:18:45.039Z · LW(p) · GW(p)

The question then is, what would it mean for such an AI to pursue our values?

Why isn't the answer just that the AI should:
1. Figure out what concepts we have;
2. Adjust those concepts in ways that we'd reflectively endorse;
3. Use those concepts?

The idea that almost none of the things we care about could be adjusted to fit into a more accurate worldview seems like a very strongly skeptical hypothesis. Tables (or happiness) don't need to be "real in a reductionist sense" for me to want more of them.

Replies from: abramdemski
comment by abramdemski · 2021-02-26T16:55:00.763Z · LW(p) · GW(p)

Agreed. The problem is with AI designs which don't do that. It seems to me like this perspective is quite rare. For example, my post Policy Alignment [LW · GW] was about something similar to this, but I got a ton of pushback in the comments -- it seems to me like a lot of people really think the AI should use better AI concepts, not human concepts. At least they did back in 2018.

As you mention, this is partly due to overly reductionist world-views. If tables/happiness aren't reductively real, the fact that the AI is using those concepts is evidence that it's dumb/insane, right?

Illustrative excerpt from a comment [LW(p) · GW(p)] there:

From an “engineering perspective”, if I was forced to choose something right now, it would be an AI “optimizing human utility according to AI beliefs” but asking for clarification when such choice diverges too much from the “policy-approval”.

Probably most of the problem was that my post didn't frame things that well -- I was mainly talking in terms of "beliefs", rather than emphasizing ontology, which makes it easy to imagine AI beliefs are about the same concepts but just more accurate. John's description of the pointers problem might be enough to re-frame things to the point where "you need to start from human concepts, and improve them in ways humans endorse" is bordering on obvious.

(Plus I arguably was too focused on giving a specific mathematical proposal rather than the general idea.)

comment by xuan · 2021-03-30T15:42:47.285Z · LW(p) · GW(p)

Belatedly reading this and have a lot of thoughts about the connection between this issue and robustness to ontological shifts (which I've written a bit about here [AF · GW]), but I wanted to share a paper which takes a very small step in addressing some of these questions by detecting when the human's world model may diverge from a robot's world model, and using that as an explanation for why a human might seem to be acting in strange or counter-productive ways:

Where Do You Think You're Going?: Inferring Beliefs about Dynamics from Behavior
Siddharth Reddy, Anca D. Dragan, Sergey Levine
https://arxiv.org/abs/1805.08010

Inferring intent from observed behavior has been studied extensively within the frameworks of Bayesian inverse planning and inverse reinforcement learning. These methods infer a goal or reward function that best explains the actions of the observed agent, typically a human demonstrator. Another agent can use this inferred intent to predict, imitate, or assist the human user. However, a central assumption in inverse reinforcement learning is that the demonstrator is close to optimal. While models of suboptimal behavior exist, they typically assume that suboptimal actions are the result of some type of random noise or a known cognitive bias, like temporal inconsistency. In this paper, we take an alternative approach, and model suboptimal behavior as the result of internal model misspecification: the reason that user actions might deviate from near-optimal actions is that the user has an incorrect set of beliefs about the rules -- the dynamics -- governing how actions affect the environment. Our insight is that while demonstrated actions may be suboptimal in the real world, they may actually be near-optimal with respect to the user's internal model of the dynamics. By estimating these internal beliefs from observed behavior, we arrive at a new method for inferring intent. We demonstrate in simulation and in a user study with 12 participants that this approach enables us to more accurately model human intent, and can be used in a variety of applications, including offering assistance in a shared autonomy framework and inferring human preferences.
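A toy illustration of the idea in that abstract, not the paper's method. Assumptions made purely for this sketch: a one-step world on a line, a known goal at position +1, two candidate internal models of the controls ("correct" and "inverted"), a Boltzmann-rational user with a made-up rationality parameter `beta`, and each observed action treated as an independent trial starting from position 0. Actions that look suboptimal under the true dynamics are near-optimal under the "inverted" belief, so they shift the posterior toward it.

```python
import math

ACTIONS = ("left", "right")
GOAL = 1  # the user's goal position, assumed known in this sketch

# Candidate internal models: where the user believes each action moves them.
BELIEFS = {
    "correct": {"left": -1, "right": 1},
    "inverted": {"left": 1, "right": -1},
}


def action_probs(belief: str, beta: float = 3.0) -> dict:
    """Boltzmann-rational action choice under the user's believed dynamics."""
    # Value of an action = negative distance to the goal after the believed move.
    values = {a: -abs(GOAL - BELIEFS[belief][a]) for a in ACTIONS}
    normalizer = sum(math.exp(beta * v) for v in values.values())
    return {a: math.exp(beta * values[a]) / normalizer for a in ACTIONS}


def belief_posterior(observed_actions) -> dict:
    """Posterior over internal models, from a uniform prior."""
    unnormalized = {}
    for belief in BELIEFS:
        probs = action_probs(belief)
        likelihood = 1.0
        for action in observed_actions:
            likelihood *= probs[action]
        unnormalized[belief] = likelihood / len(BELIEFS)
    total = sum(unnormalized.values())
    return {b: p / total for b, p in unnormalized.items()}


post = belief_posterior(["left", "left", "left"])
print(post)
```

With three observed "left" presses, almost all posterior mass lands on the "inverted" model, which explains the behaviour as goal-directed rather than as noise.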

comment by Gunnar_Zarncke · 2020-11-20T00:54:04.621Z · LW(p) · GW(p)

I think what you call the pointers problem is mostly the grounding problem applied to values. Philosophy has long tried to find solutions to it and you can google some here.

(I'm still reading your post, but this is a quick reply.)

comment by Vanessa Kosoy (vanessa-kosoy) · 2022-01-14T16:19:13.911Z · LW(p) · GW(p)

This post states a subproblem of AI alignment which the author calls "the pointers problem". The user is regarded as an expected utility maximizer, operating according to causal decision theory. Importantly, the utility function depends on latent (unobserved) variables in the causal network. The AI operates according to a different, superior, model of the world. The problem is then, how do we translate the utility function from the user's model to the AI's model? This is very similar to the "ontological crisis" problem described by De Blanc, only De Blanc uses POMDPs instead of causal networks, and frames it in terms of a single agent changing their ontology, rather than translation from user to AI.

The question the author asks here is important, but not that novel (the author himself cites Demski [? · GW] as prior work). Perhaps the use of causal networks is a better angle, but this post doesn't do much to show it. Even so, having another exposition of an important topic, with different points of emphasis, will probably benefit many readers.

The primary aspect missing from the discussion in the post, in my opinion, is the nature of the user as a learning agent. The user doesn't have a fixed world-model: or, if they do, then this model is best seen as a prior. This observation hints at the resolution of the apparent paradox wherein the utility function is defined in terms of a wrong model. But it still requires us to explain how the utility is defined s.t. it is applicable to every hypothesis in the prior.

(What follows is no longer a "review" per se, inasmuch as a summary of my own thoughts on the topic.)

Here is a formal model of how a utility function for learning agents can work, when it depends on latent variables.

Fix a set of actions and a set of observations. We start with an ontological model which is a crisp infra-POMDP [LW · GW]. That is, there is a set of states , an initial state , a transition infra-kernel and a reward function . Here, stands for closed convex sets of probability distributions on . In other words, this is a POMDP with an underspecified transition kernel.

We then build a prior which consists of refinements of the ontological model. That is, each hypothesis in the prior is an infra-POMDP with state space , initial state , transition infra-kernel and an interpretation mapping which is a morphism of infra-POMDPs (i.e. and the obvious diagram of transition infra-kernels commutes). The reward function on is just the composition . Notice that while the ontological model must be an infra-POMDP to get a non-degenerate learning agent (moreover, it can be desirable to make it non-dogmatic about observables in some formal sense), the hypotheses in the prior can also be ordinary (Baysian) POMDPs.

Given such a prior plus a time discount function, we can consider the corresponding infra-Bayesian agent (or even just a Bayesian agent if we choose all hypotheses to be Bayesian). Such an agent optimizes rewards which depend on latent variables, even though it does not know the correct world-model in advance. It does fit the world to the immutable ontological model (which is necessary to make sense of the latent variables to which the reward function refers), but the ontological model has enough freedom to accommodate many possible worlds.
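As a toy illustration of how one reward function can apply to every hypothesis, here is a minimal Python sketch of composing the reward with an interpretation mapping. It drops the infra-Bayesian machinery entirely and uses ordinary finite state sets; the state names, the two hypotheses, and the helper `reward_on_hypothesis` are all hypothetical, chosen only for illustration.

```python
# Toy sketch: a reward defined once on ontological states is pulled back to
# each hypothesis's own state space via an interpretation mapping.
# All names and numbers are illustrative assumptions, not part of the formalism.

# Reward on the ontological state space.
r = {"friend_happy": 1.0, "friend_sad": 0.0}

# Two hypotheses with different, finer state spaces. Each maps its own
# states into ontological states (the interpretation mapping).
iota_h1 = {
    "smiling_and_content": "friend_happy",
    "smiling_but_miserable": "friend_sad",
    "frowning": "friend_sad",
}
iota_h2 = {
    "high_serotonin": "friend_happy",
    "low_serotonin": "friend_sad",
}

def reward_on_hypothesis(iota, r):
    """Return the composed reward r(iota(s)) as a dict over hypothesis states."""
    return {s: r[s_ont] for s, s_ont in iota.items()}

reward_h1 = reward_on_hypothesis(iota_h1, r)
reward_h2 = reward_on_hypothesis(iota_h2, r)
print(reward_h1)
print(reward_h2)
```

The point of the sketch is only that the agent never needs a separate reward per hypothesis: whichever hypothesis turns out to be right, its states inherit their reward from the fixed ontological model.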

The next question is then how we would transfer such a utility function from the user to the AI. Here, as noted by Demski [LW · GW], we want the AI to use not just the user's utility function but also the user's prior, because we want running such an AI to be rational from the subjective perspective of the user. This creates a puzzle: if the AI is using the same prior, and the user behaves nearly-optimally for their own prior (since otherwise how would we even infer the utility function and prior), how can the AI outperform the user?

The answer, I think, is via the AI having different action/observation channels from the user. At first glance this might seem unsatisfactory: we expect the AI to be "smarter", not just to have better peripherals. However, using Turing RL [LW(p) · GW(p)] we can represent the former as a special case of the latter. Specifically, part of the additional peripherals is access to a programmable computer, which effectively gives the AI a richer hypothesis space than the user.

The formalism I outlined here leaves many questions open, for example what kind of learning guarantees to expect in the face of possible ambiguities between observationally indistinguishable hypotheses[1]. Nevertheless, I think it creates a convenient framework for studying the question raised in the post. A potential different approach is using infra-Bayesian physicalism [LW · GW], which also describes agents with utility functions that depend on latent variables. However, it is unclear whether it's reasonable to apply the latter to humans.


  1. See also my article "RL with imperceptible rewards" [LW · GW] ↩︎

comment by Gunnar_Zarncke · 2020-11-20T01:08:40.421Z · LW(p) · GW(p)

You could learn the pointers by observing how the model is incrementally built over time. This is much more explicit in children's learning. Compare how our modern values are very different from our ancestors'.

comment by Ericf · 2020-11-19T18:39:21.290Z · LW(p) · GW(p)

With reference specifically to this:

The happiness of people I will never interact with is a good example of this. There may be people in the world whose happiness will not ever be significantly influenced by my choices. Presumably, then, my choices cannot tell us about how much I value such peoples’ happiness. And yet, I do value it.

and without considering any other part of the structure, I have an alternate view:

It is possible to determine if and how much you value the happiness (or any other attribute) of people you will never interact with by calculating

  1. What are the various things you, personally, could have done in the past [time period], and how would they have affected each of the people, plants, animals, ghosts, etc. that you might care about?
  2. What things did you actually do?
  3. How far away from your maximum impact / time were you for each entity you could have affected? (Scaled in some way TBD.)
  4. Derive values and weights from that. For example, if I donate $100 to Clean Water for Africa, that implies that I care about Clean Water & Africa more than I care about AIDS and Pakistan, and the level there depends on how much $100 means to me. If that's ten (or even two) hours of work to earn it that's a different level of commitment than if it represents 17 minutes of owning millions in assets.
  5. Run the calculation for all desired moral agents, to average out won't-ever-see-them effects.
Replies from: StellaAthena, johnswentworth
comment by StellaAthena · 2020-11-21T14:44:48.673Z · LW(p) · GW(p)
  1. Derive values and weights from that. For example, if I donate $100 to Clean Water for Africa, that implies that I care about Clean Water & Africa more than I care about AIDS and Pakistan, and the level there depends on how much $100 means to me. If that's ten (or even two) hours of work to earn it that's a different level of commitment than if it represents 17 minutes of owning millions in assets.

This will very quickly lead to incorrect conclusions, because people don’t act according to their values (especially for things that don’t impact their day to day lives like international charity). The fact that you donated $100 to Clean Water for Africa does not mean that you value that more than AIDS in Pakistan. You personally may very well care about clean water and/or Africa more than AIDS and/or Pakistan, but if you apply this sort of analysis writ large you will get egregiously wrong answers. Scott Alexander’s “Too Much Dark Money in Almonds” describes one facet of this rather well.

Another facet is that how goods are bundled matters. Did I spend $15 on almonds because I value a) almonds b) nuts c) food d) sources of protein e) snacks I can easily eat while I drive f) snacks I can put out at parties... etc. And more importantly, which of those things do I care about more than I care about Trump losing the election?

Elizabeth Anscombe’s book Intention does a good job analyzing this. When we make actions, we are not making those actions based on the state of the world; we are making those actions based on the state of the world under a particular description. One great example she gives is walking into a room and kissing a woman. Did you intend to a) kiss your girlfriend b) kiss the tallest woman in the room c) kiss the woman closest to the door wearing pink d) kiss the person who got the 13th highest mark on her history exam last week e) ...

The answer is (typically) a. You intended to kiss your girlfriend. However to an outside observer who doesn’t already have a good model of humanity at large, if not a model of you in particular, it’s unclear how they’re supposed to tell that. Most people who donate to Clean Water for Africa don’t intend to be choosing that over AIDS in Pakistan. Their actions are consistent with having that intention, but you can’t derive intentionality from brute actions.

Replies from: Ericf
comment by Ericf · 2020-11-21T18:31:58.444Z · LW(p) · GW(p)

I agree with your comment, but I think it's a scale thing. If I analyze every time you walk into a room, and every time you kiss someone, I can derive that you kiss [specific person] when you see them after being apart. And this is already being done in corporate contexts with Deep Learning for specific questions, so it's just a matter of computing power, better algorithms, and some guidance as to the relevant questions and variables.

comment by johnswentworth · 2020-11-19T19:12:12.012Z · LW(p) · GW(p)

You've mostly understood the problem-as-stated, and I like the way you're thinking about it, but there are some major loopholes in this approach.

First, I may value the happiness of agents who I cannot significantly impact via my actions - for instance, prisoners in North Korea.

Second, the actions we choose probably won't provide enough data. Suppose there are n different people, and I could give any one of them $1. I value these possibilities differently (e.g. maybe because they have different wealth/cost of living to start with, or just because I like some of them better). If we knew how much I valued each action, then we'd know how much I valued each outcome. But in fact, if I chose person 3, then all we know is that I value person 3 having the dollar more than I value anyone else having it; that's not enough information to back out how much I value each other person having the dollar. This sort of underdetermination will probably be the usual result, since the choice-of-action contains far fewer bits than a function mapping the whole action space to values.
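To make the underdetermination concrete, here is a minimal Python sketch that enumerates value assignments consistent with a single observed choice. The number of people, the coarse value grid of 1 to 3, and the variable names are all assumptions made up for illustration.

```python
# Toy sketch: one observed choice is consistent with many value assignments.
import itertools

n_people = 4
chosen = 2  # 0-based index, i.e. "person 3" received the dollar

# Candidate valuations: each person's dollar is worth 1, 2, or 3 units to me.
candidates = itertools.product([1, 2, 3], repeat=n_people)

# Keep valuations under which the chosen person is the strict best option.
consistent = [
    v for v in candidates
    if all(v[chosen] > v[i] for i in range(n_people) if i != chosen)
]

print(len(consistent))
for v in consistent:
    print(v)
```

Even on this tiny grid, several distinct valuations survive, and they disagree about how the non-chosen people rank against each other. The observed choice pins down only which entry is largest.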

Third, and arguably most important: "run the calculation for all desired moral agents" requires first identifying all the "desired moral agents", which is itself an instance of the problem in the post. What the heck is a "moral agent", and how does an AI know which ones are "desired"? These are latent variables in your world-model, and would need to be translated to something in the real world.

Replies from: Ericf
comment by Ericf · 2020-11-20T14:51:06.294Z · LW(p) · GW(p)

I was attempting to answer the first point, so let me rephrase: Even though your ability to affect prisoners in North Korea is minuscule, we can still look at how much of it you're doing. Are you spending any time seeking out ways you could be affecting them? Are you voting for and supporting and lobbying politicians who are more likely to use their greater power to affect the NK prisoners' lives? Are you doing [unknown thing that the AI figures out would affect them]? And, also, are you doing anything that is making their situation worse? Or any other of the multiple axes of being, since happiness isn't everything, and even happiness isn't a one-dimensional scale.

"Who counts as a moral agent? (And should they all have equal weights?)" is a question of philosophy, which I am not qualified to answer. But "who gets to decide the values to teach" is one meta-level up from the question of "how do we teach values", so I take it as a given for the latter problem.

Replies from: StellaAthena
comment by StellaAthena · 2021-06-20T12:55:25.384Z · LW(p) · GW(p)

This analysis falls apart when we take things to their logical extreme: I care about the happiness of humans who are time-like separated from me.

comment by adamShimi · 2020-11-19T12:59:27.151Z · LW(p) · GW(p)

Really fascinating problem! I like how your examples make me want to say "Well, the AI just has to ask about... wait a minute, that's the problem!". Taken from another point of view, you're asking how and in which context can an AI reveal our utility functions, which means revealing our latent variables.

This problem also feels related to our discussion of the locality of goals. Here you assume a non-local goal (as most human ones are), and I think that a better knowledge of how to detect/measure locality from behavior and assumptions about the agent-model might help with the pointers problem.

Replies from: johnswentworth
comment by johnswentworth · 2020-11-21T01:48:34.699Z · LW(p) · GW(p)

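Setting up the "locality of goals" concept: let's split the variables in the world model into observables $X_O$, action variables $X_A$, and latent variables $X_L$. Note that there may be multiple stages of observations and actions, so we'll only have subsets $\tilde{O} \subset O$ and $\tilde{A} \subset A$ of the observation/action variables in the decision problem. The Bayesian utility maximizer then chooses $X_{\tilde{A}}$ to maximize

$$E[u(X_L) \mid X_{\tilde{O}}, X_{\tilde{A}}]$$

... but we can rewrite that as

$$E\big[E[u(X_L) \mid X_O, X_A] \mid X_{\tilde{O}}, X_{\tilde{A}}\big]$$

Defining a new utility function $u'(X_O, X_A) := E[u(X_L) \mid X_O, X_A]$, the original problem is equivalent to:

$$\max_{X_{\tilde{A}}} E[u'(X_O, X_A) \mid X_{\tilde{O}}, X_{\tilde{A}}]$$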

In English: given the original utility function on the ("non-local") latent variables, we can integrate out the latents to get a new utility function defined only on the ("local") observation & decision variables. The new utility function yields completely identical agent behavior to the original.

So observing agent behavior alone cannot possibly let us distinguish preferences on latent variables from preferences on the "local" observation & decision variables.
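A small numerical check of this claim, as a sketch under made-up assumptions: one observation `O1` seen before acting, a binary action `A`, a binary latent `L`, and a later observation `O2` that is not available at decision time. The probability tables, the utility values, and all function names are hypothetical.

```python
# Toy check of the "integrate out the latents" identity on a tiny discrete model.
# Variables: O1 (seen before acting), A (action), L (latent), O2 (seen later).
# All probabilities and utilities are made-up illustrative numbers.

p_L1 = {(0, 0): 0.2, (0, 1): 0.7, (1, 0): 0.6, (1, 1): 0.3}  # P(L=1 | O1, A)
p_O2_1 = {0: 0.1, 1: 0.8}                                    # P(O2=1 | L)
u = {0: 0.0, 1: 10.0}                                        # utility on the latent

def p_L(l, o1, a):
    return p_L1[(o1, a)] if l == 1 else 1 - p_L1[(o1, a)]

def p_O2(o2, l):
    return p_O2_1[l] if o2 == 1 else 1 - p_O2_1[l]

def latent_value(o1, a):
    """E[u(L) | O1, A]: expected utility computed directly on the latent."""
    return sum(p_L(l, o1, a) * u[l] for l in (0, 1))

def u_prime(o1, o2, a):
    """E[u(L) | O1, O2, A]: a utility defined on observables and actions only."""
    joint = {l: p_L(l, o1, a) * p_O2(o2, l) for l in (0, 1)}
    return sum(joint[l] * u[l] for l in (0, 1)) / sum(joint.values())

def local_value(o1, a):
    """E[u'(O1, O2, A) | O1, A]: average u' over the not-yet-seen observation O2."""
    total = 0.0
    for o2 in (0, 1):
        p_o2 = sum(p_L(l, o1, a) * p_O2(o2, l) for l in (0, 1))
        total += p_o2 * u_prime(o1, o2, a)
    return total

for o1 in (0, 1):
    best_latent = max((0, 1), key=lambda a: latent_value(o1, a))
    best_local = max((0, 1), key=lambda a: local_value(o1, a))
    print(o1, best_latent, best_local)
```

In this toy model the two objectives agree for every observation and action, so the chosen action is the same whichever utility function the agent "really" has.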

comment by romeostevensit · 2020-11-18T23:17:03.619Z · LW(p) · GW(p)

Over the last few posts the recurrent thought I have is "why aren't you talking about compression more explicitly?"

Replies from: johnswentworth
comment by johnswentworth · 2020-11-18T23:30:38.158Z · LW(p) · GW(p)

Could you uncompress this comment a bit please?

Replies from: romeostevensit
comment by romeostevensit · 2020-11-19T23:54:08.740Z · LW(p) · GW(p)

A pointer is sort of the ultimate in lossy compression. Just an index to the uncompressed data, like a legible compression library. Wireheading is a goodhearting problem, which is a lossy compression problem etc.

Replies from: PhilGoetz, Mo Nastri
comment by PhilGoetz · 2023-09-02T03:14:53.937Z · LW(p) · GW(p)

What do you mean by a goodhearting problem, & why is it a lossy compression problem?  Are you using "goodhearting" to refer to Goodhart's Law?

comment by Mo Putera (Mo Nastri) · 2021-07-12T07:33:15.019Z · LW(p) · GW(p)

My impression is that you consider this obvious, when in fact I found this an insightful framing. So thanks.

comment by PhilGoetz · 2023-09-02T03:07:35.808Z · LW(p) · GW(p)

I'll preface this by saying that I don't see why it's a problem, for purposes of alignment, for human values to refer to non-existent entities.  This should manifest as humans and their AIs wasting some time and energy trying to optimize for things that don't exist, but this seems irrelevant to alignment.  If the AI optimizes for the same things that don't exist as humans do, it's still aligned; it isn't going to screw things up any worse than humans do.

But I think it's more important to point out that you're joining the same metaphysical goose chase that has made Western philosophy nonsense since before Plato.

You need to distinguish between the beliefs and values a human has in its brain, and the beliefs & values it expresses to the external world in symbolic language.  I think your analysis concerns only the latter.  If that's so, you're digging up the old philosophical noumena / phenomena distinction, which itself refers to things that don't exist (noumena).

Noumena are literally ghosts; "soul", "spirit", "ghost", "nature", "essence", and "noumena" are, for practical purposes, synonyms in philosophical parlance.  The ghost of a concept is the metaphysical entity which defines what assemblages in the world are and are not instances of that concept.

But at a fine enough level of detail, not only are there no ghosts, there are no automobiles or humans.  The Buddhist and post-modernist objections to the idea that language can refer to the real world are that the referents of "automobiles" are not exactly, precisely, unambiguously,  unchangingly, completely, reliably specified, in the way Plato and Aristotle thought words should be.  I.e., the fact that your body gains and loses atoms all the time means, for these people, that you don't "exist".

Plato, Aristotle, Buddhists, and post-modernists all assumed that the only possible way to refer to the world is for noumena to exist, which they don't. When you talk about "valuing the actual state of the world," you're indulging in the quest for complete and certain knowledge, which requires noumena to exist. You're saying, in your own way, that knowing whether your values are satisfied or optimized requires access to what Kant called the noumenal world. You think that you need to be absolutely, provably correct when you tell an AI that one of two worlds is better. So those objections apply to your reasoning, which is why all of this seems to you to be a problem.

The general dissolution of this problem is to admit that language always has slack and error.  Even direct sensory perception always has slack and error.  The rationalist, symbolic approach to AI safety, in which you must specify values in a way that provably does not lead to catastrophic outcomes, is doomed to failure for these reasons, which are the same reasons that the rationalist, symbolic approach to AI was doomed to failure (as almost everyone now admits).  These reasons include the fact that claims about the real world are inherently unprovable, which has been well-accepted by philosophers since Kant's Critique of Pure Reason.

That's why continental philosophy is batshit crazy today.  They admitted that facts about the real world are unprovable, but still made the childish demand for absolute certainty about their beliefs.  So, starting with Hegel, they invented new fantasy worlds for our physical world to depend on, all pretty much of the same type as Plato's or Christianity's, except instead of "Form" or "Spirit", their fantasy worlds are founded on thought (Berkeley), sense perceptions (phenomenologists), "being" (Heidegger), music, or art.

The only possible approach to AI safety is one that depends not on proofs using symbolic representations, but on connectionist methods for linking mental concepts to the hugely-complicated structures of correlations in sense perceptions which those concepts represent, as in deep learning.  You could, perhaps, then construct statistical proofs that rely on the over-determination of mental concepts to show almost-certain convergence between the mental languages of two different intelligent agents operating in the same world.  (More likely, the meanings which two agents give to the same words don't necessarily converge, but agreement on the probability estimates given to propositions expressed using those same words will converge.)

Fortunately, all mental concepts are over-determined.  That is, we can't learn concepts unless the relevant sense data that we've sensed contains much more information than do the concepts we learned.  That comes automatically from what learning algorithms do.  Any algorithm which constructed concepts that contained more information than was in the sense data, would be a terrible, dysfunctional algorithm.

You are still not going to get a proof that two agents interpret all sentences exactly the same way.  But you might be able to get a proof which shows that catastrophic divergence is likely to happen less than once in a hundred years, which would be good enough for now.

Perhaps what I'm saying will be more understandable if I talk about your case of ghosts.  Whether or not ghosts "exist", something exists in the brain of a human who says "ghost".  That something is a mental structure, which is either ultimately grounded in correlations between various sensory perceptions, or is ungrounded.  So the real problem isn't whether ghosts "exist"; it's whether the concept "ghost" is grounded, meaning that the thinker defines ghosts in some way that relates them to correlations in sense perceptions.  A person who thinks ghosts fly, moan, and are translucent white with fuzzy borders, has a grounded concept of ghost.  A person who says "ghost" and means "soul" has an ungrounded concept of ghost.

Ungrounded concepts are a kind of noise or error in a representational system.  Ungrounded concepts give rise to other ungrounded concepts, as "soul" gave rise to things like "purity", "perfection", and "holiness".  I think it highly probable that grounded concepts suppress ungrounded concepts, because all the grounded concepts usually provide evidence for the correctness of the other grounded concepts.  So probably sane humans using statistical proofs don't have to worry much about whether every last concept of theirs is grounded, but as the number of ungrounded concepts increases, there is a tipping point beyond which the ungrounded concepts can be forged into a self-consistent but psychotic system such as Platonism, Catholicism, or post-modernism, at which point they suppress the grounded concepts.

Sorry that I'm not taking the time to express these things clearly.  I don't have the time today, but I thought it was important to point out that this post is diving back into the 19th-century continental grappling with Kant, with the same basic presupposition that led 19th-century continental philosophers to madness.  TL;DR:  AI safety can't rely on proving statements made in human or other symbolic languages to be True or False, nor on having complete knowledge about the world.

Replies from: johnswentworth
comment by johnswentworth · 2023-09-02T03:48:46.568Z · LW(p) · GW(p)

You need to distinguish between the beliefs and values a human has in its brain, and the beliefs & values it expresses to the external world in symbolic language.  I think your analysis concerns only the latter.

My analysis was intended to concern primarily the former; I care about the latter almost exclusively insofar as it provides evidence relevant to the former.

comment by Charlie Steiner · 2020-11-18T18:39:15.558Z · LW(p) · GW(p)

I think that one of the problems in this post is actually easier in the real world than in the toy model.

In the toy model the AI has to succeed by maximizing the agent's True Values, which the agent is assumed to have as a unique function over its model of the world. This is a very tricky problem, especially when, as you point out, we might allow the agent's model of reality to be wrong in places.

But in the real world, humans don't have a unique set of True Values or even a unique model of the world - we're non-Cartesian, which means that when we talk about our values, we are assuming a specific sort of way of talking about the world, and there are other ways of talking about the world in which talk about our values doesn't make sense.

Thus in the real world we cannot require that the AI has to maximize humans' True Values, we can only ask that it models humans (and we might have desiderata about how it does that modeling and what the end results should contain), and satisfy the modeled values. And in some ways this is actually a bit reassuring, because I'm pretty sure that it's possible to get better final results on this problem than on learning the toy model agent's True Values - maybe not in the most simple case, but as you add things like lack of introspection, distributional shift, meta-preferences like identifying some behavior as "bias," etc.

EDIT FROM THE FUTURE: Stumbled across this comment 10 months later, and felt like my writing style was awkward and hard to understand - followed by a "click" where suddenly it became obvious and natural to me. Now I worry everyone else gets the awkward and hard to understand version all the time.

Replies from: johnswentworth
comment by johnswentworth · 2020-11-18T19:31:31.237Z · LW(p) · GW(p)

This comment seems wrong to me in ways that make me think I'm missing your point.

Some examples and what seems wrong about them, with the understanding that I'm probably misunderstanding what you're trying to point to:

we're non-Cartesian, which means that when we talk about our values, we are assuming a specific sort of way of talking about the world, and there are other ways of talking about the world in which talk about our values doesn't make sense

I have no idea why this would be tied to non-Cartesian-ness.

But in the real world, humans don't have a unique set of True Values or even a unique model of the world

There are certainly ways in which humans diverge from Bayesian utility maximization, but I don't see why we would think that values or models are non-unique. Certainly we use multiple levels of abstraction, or multiple sub-models, but that's quite different from having multiple distinct world-models.

Thus in the real world we cannot require that the AI has to maximize humans' True Values, we can only ask that it models humans [...] and satisfy the modeled values.

How does this follow from non-uniqueness of values/world models? If humans have more than one set of values, or more than one world model, then this seems to say "just pick one set of values/one world model and satisfy that", which seems wrong.

One way to interpret all this is that you're pointing to things like submodels, subagents, multiple abstraction levels, etc. But then I don't see why the problem would be any easier in the real world than in the model, since all of those things can be expressed in the model (or a straightforward extension of the model, in the case of subagents).

Replies from: Charlie Steiner
comment by Charlie Steiner · 2020-11-18T23:28:37.070Z · LW(p) · GW(p)

Yes, the point is multiple abstraction levels (or at least multiple abstractions, ordered into levels or not). But not multiple abstractions used by humans, multiple abstractions used on humans.

If you don't agree with me on this, why didn't you reply when I spent about six months just writing posts that were all variations of this idea? Here's Scott Alexander making the basic point.

It's like... is there a True rational approximation of pi? Well, 22/7 is pretty good, but 355/113 is more precise, if harder to remember. And just 3 is really easy to remember, but not as precise. And of course there's the arbitrarily large "approximation" that is 3.141592... Depending on what you need to use it for, you might have different preferences about the tradeoff between simplicity and precision. There is no True rational approximation of pi. True Human Values are similar, except instead of one tradeoff that you can make it's approximately one bajillion.
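The tradeoff can be seen directly with Python's standard `fractions` module, which finds the closest fraction under a denominator cap. This is just a sketch; the particular caps (1, 10, 1000) are arbitrary choices for illustration.

```python
from fractions import Fraction
import math

pi = Fraction(math.pi)  # exact rational value of the float closest to pi

# A looser cap gives a simpler fraction; a tighter error needs a bigger denominator.
for max_den in (1, 10, 1000):
    approx = pi.limit_denominator(max_den)
    error = abs(float(approx) - math.pi)
    print(f"denominator <= {max_den}: {approx} (error {error:.2e})")
```

Each cap yields a different "best" answer (3, then 22/7, then 355/113), and none of them is the True one; which is best depends on how much simplicity you are willing to trade for precision.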

  • we're non-Cartesian, which means that when we talk about our values, we are assuming a specific sort of way of talking about the world, and there are other ways of talking about the world in which talk about our values doesn't make sense

I have no idea why this would be tied to non-Cartesian-ness.

If a Cartesian agent was talking about their values, they could just be like "you know, those things that are specified as my values in the logic-stuff my mind is made out of." (Though this assumes some level of introspective access / genre savviness that needn't be assumed, so if you don't want to assume this then we can just say I was mistaken.) When a human talks about their values they can't take that shortcut, and instead have to specify values as a function of how they affect their behavior. This introduces the dependency on how we're breaking down the world into categories like "human behavior."

  • Thus in the real world we cannot require that the AI has to maximize humans' True Values, we can only ask that it models humans [...] and satisfy the modeled values.

How does this follow from non-uniqueness of values/world models? If humans have more than one set of values, or more than one world model, then this seems to say "just pick one set of values/one world model and satisfy that", which seems wrong.

Well, if there were unique values, we could say "maximize the unique values." Since there aren't, we can't. We can still do some similar things, and I agree, those do seem wrong. See this post [LW · GW] for basically my argument for what we're going to have to do with that wrong-seeming.

Replies from: johnswentworth
comment by johnswentworth · 2020-11-19T00:18:53.683Z · LW(p) · GW(p)

Well, if there were unique values, we could say "maximize the unique values." Since there aren't, we can't. We can still do some similar things, and I agree, those do seem wrong. See this post [LW · GW] for basically my argument for what we're going to have to do with that wrong-seeming.

Before I get into the meat of the response... I certainly agree that values are probably a partial order, not a total order. However, that still leaves basically all the problems in the OP: that partial order is still a function of latent variables in the human's world-model, which still gives rise to all the same problems as a total order in the human's world-model. (Intuitive way to conceptualize this: we can represent the partial order as a set of total orders, i.e. represent the human as a set of utility-maximizing subagents [LW · GW]. Each of those subagents is still a normal Bayesian utility maximizer, and still suffers from the problems in the OP.)
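A minimal Python sketch of that representation, with hypothetical outcome names and subagent utilities: one outcome is preferred to another in the partial order only when every subagent's total order agrees.

```python
# Toy sketch: a partial preference order represented as a set of total orders,
# one per "subagent". Outcome names and utilities are illustrative assumptions.

subagent_utils = [
    {"a": 3, "b": 2, "c": 1},  # subagent 1's total order: a > b > c
    {"a": 3, "b": 1, "c": 2},  # subagent 2's total order: a > c > b
]

def prefers(x, y):
    """x is preferred to y in the partial order iff every subagent ranks x above y."""
    return all(util[x] > util[y] for util in subagent_utils)

def comparable(x, y):
    """Two outcomes are comparable iff the subagents unanimously rank one above the other."""
    return prefers(x, y) or prefers(y, x)

print(prefers("a", "b"))     # both subagents agree on a over b
print(comparable("b", "c"))  # subagents disagree, so b and c are incomparable
```

Note that each entry in `subagent_utils` is still a utility function over whatever the outcome labels refer to in the human's world-model, so the pointers problem applies to each subagent separately.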

Anyway, I don't think that's the main disconnect here...

Yes, the point is multiple abstraction levels (or at least multiple abstractions, ordered into levels or not). But not multiple abstractions used by humans, multiple abstractions used on humans.

Ok, I think I see what you're saying now. I am of course on board with the notion that e.g. human values do not make sense when we're modelling the human at the level of atoms. I also agree that the physical system which comprises a human can be modeled as wanting different things at different levels of abstraction.

However, there is a difference between "the physical system which comprises a human can be interpreted as wanting different things at different levels of abstraction", and "there is not a unique, well-defined referent of 'human values'". The former does not imply the latter. Indeed, the difference is essentially the same issue in the OP: one of these statements has a type-signature which lives in the physical world, while the other has a type-signature which lives in a human's model.

An analogy: consider a robot into which I hard-code a utility function and world model. This is a physical robot; on the level of atoms, its "goals" do not exist in any more real a sense than human values do. As with humans, we can model the robot at multiple levels of abstraction, and these different models may ascribe different "goals" to the robot - e.g. modelling it at the level of an electronic circuit or at the level of assembly code may ascribe different goals to the system, there may be subsystems with their own little control loops, etc.

And yet, when I talk about the utility function I hard-coded into the robot, there is no ambiguity about which thing I am talking about. "The utility function I hard-coded into the robot" is a concept within my own world-model. That world-model specifies the relevant level of abstraction at which the concept lives. And it seems pretty clear that "the utility function I hard-coded into the robot" would correspond to some unambiguous thing in the real world - although specifying exactly what that thing is, is an instance of the pointers problem.

Does that make sense? Am I still missing something here?

comment by Jesse Richardson (SharkoRubio) · 2023-05-18T05:11:35.194Z · LW(p) · GW(p)

Furthermore, human values are over the “true” values of the latents, not our estimates - e.g. I want other people to actually be happy, not just to look-to-me like they’re happy.

I'm not sure that I'm convinced of this. I think when we say we value reality over our perception it's because we have no faith in our perception to stay optimistically detached from reality. If I think about how I want my friends to be happy, not just appear happy to me, it's because of a built-in assumption that if they appear happy to me but are actually depressed, the illusion will inevitably break. So in this sense I care not just about my estimate of a latent variable, but what my future retroactive estimates will be. I'd rather my friend actually be happy than be perfectly faking it for the same reason I save money and eat healthy - I care about future me. 

What about this scenario: my friend is unhappy for a year while I think they're perfectly happy, then at the end of the year they are actually happy but they reveal to me they've been depressed for the last year. Why is future me upset in this scenario, why does current me want to avoid this? Well because latent variables aren't time-specific, I care about the value of latent variables in the future and the past, albeit less so. To summarize: I care about my own happiness across time and future me cares about my friend's happiness across time, so I end up caring about the true value of the latent variable (my friend's happiness). But this is an instrumental value, I care about the true value because it affects my estimates, which I care about intrinsically.

comment by Aditya Prasad (aditya-prasad-1) · 2022-10-07T18:20:43.882Z · LW(p) · GW(p)

human values are over the “true” values of the latents, not our estimates - e.g. I want other people to actually be happy, not just to look-to-me like they’re happy.

 

But this is not what our current value system is, we did not evolve such a pointer. Humans will be happy if their senses are deceived. The value system we have is currently over our estimates and that is exactly why we can be manipulated. It is just that till now we did not have an intelligence trying to adversarially fool us. So the value function we need to imbibe is one we don't even have an existence proof of.

 

I found this post really useful for clarifying what the outer alignment problem really is. Like others mentioned in the comments, I think we should give up predictive power in exchange for the AI adopting our world model. There would still be a lot of value to be unpacked, and the predictive power would still be far better than anything humans have seen so far. Maybe one day we can figure out how to align an AI which is allowed to form its own, more powerful world model.

Current methods seem to be applying optimisation pressure to maximise predictive power which will push the AI away from adopting human like world models. 

It seems to come down to how do you traverse the ladder of abstraction, when some things you value are useful rather than true beliefs.