Half-baked idea: a straightforward method for learning environmental goals?

post by Q Home · 2025-02-04T06:56:31.813Z · LW · GW · 7 comments

Contents

  Explanation 1
    One naive solution
    One philosophical argument
    One toy example
  Explanation 2
    Formalization
7 comments

Epistemic status: I want to propose a method of learning environmental goals (a super big, super important subproblem in Alignment). It's informal, so it has a lot of gaps. I worry I missed something obvious, rendering my argument completely meaningless. I asked the LessWrong feedback team, but they couldn't get someone knowledgeable enough to take a look.

Can you tell me the biggest conceptual problems with my method? Can you tell me if agent foundations [? · GW] researchers are aware of this method or not?

If you're not familiar with the problem, here's the context: Environmental goals; identifying causal goal concepts from sensory data; ontology identification problem; Pointers Problem [LW · GW]; Eliciting Latent Knowledge [? · GW].

Explanation 1

One naive solution

Imagine we have a room full of animals. AI sees the room through a camera. How can AI learn to care about the real animals in the room rather than their images on the camera?

Assumption 1. Let's assume AI models the world as a bunch of objects interacting in space and time. I don't know how critical or problematic this assumption is.

Idea 1. Animals in the video are objects with certain properties (they move continuously, they move with certain relative speeds, they have certain sizes, etc). Let's make the AI search for a world-model which contains objects with similar properties (call them P properties).

Problem 1. Ideally, AI will find clouds of atoms which move similarly to the animals on the video. However, AI might just find a world-model (X) which contains the screen of the camera. So it'll end up caring about "movement" of the pixels on the screen. Fail.

Observation 1. Our world contains many objects with P properties which don't show up on the camera. So X is not the world-model containing the biggest number of objects with P properties.

Idea 2. Let's make the AI search for the best world-model containing the biggest number of objects with P properties.

Question 1. For "Idea 2" to make practical sense, we need to find a smart way to limit the complexity of the models. Otherwise the AI could pad any model with arbitrarily many objects of any kind. Can we find the right complexity restriction?

Question 2. Assume we resolved the previous question positively. What if "Idea 2" still produces an alien ontology humans don't care about? Can it happen?

Question 3. Assume everything works out. How do we know that this is a general method of solving the problem? We have an object in sense data (A), we care about the physical thing corresponding to it (B): how do we know B always behaves similarly to A and there are always more instances of B than of A?
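
To make Idea 2 (and the complexity restriction from Question 1) a bit more concrete, here's a minimal sketch in Python. Everything in it is made up for illustration: the `Obj` class, the particular thresholds standing in for P properties, and the penalty weight `lam`. It's not a proposal for how perception or model search would actually work.

```python
# A minimal sketch of Idea 2 plus the complexity restriction from Question 1.
# Everything here (the Obj class, the thresholds, the penalty weight) is a
# made-up illustration, not a real proposal for perception or model search.

from dataclasses import dataclass

@dataclass
class Obj:
    max_jump: float  # largest frame-to-frame displacement (a crude continuity proxy)
    speed: float     # typical speed, in m/s
    size: float      # characteristic size, in meters

def has_P_properties(o: Obj) -> bool:
    # Stand-in for "moves continuously, at animal-like speeds, has animal-like size".
    return o.max_jump < 0.1 and 0.01 < o.speed < 20.0 and 0.05 < o.size < 3.0

def score(objects: list[Obj], complexity: float, lam: float = 1.0) -> float:
    # Idea 2: reward the number of P-objects; Question 1: penalize model complexity.
    return sum(has_P_properties(o) for o in objects) - lam * complexity

# World-model X: only the camera screen exists. Its "objects" are pixel blobs,
# which can teleport whenever the camera cuts or an animal leaves the frame.
screen_model = [Obj(max_jump=5.0, speed=3.0, size=0.4) for _ in range(10)]

# A physical world-model: clouds of atoms tracking the animals, including the
# ones that never show up on camera.
physical_model = [Obj(max_jump=0.01, speed=1.0, size=0.3) for _ in range(40)]

print(score(screen_model, complexity=1.0))    # 0 P-objects minus the penalty
print(score(physical_model, complexity=5.0))  # wins despite being more complex
```

Question 1 is then the question of whether `complexity` and `lam` can be defined in a principled way, so that a model can't win just by postulating huge numbers of cheap P-objects.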

One philosophical argument

I think there's a philosophical argument which allows us to resolve Questions 2 & 3 (giving evidence that Question 1 should be resolvable too).

If the argument is true, the pointers problem should be solvable without the Natural Abstraction hypothesis [? · GW] being true.

Anyway, I'll add a toy example which hopefully makes it easier to understand what this is all about.

One toy example

You're inside a 3D video game. 1st person view. The game contains landscapes and objects, both made of small balls (the size of tennis balls) of different colors. Also a character you control.

The character can push objects. Objects can break into pieces. Physics is Newtonian. Balls are held together by some force. Balls can have dramatically different weights.

Light is modeled by particles. The sun emits particles, and they bounce off of surfaces.

The most unusual thing: as you move, your coordinates are fed into a pseudorandom number generator. The numbers from the generator are then used to swap the positions of arbitrary balls.

You care about pushing boxes (like everything else, they're made of balls too) into a certain location.

...

So, the reality of the game has roughly 5 levels:

  1. The level of sense data (2D screen of the 1st person view).
  2. A. The level of ball structures. B. The level of individual balls.
  3. A. The level of waves of light particles. B. The level of individual light particles.

I think AI should be able to figure out that it needs to care about level 2A of reality. Because ball structures are much simpler to control (by doing normal activities with the game's character) than individual balls. And light particles are harder to interact with than ball structures, due to their speed and nature.


Explanation 2

An alternative explanation of my argument:

  1. Imagine activities which are crucial for a normal human life. For example: moving yourself in space (in a certain speed range); moving other things in space (in a certain speed range); staying in a single spot (for a certain time range); moving in a single direction (for a certain time range); having varied visual experiences (changing in a certain frequency range); etc. Those activities can be abstracted into mathematical properties of certain variables (speed of movement, continuity of movement, etc). Let's call them "fundamental variables". Fundamental variables are defined using sensory data or abstractions over sensory data.
  2. Some variables can be optimized (for a long enough period of time) by fundamental variables. Other variables can't be optimized (for a long enough period of time) by fundamental variables. For example: proximity of my body to my bed is an optimizable variable (I can walk towards the bed — walking is a normal activity); the amount of things I see is an optimizable variable (I can close my eyes or hide some things — both actions are normal activities); closeness of two particular oxygen molecules might be a non-optimizable variable (it might be impossible to control their positions without doing something weird). (A toy sketch of this optimizability check follows after this list.)
  3. By default, people only care about optimizable variables. Unless there are special philosophical reasons to care about some obscure non-optimizable variable which doesn't have any significant effect on optimizable variables.
  4. You can have a model which describes typical changes of an optimizable variable. Models of different optimizable variables have different predictive power. For example, "positions & shapes of chairs" and "positions & shapes of clouds of atoms" are both optimizable variables, but models of the latter have much greater predictive power. Complexity of the models needs to be limited, by the way, otherwise all models will have the same predictive power.
  5. Collateral conclusions: typical changes of any optimizable variable are easily understandable by a human (since it can be optimized by fundamental variables, based on typical human activities); all optimizable variables are "similar" to each other, in some sense (since they all can be optimized by the same fundamental variables); there's a natural hierarchy of optimizable variables (based on predictive power). Main conclusion: while the true model of the world might be infinitely complex, physical things which ground humans' high-level concepts (such as "chairs", "cars", "trees", etc.) always have to have a simple model (which works most of the time, where "most" has a technical meaning determined by fundamental variables).
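
To illustrate point 2, here's a toy Python sketch of one way "optimizable by fundamental variables" could be operationalized. The 1-D world, the two variables, and the greedy search are all invented for the example; the only point is the asymmetry between the bed-distance variable and the molecule-gap variable.

```python
# A toy sketch of "optimizable by fundamental variables" (point 2 above).
# The 1-D world, the variables, and the greedy search are all made up; the
# only point is the asymmetry between the two variables.

import random

def step(state: dict, action: float, noise: float) -> dict:
    new = dict(state)
    # Fundamental action: move yourself by a bounded, humanly comprehensible amount.
    new["my_position"] += max(-1.0, min(1.0, action))
    # Exogenous micro-detail: the action has no influence on it at all.
    new["molecule_gap"] = noise
    return new

def distance_to_bed(state): return abs(state["my_position"] - state["bed_position"])
def molecule_gap(state):    return abs(state["molecule_gap"])

def is_optimizable(variable, horizon=50, trials=200, target=0.1) -> bool:
    # A variable counts as optimizable if greedy search over fundamental actions
    # reliably drives it below the target from random starting states.
    wins = 0
    for _ in range(trials):
        state = {"my_position": random.uniform(-10, 10),
                 "bed_position": 3.0,
                 "molecule_gap": random.gauss(0.0, 1.0)}
        for _ in range(horizon):
            noise = random.gauss(0.0, 1.0)  # same noise whichever action is tried
            candidates = [step(state, a, noise) for a in (-1.0, -0.1, 0.0, 0.1, 1.0)]
            state = min(candidates, key=variable)
        wins += variable(state) < target
    return wins / trials > 0.9

print(is_optimizable(distance_to_bed))  # True: walking controls it
print(is_optimizable(molecule_gap))     # False: fundamental actions don't touch it
```

A real criterion would have to be much more careful about what counts as a "fundamental" action and what counts as "reliably", but this is the shape of the distinction.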

Formalization

So, the core of my idea is this:

  1. AI is given "P properties" which a variable of its world-model might have. (Let's call a variable with P properties P-variable.)
  2. AI searches for a world-model with the biggest number of P-variables. AI makes sure it doesn't introduce useless P-variables. We also need to be careful with how we measure the "amount" of P-variables: we need to measure something like "density" rather than "amount" (i.e. the number of P-variables contributing to a particular relevant situation, rather than the number of P-variables overall?). See the sketch after this list.
  3. AI gets an interpretable world-model (because P-variables are highly interpretable), adequate for defining what we care about (because by default, humans only care about P-variables).
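
Here's a minimal sketch of the "density vs. amount" bookkeeping from step 2, assuming a world-model is just a dependency graph mapping each variable to (is it a P-variable?, its parents). The example graph is made up; the point is only that padding a model with useless P-variables inflates the raw count but not the density relative to a situation we care about.

```python
# A minimal sketch of the "density vs. amount" distinction from step 2,
# assuming a world-model is just a dependency graph:
#   variable name -> (is it a P-variable?, list of parent variables).
# The example graph is made up; it only illustrates the bookkeeping.

def ancestors(model: dict, var: str) -> set:
    # All variables that the given variable depends on, directly or indirectly.
    seen, stack = set(), list(model[var][1])
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(model[v][1])
    return seen

def p_amount(model: dict) -> int:
    # Raw count of P-variables in the whole model (easy to game by padding).
    return sum(is_p for is_p, _ in model.values())

def p_density(model: dict, situation: str) -> int:
    # Only count P-variables that actually feed into the situation we care about.
    return sum(model[v][0] for v in ancestors(model, situation))

model = {
    "animal_positions": (True,  []),
    "camera_image":     (False, ["animal_positions"]),
    "padding_1":        (True,  []),  # useless P-variables bolted on to game the count
    "padding_2":        (True,  []),
}

print(p_amount(model))                             # 3: padding inflates the raw count
print(p_density(model, situation="camera_image"))  # 1: only animal_positions matters here
```

A real version would need a better notion of "contributing to a situation" than graph ancestry, but this is the distinction I'm gesturing at.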

How far are we from being able to do something like this? Are agent foundations researchers pursuing this or something else?

7 comments

Comments sorted by top scores.

comment by Capybasilisk · 2025-02-05T11:10:44.114Z · LW(p) · GW(p)

You may be interested in this article:

Model-Based Utility Functions

Orseau and Ring, as well as Dewey, have recently described problems, including self-delusion, with the behavior of agents using various definitions of utility functions. An agent's utility function is defined in terms of the agent's history of interactions with its environment. This paper argues, via two examples, that the behavior problems can be avoided by formulating the utility function in two steps: 1) inferring a model of the environment from interactions, and 2) computing utility as a function of the environment model. Basing a utility function on a model that the agent must learn implies that the utility function must initially be expressed in terms of specifications to be matched to structures in the learned model. These specifications constitute prior assumptions about the environment so this approach will not work with arbitrary environments. But the approach should work for agents designed by humans to act in the physical world. The paper also addresses the issue of self-modifying agents and shows that if provided with the possibility to modify their utility functions agents will not choose to do so, under some usual assumptions.

Also, regarding this part of your post:

For example: moving yourself in space (in a certain speed range)

This range is quite huge. In certain contexts, you'd want to be moving through space at high fractions of the speed of light, rather than walking speed. Same goes for moving other objects through space. Btw, would you count a data packet as an object you move through space?

staying in a single spot (for a certain time range)

Hopefully the AI knows you mean moving in sync with Earth's movement through space.

Replies from: Q Home
comment by Q Home · 2025-02-07T11:01:31.104Z · LW(p) · GW(p)

Thank you for actually engaging with the idea (pointing out problems and whatnot) rather than just suggesting reading material.

Btw, would you count a data packet as an object you move through space?

A couple of points:

  • I only assume AI models the world as "objects" moving through space and time, without restricting what those objects could be. So yes, a data packet might count.
  • "Fundamental variables" don't have to capture all typical effects of humans on the world, they only need to capture typical human actions which humans themselves can easily perceive and comprehend. So the fact that a human can send an Internet message at 2/3 speed of light doesn't mean that "2/3 speed of light" should be included in the range of fundamental variables, since humans can't move and react at such speeds.
  • Conclusion: data packets can be seen as objects, but there are many other objects which are much easier for humans to interact with.
  • Also note that fundamental variables are not meant to be some kind of "moral speed limits", prohibiting humans or AIs from acting at certain speeds. Fundamental variables are only needed to figure out what physical things humans can most easily interact with (because those are the objects humans are most likely to care about).

This range is quite huge. In certain contexts, you'd want to be moving through space at high fractions of the speed of light, rather than walking speed. Same goes for moving other objects through space.

What contexts do you mean? Maybe my point about "moral speed limits" addresses this.

Hopefully the AI knows you mean moving in sync with Earth's movement through space.

Yes, relativity of motion is a problem which needs to be analyzed. Fundamental variables should refer to relative speeds/displacements or something.


The paper is surely at least partially relevant, but what's your own opinion on it? I'm confused about this part (4.2, "Defining Utility Functions in Terms of Learned Models"):

For example a person may be specified by textual name and address, by textual physical description, and by images and other recordings. There is very active research on recognizing people and objects by such specifications (Bishop, 2006; Koutroumbas and Theodoris, 2008; Russell and Norvig, 2010). This paper will not discuss the details of how specifications can be matched to structures in learned environment models, but assumes that algorithms for doing this are included in the utility function implementation.

Does it just completely ignore the main problem?

I know Abram Demski wrote about Model-based Utility Functions, but I couldn't fully understand [LW(p) · GW(p)] his post either.

(Disclaimer: I'm almost mathematically illiterate, except knowing a lot of mathematical concepts from popular materials. Halting problem, Godel, uncountability, ordinals vs. cardinals, etc.)

Replies from: Capybasilisk
comment by Capybasilisk · 2025-02-07T23:49:31.928Z · LW(p) · GW(p)

Also note that fundamental variables are not meant to be some kind of “moral speed limits”, prohibiting humans or AIs from acting at certain speeds. Fundamental variables are only needed to figure out what physical things humans can most easily interact with (because those are the objects humans are most likely to care about).

Ok, that clears things up a lot. However, I still worry that if it's at the AI's discretion when and where to sidestep the fundamental variables, we're back at the regular alignment problem. You have to be reasonably certain what the AI is going to do in extremely out of distribution scenarios.

Replies from: Q Home
comment by Q Home · 2025-02-08T09:53:16.077Z · LW(p) · GW(p)

The subproblem of environmental goals is just to make AI care about natural enough (from the human perspective) "causes" of sensory data, not to align AI to the entirety of human values. Fundamental variables have no (direct) relation to the latter problem.

However, fundamental variables would be helpful for defining impact measures if we had a principled way to differentiate "times when it's OK to sidestep fundamental variables" from "times when it's NOT OK to sidestep fundamental variables". That's where the things you're talking about definitely become a problem. Or maybe I'm confused about your point.

Replies from: Capybasilisk
comment by Capybasilisk · 2025-02-10T09:29:43.863Z · LW(p) · GW(p)

Thanks. That makes sense.

comment by Charlie Steiner · 2025-02-11T21:16:01.577Z · LW(p) · GW(p)

So at first I thought this didn't include a step where the AI learns to care about things - it only learns to model things. But I think actually you're assuming that we can just directly use the model to pick actions that have predicted good outcomes - which are going to be selected as "good" according to the pre-specified P-properties. This is a flaw because it's leaving too much hard work for the specifiers to do - we want the environment to do way more work at selecting what's "good."

The second problem comes in two flavors - object level and meta level. The object level problem is that sometimes your AI will assign your P-properties to atoms and quantum fields ("What they want is to obey the laws of physics. What they believe is their local state."), or your individual cells, etc. The meta level problem is that trying to get the AI to assign properties in a human-approved way is a complicated problem that you can only do so well without communicating with humans. (John Wentworth disagrees more or less, check out things tagged Natural Abstractions for more reading, but also try not to get too confirmation-biased.)

Another potential complication is the difficulty of integrating some features of this picture with modern machine learning. I think it's fine to do research that assumes a POMDP world model or whatever. But demonstrations of alignment theories working in gridworlds have a real hard time moving me, precisely because they often let you cheat (and let you forget that you cheated) on problems one and two.

Replies from: Q Home
comment by Q Home · 2025-02-12T01:33:59.164Z · LW(p) · GW(p)

So at first I thought this didn't include a step where the AI learns to care about things - it only learns to model things. But I think actually you're assuming that we can just directly use the model to pick actions that have predicted good outcomes - which are going to be selected as "good" according to the pre-specified P-properties. This is a flaw because it's leaving too much hard work for the specifiers to do - we want the environment to do way more work at selecting what's "good."

I assume we get an easily interpretable model where the difference between "real strawberries" and "pictures of strawberries" and "things sometimes correlated with strawberries" is easy to define, so we can use the model to directly pick the physical things AI should care about. I'm trying to address the problem of environmental goals, not the problem of teaching AI morals. Or maybe I'm misunderstanding your point?

The object level problem is that sometimes your AI will assign your P-properties to atoms and quantum fields ("What they want is to obey the laws of physics. What they believe is their local state."), or your individual cells, etc.

If you're talking about AI learning morals, my idea is not about that. Not about modeling desires and beliefs.

The meta level problem is that trying to get the AI to assign properties in a human-approved way is a complicated problem that you can only do so well without communicating with humans. (John Wentworth disagrees more or less, check out things tagged Natural Abstractions for more reading, but also try not to get too confirmation-biased.)

I disagree too, but in a slightly different way. IIRC, John says approximately the following:

  1. All reasoning systems converge on the same space of abstractions. This space of abstractions is the best way to model the universe.
  2. In this space of abstractions it's easy to find the abstraction corresponding to e.g. real diamonds.

I think (1) doesn't need to be true. I say:

  1. By default, humans only care about things they can easily interact with in humanly comprehensible ways. "Things which are easy to interact with in humanly comprehensible ways" should have a simple definition.
  2. Among all "things which are easy to interact with in humanly comprehensible ways", it's easy to find the abstraction corresponding to e.g. real diamonds.