Instrumental Goals Are A Different And Friendlier Kind Of Thing Than Terminal Goals
post by johnswentworth, David Lorell · 2025-01-24T20:20:28.881Z · LW · GW · 34 comments
The Cake
Imagine that I want to bake a chocolate cake, and my sole goal in my entire lightcone and extended mathematical universe is to bake that cake. I care about nothing else. If the oven ends up a molten pile of metal ten minutes after the cake is done, if the leftover eggs are shattered and the leftover milk spilled, that’s fine. Baking that cake is my terminal goal.
In the process of baking the cake, I check my fridge and cupboard for ingredients. I have milk and eggs and flour, but no cocoa powder. Guess I’ll have to acquire some cocoa powder! Acquiring the cocoa powder is an instrumental goal: I care about it exactly insofar as it helps me bake the cake.
My cocoa acquisition subquest is a very different kind of goal than my cake baking quest. If the oven ends up a molten pile of metal shortly after the cocoa is acquired, if I shatter the eggs or spill the milk in my rush to the supermarket, then that’s a problem - a molten oven or shattered eggs or spilled milk would make it harder for me to bake the cake! More generally, in the process of acquiring cocoa powder, I want to not mess up other things which are helpful for making the cake. Unlike my terminal goal of baking a cake, my instrumental goal of acquiring cocoa powder comes with a bunch of implicit constraints about not making other instrumental subgoals much harder.
(If you’re already thinking “hmm, that sounds kinda like corrigibility [? · GW]”, then you have the right idea and that is indeed where we’re going with this.)
Generalizable takeaway: unlike terminal goals, instrumental goals come with a bunch of implicit constraints about not making other instrumental subgoals much harder.
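To make the takeaway slightly more concrete, here is a minimal formalization sketch (my own illustrative notation, not anything from the post; U, g_i, c_j, and λ are placeholder symbols): a terminal goal is a bare maximization, while an instrumental subgoal implicitly charges for any increase in the difficulty of its sibling subgoals.

```latex
% Terminal goal: just maximize the objective; nothing else in the lightcone matters.
a^{*} \;=\; \arg\max_{a}\; U\big(\mathrm{outcome}(a)\big)

% Instrumental subgoal g_i (one piece of a larger plan): achieve g_i, but pay for
% any increase in the remaining cost c_j of each sibling subgoal g_j.
a^{*}_{i} \;=\; \arg\max_{a}\; \Big[\, \mathrm{success}_{g_i}(a)
    \;-\; \lambda \sum_{j \neq i} \big( c_j(s_{\text{after }a}) - c_j(s_{\text{now}}) \big) \Big]
```

In the cake example, the c_j's cover things like "cost of still having a working oven" and "cost of still having unbroken eggs"; the λ-term is exactly the implicit constraint described above.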
The Restaurant
Now imagine that I’m working as a chef in a big restaurant. My terminal goal is the restaurant’s long-term success; I care about nothing else. If the bombs drop, so long as the restaurant is still doing good business afterwards, I’ll be happy.
One day, a customer orders a fresh chocolate cake, and it falls to me to bake it. Now baking the cake is an instrumental goal.
One key difference from the previous example: in the restaurant, I don’t know all the things which future customers will order. I don’t know exactly which ingredients or tools will be needed tomorrow. So, in the process of baking the cake, I want to avoid wasting ingredients or destroying tools which might be useful for any of the dishes which future customers might order. My instrumental goal of baking a cake comes with a bunch of implicit constraints about not-making-harder a whole distribution of potential future instrumental subgoals.
Another key difference from the previous example: now there are multiple chefs, multiple subagents working on different instrumental subgoals. As part of the implicit constraints on my cake-baking, I need to not make their instrumental subgoals more difficult. And that notably brings in lots of informational constraints. For instance, if I use some eggs, I need to either put the rest of the eggs back in a location predictable to the other chefs, or I need to communicate to the other chefs where I left the eggs, so that they don’t have to spend time searching for the eggs later. So my instrumental goal of baking a cake comes with a bunch of constraints about being predictable to others, and/or making information about what I’m doing visible to others.
Generalizable takeaway: unlike terminal goals, instrumental goals come with implicit constraints about being predictable, making information about what one is doing visible, and not-making-harder a whole broad distribution of other possible instrumental goals.
… and now this sounds a lot like corrigibility.
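Extending the sketch from the previous section (again, purely illustrative notation): with unknown future orders and multiple chefs, the penalty becomes an expectation over a distribution D of possible future subgoals, plus a term for how costly the agent's behavior is for teammates to predict or discover.

```latex
% Restaurant version: future subgoals g are drawn from a distribution D, and other
% chefs pay a cost when my actions are hard to predict or to find out about.
a^{*}_{i} \;=\; \arg\max_{a}\; \Big[\, \mathrm{success}_{g_i}(a)
    \;-\; \lambda\, \mathbb{E}_{g \sim D}\big[\, c_g(s_{\text{after }a}) - c_g(s_{\text{now}}) \,\big]
    \;-\; \mu\, \mathrm{illegibility}(a) \Big]
```

The eggs example is the illegibility term: putting the eggs back somewhere predictable, or telling the other chefs where they are, keeps that term small.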
Happy Instrumental Convergence?
Still sticking to the restaurant example: presumably many different instrumental goals in the restaurant require clean plates, empty counter space, and money. Those are all convergently instrumentally-useful resources within the restaurant.
Now, the way you might be used to thinking about instrumental convergence is roughly: “For lots of different goals in the restaurant, I need clean plates, empty counter space, and money. So, I might as well seize a bunch of those things upfront. Sure, that’ll screw over the other chefs, but I don’t care about that.” And that is how the reasoning might go if baking this one cake were a terminal goal.
But instrumental goals are different. If I’m the chef baking the cake as an instrumental goal, I instead reason: “For lots of different goals in the restaurant, a chef needs clean plates, empty counter space, and money. So, I should generally make sure those things are readily available to my fellow chefs as much as possible, so that they’ll be able to solve their problems for our shared terminal goal. I’ll avoid using the resources up, and even make more of them available (by e.g. cleaning a counter top) whenever I have a relative advantage in doing so.”
I want to emphasize that this sort of reasoning should require no “special sauce”. It’s just a natural, implicit part of instrumental goals, as opposed to terminal goals.
One more interesting thing to highlight: so far, insofar as instrumental goals are corrigible, we've only talked about them being corrigible toward other instrumental subgoals of the same shared terminal goal. The chef pursuing the restaurant's success might be perfectly fine screwing over e.g. a random taxi driver in another city. But instrumental convergence potentially points towards general corrigibility.
Suppose, in the restaurant example, that clean plates, empty counter space, and money are the only significant convergently instrumental goals. Then (in the restaurant environment) we get a natural notion of general corrigibility: if I just “try not to step on the toes” of instrumentally-convergent subgoals, then that will mostly keep me from stepping on the toes of most subgoals pursued by other restaurant-denizens, regardless of what our top-level goals are. The same strategy works for many different top-level goals in this restaurant, so it’s a generally corrigible strategy.
More generally, if I track instrumentally-convergent subgoals throughout the whole world, and generally "avoid stepping on the toes" of any of them... that would be a generally corrigible strategy.
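As a toy illustration of what "not stepping on the toes of instrumentally-convergent subgoals" could look like as a scoring rule (my own sketch, closer in spirit to the attainable-utility-preservation-style penalties mentioned in the comments below than to anything the authors commit to; all resource names and numbers are invented):

```python
# Toy sketch: score candidate ways of completing an instrumental task by how much
# they preserve convergently-useful shared resources (clean plates, counter space,
# money). All names and numbers here are made up for illustration.

RESOURCES = {"clean_plates": 10, "counter_space": 4, "money": 100}

# A stand-in for "the broad distribution of subgoals other chefs might have":
# each potential task needs some minimum level of each listed resource.
POSSIBLE_TASKS = [
    {"clean_plates": 2, "counter_space": 1},              # plate a dessert
    {"counter_space": 2},                                 # prep vegetables
    {"clean_plates": 1, "counter_space": 1, "money": 30}, # order more flour
    {"clean_plates": 4, "counter_space": 2},              # plate a banquet course
]

def attainable(resources):
    """How many of the possible tasks remain feasible with these resources?"""
    return sum(all(resources[r] >= need for r, need in task.items())
               for task in POSSIBLE_TASKS)

def score(task_reward, resources_after, lam=1.0):
    """Reward for the instrumental task, minus a penalty for any drop in how many
    other potential subgoals remain attainable afterward."""
    return task_reward - lam * max(0, attainable(RESOURCES) - attainable(resources_after))

# Two ways to bake the same cake: hoard everything vs. clean up after yourself.
hoarding = {"clean_plates": 3, "counter_space": 0, "money": 40}
tidy     = {"clean_plates": 9, "counter_space": 3, "money": 70}

print(score(task_reward=5.0, resources_after=hoarding))  # 1.0: hoarding made every other task infeasible
print(score(task_reward=5.0, resources_after=tidy))      # 5.0: the tidy plan leaves them all attainable
```

Dropping the task-reward term and keeping only the preservation term is one crude reading of the "no terminal goal at the top" idea discussed in the next section.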
And that unlocks the natural next jump.
All The Way Up
The natural next jump: do we even need the terminal goal at all? What if a mind’s top-level goals were the same “kind of thing” as instrumental goals more generally? Indeed, in some ways that would be a very natural structure for a general-purpose [LW · GW] mind; it needs the ability to recursively handle instrumental subgoals anyway, so why have a whole extra different kind of goal at the top?
So long as instrumental convergence kicks in hard enough in the global environment, the mind can “try not to step on the toes” of instrumentally-convergent subgoals, and then that will mostly keep it from stepping on the toes of most other people's subgoals, regardless of the original terminal goal. So to build a generally corrigible system, we can imagine just dropping terminal goals altogether, and aim for an agent which is 'just' corrigible toward instrumentally-convergent subgoals.
For AI purposes, this would be a much safer kind of agent. It would be an AI which naturally tries not to “step on other agents’ toes”, naturally behaves such that it doesn’t get in the way of other agents’ goals (and in particular humans’ goals). But unlike e.g. naive formulations of “low-impact” agents, such an AI would also actively try to behave in ways predictable and legible to other agents, and make sure that other agents can easily query information about its own behavior.
In short, it sounds like all the properties of corrigibility we always hoped for, all coming from a single coherent underlying concept (i.e. not thrown together ad-hoc), and therefore likely to also generalize in ways we like to properties we haven’t yet thought to ask for.
Research Threads
This concept of corrigibility immediately suggests lots of research approaches.
First, on the theory side, there’s the problem of fleshing out exactly what the “type signature” of an instrumental goal is, with all those implicit constraints. The main way one would tackle this problem (a toy sketch of the first few bullets follows the list) would be:
- Pick some class of optimization problems, and a way to break it into apparent “subproblems”.
- Work through some examples to check that the sort of phenomena we’re interested in actually do show up for that class of optimization problems and notion of “subproblems”.
- Explicitly spell out the “implicit constraints” of the subproblems in this formulation.
- Repeat for other formulations, and look for the generalizable patterns in how the implicit constraints of subproblems are naturally represented. Operationalize those patterns.
- Look for positive arguments that this operationalization of the relevant patterns is “the unique right way” to formulate things - like e.g. derivations from some simple desiderata, mediation in some class of algorithms, etc.
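Here is the toy sketch promised above: a deliberately tiny instance of the first three bullets (entirely my own construction, not the authors' formalism). Treat subproblems as read/write operations on state variables; the "implicit constraints" of each subproblem then fall out as the variables that other subproblems still need to read.

```python
# Toy class of optimization problems: finish a set of kitchen subtasks, each of
# which reads some state variables and writes others. The "implicit constraints"
# on a subtask are: don't clobber variables that other subtasks still read.

SUBTASKS = {
    "acquire_cocoa": {"reads": {"money", "car_works"},       "writes": {"have_cocoa", "money"}},
    "mix_batter":    {"reads": {"have_cocoa", "eggs_intact"}, "writes": {"have_batter"}},
    "bake":          {"reads": {"have_batter", "oven_works"}, "writes": {"have_cake"}},
}

def implicit_constraints(subtask, all_subtasks):
    """Variables this subtask must not degrade: everything some *other* subtask
    still needs to read, minus the things this subtask is supposed to produce."""
    needed_elsewhere = set()
    for name, spec in all_subtasks.items():
        if name != subtask:
            needed_elsewhere |= spec["reads"]
    return needed_elsewhere - all_subtasks[subtask]["writes"]

for name in SUBTASKS:
    print(name, "->", sorted(implicit_constraints(name, SUBTASKS)))
# e.g. acquire_cocoa -> ['eggs_intact', 'have_batter', 'oven_works']
# i.e. while getting cocoa, don't break the eggs or the oven -- the cake example above.
```

The interesting theory questions start where this toy stops: what happens when the read-sets aren't known in advance (the restaurant's unknown future orders), and whether the resulting constraint structure looks the same across very different problem classes (the last two bullets).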
On the empirical side, one could try clever ways of training instrumental rather than terminal goals into a system. For instance, the restaurant example suggests training a system to work with many instances of itself or other systems in order to solve top-level goals in a reasonably general environment. Then, y’know… see what happens.
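A rough sketch of that empirical setup (my own framing; every name here is hypothetical): several copies of one policy share an environment and receive only a shared, top-level reward, and the probe afterward is whether each copy spontaneously avoids degrading resources its teammates still need.

```python
# Sketch of the experimental knobs and the post-training probe, not a full
# training loop. All names and numbers are invented for illustration.
from dataclasses import dataclass

@dataclass
class SharedGoalTrainingConfig:
    n_agent_copies: int = 4           # many instances of the same system
    shared_reward_only: bool = True   # no per-agent shaping toward "tidiness"
    task_distribution: str = "random_orders"  # varied top-level goals per episode

def toe_stepping(teammate_success_without_me: float,
                 teammate_success_with_me: float) -> float:
    """Evaluation probe: how much did my presence reduce teammates' task success?
    Near zero suggests the trained policy behaves corrigibly toward their subgoals."""
    return max(0.0, teammate_success_without_me - teammate_success_with_me)

print(toe_stepping(0.9, 0.85))  # ~0.05: mild interference
```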
34 comments
comment by Max Harms (max-harms) · 2025-01-25T00:17:15.086Z · LW(p) · GW(p)
Some of this seems right to me, but the general points seem wrong. I agree that insofar as a subprocess resembles an agent, there will be a natural pressure for it to resemble a corrigible agent. Pursuit of e.g. money is all well and good until it stomps the original ends it was supposed to serve -- this is akin to a corrigibility failure. The terminal-goal seeking cognition needs to be able to abort, modify, and avoid babysitting its subcognition.
One immediate thing to flag is that when you start talking about chefs in the restaurant, those other chefs are working towards the same overall ends. And the point about predictability and visibility only applies to them. Indeed, we don't really need the notion of instrumentality here -- I expect two agents that know each other to be working towards the same ends to naturally want to coordinate, including by making their actions legible to each other.
One more interesting thing to highlight: so far, insofar as instrumental goals are corrigible, we've only talked about them being corrigible toward other instrumental subgoals of the same shared terminal goal. The chef pursuing the restaurant's success might be perfectly fine screwing over e.g. a random taxi driver in another city. But instrumental convergence potentially points towards general corrigibility.
This is, I think, the cruxy part of this essay. Knowing that an agent won't want to build incorrigible limbs (and so we should expect corrigibility as a natural property of agentic limbs) isn't very important. What's important is whether we can build an AI that's more like a limb, or that we expect to gravitate in that direction, even as it becomes vastly more powerful than the supervising process.
(Side note: I do wish you'd talked a bit about a restaurant owner, in your metaphor; having an overall cognition that's steering the chefs towards the terminal ends is a natural part of the story, and if you deny the restaurant has to have an owner, I think that's a big enough move that I want you to spell it out more.)
So to build a generally corrigible system, we can imagine just dropping terminal goals altogether, and aim for an agent which is 'just' corrigible toward instrumentally-convergent subgoals.
I predict such an agent is relatively easy to make, and will convert the universe into batteries/black holes, computers, and robots. I fail to see why it would respect agents with other terminal goals.
But perhaps you mean you want to set up an agent which is serving the terminal goals of others? (The nearest person? The aggregate will of the collective? The collective will of the non-anthropomorphic universe?) If it has money in its pocket, do I get to spend that money? Why? Why not expect that in the process of this agent getting good at doing things, it learns to guard its resources from pesky monkeys in the environment? In general I feel like you've just gestured at the problem in a vague way without proposing anything that looks to me like a solution. :\
↑ comment by tailcalled · 2025-01-25T12:34:56.920Z · LW(p) · GW(p)
Pursuit of money is an extremely special instrumental goal whose properties you shouldn't generalize to other goals in your theory of instrumental convergence. (And I could imagine it should be narrowed down further, e.g. into those who want to support the state vs those who want money by whichever means including scamming the state.)
↑ comment by Max Harms (max-harms) · 2025-01-25T18:23:27.970Z · LW(p) · GW(p)
Not convinced it's relevant, but I'm happy to change it to:
If it has matter and/or energy in its pocket, do I get to use that matter and/or energy?
↑ comment by tailcalled · 2025-01-25T20:08:41.675Z · LW(p) · GW(p)
Generally you wouldn't since it's busy using that matter/energy for whatever you asked it to do. If you wanted to use it, presumably you could turn down its intensity, or maybe it exposes some simplified summary that it uses to coordinate economies of scale.
comment by Lucius Bushnaq (Lblack) · 2025-01-24T20:43:35.459Z · LW(p) · GW(p)
On first read the very rough idea of it sounds ... maybe right? It seems to perhaps actually centrally engage with the source of my mind's intuition that something like corrigibility ought to exist?
Wow.
I'd love to get a spot check for flaws from a veteran of the MIRI corrigibility trenches.
comment by Garrett Baker (D0TheMath) · 2025-01-24T23:50:44.589Z · LW(p) · GW(p)
I will note this sounds a lot like Turntrout's old Attainable Utility Preservation [? · GW] scheme. Not exactly, but enough that I wouldn't be surprised if a bunch of the math here has already been worked out by him (and possibly, in the comments, a bunch of the failure-modes identified).
comment by mattmacdermott · 2025-01-24T21:00:01.335Z · LW(p) · GW(p)
For some reason I've been muttering the phrase, "instrumental goals all the way up" to myself for about a year, so I'm glad somebody's come up with an idea to attach it to.
comment by mattmacdermott · 2025-01-24T20:54:44.040Z · LW(p) · GW(p)
One time I was camping in the woods with some friends. We were sat around the fire in the middle of the night, listening to the sound of the woods, when one of my friends got out a bluetooth speaker and started playing donk at full volume (donk is a kind of funny, somewhat obnoxious style of dance music).
I strongly felt that this was a bad bad bad thing to be doing, and was basically pleading with my friend to turn it off. Everyone else thought it was funny and that I was being a bit dramatic -- there was nobody around for hundreds of metres, so we weren't disturbing anyone.
I think my friends felt that because we were away from people, we weren't "stepping on the toes of any instrumentally convergent subgoals" with our noise pollution. Whereas I had the vague feeling that we were disturbing all these squirrels and pigeons or whatever that were probably sleeping in the trees, so we were "stepping on the toes of instrumentally convergent subgoals" to an awful degree.
Which is all to say, for happy instrumental convergence to be good news for other agents in your vicinity, it seems like you probably do still need to care about those agents for some reason?
↑ comment by Lucius Bushnaq (Lblack) · 2025-01-24T21:06:02.482Z · LW(p) · GW(p)
Yes, I don't think this will let you get away with no specification bits in goal space at the top level like John's phrasing might suggest. But it may let you get away with much less precision?
The things we care about aren't convergent instrumental goals for all terminal goals, the kitchen chef's constraints aren't doing that much to keep the kitchen liveable to cockroaches. But it seems to me that this maybe does gesture at a method to get away with pointing at a broad region of goal space instead of a near-pointlike region.
comment by Algon · 2025-01-25T10:08:37.282Z · LW(p) · GW(p)
I was thinking about this a while back, as I was reading some comments by @tailcalled [LW · GW] where they pointed out this possibility of a "natural impact measure" when agents make plans. This relied on some sort of natural modularity in the world, and in plans, such that you can make plans by manipulating pieces of the world which don't have side-effects leaking out to the rest of the world. But thinking through some examples didn't convince me that was the case.
Though admittedly, all I was doing was recursively splitting my instrumental goals into instrumental sub-goals and checking if they wound up seeming like natural abstractions. If they had, perhaps that would reflect an underlying modularity in plan-making in this world that is likely to be goal-independent. They didn't, so I got more pessimistic about this endeavour. Though writing this comment out, it doesn't seem like those examples I worked through are much evidence. So maybe this is more likely to work than I thought.
↑ comment by tailcalled · 2025-01-25T11:50:02.319Z · LW(p) · GW(p)
What I eventually realized is that this line of argument is a perfect rebuttal of the whole mesa-optimization neurosis that has popped up, but it doesn't actually give us AI safety because it completely breaks down once you apply it to e.g. law enforcement or warfare.
↑ comment by Algon · 2025-01-25T14:52:36.919Z · LW(p) · GW(p)
Could you unpack both clauses of this sentence? It's not obvious to me why they are true.
↑ comment by tailcalled · 2025-01-25T17:18:21.201Z · LW(p) · GW(p)
For the former I'd need to hear your favorite argument in favor of the neurosis that inner alignment is a major problem.
For the latter, in the presence of adversaries, every subgoal has to be robust against those adversaries, which is very unfriendly.
↑ comment by Noosphere89 (sharmake-farah) · 2025-01-25T13:46:23.149Z · LW(p) · GW(p)
I agree this doesn't perfectly solve the AI safety problem, and my guess is that the reason this doesn't work for law enforcement/warfare is that the instrumental goals are adversarial, such that you are not incentivized to avoid breaking other agents' goals.
However, if something like the plan from John Wentworth's post worked, this would be a really useful way to make automated AI alignment schemes that are safe, so I do think it indirectly gives us safety.
Also, entirely removing inner alignment problems/mesa optimization should cut down doom probabilities, especially for Eliezer Yudkowsky and John Wentworth, so I'd encourage you to write up your results on that line of argument anyway.
↑ comment by tailcalled · 2025-01-25T15:49:32.641Z · LW(p) · GW(p)
However, if something like the plan from John Wentworth's post worked, this would be a really useful way to make automated AI alignment schemes that are safe, so I do think it indirectly gives us safety.
How?
Also, entirely removing inner alignment problems/mesa optimization should cut down doom probabilities, especially for Eliezer Yudkowsky and John Wentworth, so I'd encourage you to write up your results on that line of argument anyway.
I didn't really get any further than John Wentworth's post here. But also I've been a lot less spooked by LLMs than Eliezer Yudkowsky.
↑ comment by Noosphere89 (sharmake-farah) · 2025-01-25T16:02:07.631Z · LW(p) · GW(p)
How?
Basically, because you can safely get highly capable AIs to work on long and confusing problems without worrying that they'd eventually takeover and kill everyone, and this includes all plans for automating alignment.
Also, a crux here is I expect automating alignment research to be way less adversarial than fields like law enforcement/warfare, because you are facing way less opposition to your goals.
↑ comment by tailcalled · 2025-01-25T16:23:25.817Z · LW(p) · GW(p)
If you want AIs to produce a lot of text on AI alignment and moral philosophy, you can already do that now without worrying that the AIs in question will take over the world.
If you want to figure out how to achieve good results when making the AI handle various human conflicts, you can't really know how to adapt and improve it without actually involving it in those conflicts.
↑ comment by Noosphere89 (sharmake-farah) · 2025-01-25T16:29:21.277Z · LW(p) · GW(p)
Ok, the key point I want to keep in mind is that for the purposes of AI alignment, we don't really need to solve most human conflicts, other than internally generated ones, because the traditional alignment problem is aligning an AI to a single human, so most of the political conflicts do not actually matter here.
↑ comment by tailcalled · 2025-01-25T16:33:14.278Z · LW(p) · GW(p)
Most places where AI or alignment are applied are more convoluted cases where lots of people are involved. It's generally not economically feasible to develop AGI for a single person, so it doesn't really happen.
↑ comment by Noosphere89 (sharmake-farah) · 2025-01-25T16:37:06.205Z · LW(p) · GW(p)
Agree with this, but the point here is that a single person (or at least a small set of people) has control over AI values by default, such that the AI is aligned to them personally and essentially treats other people according to the instructions/wishes of that single person or small set of people. That was my point in claiming that most conflicts don't matter: they have a resolution procedure that is very simple to implement.
↑ comment by tailcalled · 2025-01-25T16:40:42.778Z · LW(p) · GW(p)
I don't think the people who develop AGI have clear or coherent wishes for how the AGI should treat most other people.
↑ comment by Noosphere89 (sharmake-farah) · 2025-01-25T16:50:34.328Z · LW(p) · GW(p)
Agree with this, but two things:
- I expect people to develop clearer and more coherent wishes once they actually realize that they might have nation-state level power.
- Most versions of incoherent/unclear wishes for other humans do not result in existential catastrophe, relative to other failure modes for AI safety.
↑ comment by tailcalled · 2025-01-25T17:09:31.327Z · LW(p) · GW(p)
I don't really understand how you expect this line of thought to play out. Are you arguing e.g. Sam Altman would start using OpenAI to enforce his own personal moral opinions, even when they are extremely unpopular?
↑ comment by Noosphere89 (sharmake-farah) · 2025-01-25T17:22:52.392Z · LW(p) · GW(p)
This definitely can happen, though I'd argue in practice it wouldn't go as far as enforcing his own opinions by force. To get back to what I wanted to argue, my point here is that instrumental goals lead to corrigibility, and that in practice we will have instruction-following AGIs/ASIs rather than value-aligned AGIs/ASIs:
https://www.lesswrong.com/posts/7NvKrqoQgJkZJmcuD/instruction-following-agi-is-easier-and-more-likely-than [LW · GW]
↑ comment by tailcalled · 2025-01-25T17:27:45.988Z · LW(p) · GW(p)
I don't understand your whole end-to-end point, like how does this connect to making AIs produce texts on alignment, and how does that lead to a pivotal act?
↑ comment by Noosphere89 (sharmake-farah) · 2025-01-25T17:38:31.618Z · LW(p) · GW(p)
I don't understand your whole end-to-end point, like how does this connect to making AIs produce texts on alignment, and how does that lead to a pivotal act?
The key property is that we can reasonably trust their research not to have adversarial backdoors, so we can let our guard down quite a lot. The pivotal act I usually envision has to do with automating the R&D pipeline, which then leads to automating the alignment pipeline, which leads to existential safety.
Note this doesn't look like a pivotal act, and that's not coincidental: real-life heroism doesn't look like bombast or using hard power. It looks like making a process more efficient (as with the Green Revolution), or preventing backfire risks that would make the situation worse.
↑ comment by tailcalled · 2025-01-25T17:41:00.636Z · LW(p) · GW(p)
I'm not interested in your key property, I'm interested in a more proper end-to-end description. Like superficially this just sounds like it immediately runs into the failure mode John Wentworth described last time [LW · GW], but your description is kind of too vague to say for sure.
↑ comment by Noosphere89 (sharmake-farah) · 2025-01-25T18:20:15.878Z · LW(p) · GW(p)
I have to agree with the comment below by Matt Levinson that at least 3 of the specific failure modes described in the post can't be solved by any AI safety agenda, because they rely on the assumption that people will use the agenda, so there's no reason to consider them. Having read the discourse on that post, I think the main ways I disagree with John Wentworth are that I'm much more optimistic in general about verification, and do not find his view that verification is no easier than generation plausible at all, which makes me more optimistic about something like a market of ideas for AI alignment working; and I think bureaucracies in general are way better than John Wentworth seems to imply.
This is also related to the experiment John did on whether markets reliably solve hard problems instead of goodharting (the air conditioner test). My takeaway is that markets are actually sometimes good at optimizing things, and people just don't appreciate the economic/computational constraints on why something is the way it is.
Comments below:
https://www.lesswrong.com/posts/8wBN8cdNAv3c7vt6p/the-case-against-ai-control-research#FembwXfYSwnwxzWbC [LW(p) · GW(p)]
https://www.lesswrong.com/posts/5re4KgMoNXHFyLq8N/air-conditioner-test-results-and-discussion#maJBX3zAEtx5gFcBG [LW(p) · GW(p)]
https://www.lesswrong.com/s/TLSzP4xP42PPBctgw/p/3gAccKDW6nRKFumpP#g4N9Pdj8mQioRe43q [? · GW]
The posts I disagree with:
https://www.lesswrong.com/s/TLSzP4xP42PPBctgw/p/3gAccKDW6nRKFumpP [? · GW]
https://www.lesswrong.com/posts/2PDC69DDJuAx6GANa/verification-is-not-easier-than-generation-in-general [LW · GW]
https://www.lesswrong.com/posts/MMAK6eeMCH3JGuqeZ/everything-i-need-to-know-about-takeoff-speeds-i-learned [LW · GW]
https://www.lesswrong.com/posts/hsqKp56whpPEQns3Z/why-large-bureaucratic-organizations [LW · GW]
(For the bureaucratic organizations point, I think the big reason bureaucracy looks the way it does is a combination of needing very strongly to avoid corruption/bad states, so that simple, verifiable rules are best, together with the world giving us problems that are hard to solve but easy to verify, plus humans needing to coordinate.)
So I'm much less worried about slop than John Wentworth is.
↑ comment by tailcalled · 2025-01-25T20:03:55.034Z · LW(p) · GW(p)
If you're assuming that verification is easier than generation, you're pretty much a non-player when it comes to alignment.
↑ comment by Nathan Helm-Burger (nathan-helm-burger) · 2025-01-25T18:30:51.203Z · LW(p) · GW(p)
My new concept for "pivotal act that stops the world from getting to ASI, even though we get to AGI" is a soft-power act of better coordination. Get help from AGI to design and deploy decentralized governance tech that allows humanity (and AIs) to coordinate on escaping the trap of suicide-race.
↑ comment by tailcalled · 2025-01-25T20:06:09.583Z · LW(p) · GW(p)
Once you start getting involved with governance, you're going to need law enforcement and defense, which is an adversarial context and thus means the whole instrumental-goal niceness argument collapses.
comment by Daniel C (harper-owen) · 2025-01-25T03:11:39.072Z · LW(p) · GW(p)
I think one pattern which needs to hold in the environment [LW · GW] in order for subgoal corrigibility to make sense is that the world is modular, but that modularity structure can be broken or changed
For one, modularity is the main thing that enables general purpose search [LW · GW]: If we can optimize for a goal by just optimizing for a few instrumental subgoals while ignoring the influence of pretty much everything else, then that reflects some degree of modularity in the problem space
Secondly, if the modularity structure of the environment stays constant no matter what (e.g. we can represent it as a fixed causal DAG), then there would be no need to "respect modularity", because any action we take would preserve the modularity of the environment by default (given our assumption); we would only need to worry about side effects if there's at least a possibility for those side effects to break or change the modularity of the problem space, and that means the modularity structure of the problem space is a thing that can be broken or changed
Example of the modularity structure of the environment changing: most objects in the world pretty much only have direct influence on other objects nearby, and we can break or change that modularity structure by moving objects to different positions. In particular, the positions are the variables which determine the modularity of "which objects influence which other objects", and the way that we "break" the modularity structure between the objects is by intervening on those variables.
So we know that "subgoal corrigibility" requires the environment to be modular, but with a modularity structure that can be broken or changed. If this is true, then the modularity structure of the environment can be tracked by a set of "second-order" variables such as position which tell us "what things influence what other things" (in particular, these second-order variables themselves might satisfy some sort of modularity structure that can be changed, and we may have third-order variables that track the modularity structure of the second-order variables). The way that we "respect the modularity" of other instrumental subgoals is by preserving these second-order variables that track the modularity structure of the problem space.
For instance, we get to break down the goal of baking a cake into instrumental subgoals such as acquiring cocoa powder (while ignoring most other things) if and only if a particular modularity structure of the problem space holds (e.g. the other equipment is all in the right place & positions), and there is a set of variables that track that modularity structure (the conditions & positions of the equipment). The way we preserve that modularity structure is by preserving those variables (the conditions & positions of the equipment).
Given this, we might want to model the world in a way that explicitly represents variables that track the modularity of other variables, so that we get to preserve influence over those variables (and therefore the modularity structure that GPS relies on)
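A minimal sketch of the "second-order variables" idea above (my own toy model, not Daniel C's formalism; all object names and numbers are made up): the influence structure among objects is itself a function of position variables, so intervening on positions rewires the influence graph, while leaving them alone preserves the modularity that planning relied on.

```python
# Toy model: objects influence each other only when they are close together, so the
# position variables are "second-order": they determine which first-order influences
# exist. Moving an object rewires the influence graph; leaving positions alone
# preserves the modularity structure that planning relied on.

positions = {"oven": 0.0, "eggs": 1.0, "mixer": 1.5, "cocoa_shelf": 8.0}

def influence_graph(pos, radius=2.0):
    """Edges between objects that are close enough to directly affect each other."""
    objs = list(pos)
    return {(a, b) for a in objs for b in objs
            if a != b and abs(pos[a] - pos[b]) <= radius}

before = influence_graph(positions)

# An action that changes only first-order variables (e.g. heating the oven) leaves
# the graph alone; an action that moves the eggs across the kitchen changes the
# second-order variables and hence the modularity structure.
moved = dict(positions, eggs=7.5)
after = influence_graph(moved)

print(sorted(before - after))  # influences the move destroyed (oven/mixer <-> eggs)
print(sorted(after - before))  # influences the move created (eggs <-> cocoa_shelf)
```

In this picture, "respecting modularity" is just keeping the second-order variables (here, positions) inside the region where the influence graph that other subplans assumed remains valid.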
comment by Knight Lee (Max Lee) · 2025-01-26T01:48:17.984Z · LW(p) · GW(p)
Instrumental goal competition explains the cake
My worry is that instrumental subgoals are safer not because they are automatically safer, but because higher goals (which generate instrumental subgoals) tend to generate multiple instrumental subgoals, none of which is important enough to steamroll the others. This seems to explain the cake example.
If you want instrumental goals all the way up, it means you want to repeatedly convert the highest goal into an instrumental subgoal of an even higher goal, which in turn will generate many other instrumental subgoals to compete with it for importance.
I'm not sure, but it looks like the only reason this should work is if the AGI/ASI has so many competing goals that being good to humans has some weight. This is similar to Multi-Objective Homeostasis [LW · GW].
Goal Reductionism
I guess another way this may work is if the AGI/ASI itself isn't sure why it's doing something: we can teach it to think that its behaviours are instrumental subgoals of some higher purpose, which it itself can't be sure about.
This is related to Goal Reductionism [LW · GW].
I feel that Self-Other Overlap: A Neglected Approach to AI Alignment [LW · GW] also fits the theme of the chef and restaurant example, and may help with Goal Reductionism.