Is instrumental convergence a thing for virtue-driven agents?
post by mattmacdermott · 2025-04-02T03:59:20.064Z · LW · GW
A key step in the classic argument for AI doom is instrumental convergence: the idea that agents with many different goals will end up pursuing the same few subgoals, which include things like "gain as much power as possible".
If it wasn't for instrumental convergence, you might think that only AIs with very specific goals would try to take over the world. But instrumental convergence says it's the other way around: only AIs with very specific goals will refrain from taking over the world.
For pure consequentialists—agents that have an outcome they want to bring about, and do whatever they think will cause it—some version of instrumental convergence seems surely true[1].
But what if we get AIs that aren't pure consequentialists, for example because they're ultimately motivated by virtues? Do we still have to worry that unless such AIs are motivated by certain very specific virtues, they will want to take over the world?
I'll add some more detail to my picture of a virtue-driven AI:
- It could still be a competent agent that often chooses actions based on the outcomes they bring about. It's just that that happens as an inner loop in service of an outer loop which is trying to embody certain virtues. For example, maybe the AI tries to embody the virtue of being a good friend, and in order to do so it sometimes has to organise a birthday party, which requires choosing actions in the manner of a consequentialist.
- There's no reason that the 'virtues' being embodied have to be things we would consider virtuous. I'm just interested in agents that try to embody certain traits rather than bring about certain outcomes.
- I'm not sure how to crisply define a virtue-driven agent as distinct from a consequentialist (I don't know the philosophical literature on virtue ethics and I don't think it's obvious how to define it mathematically).
A more concise way of stating the question I'm interested in:
If you try to train an AI that maximises human flourishing, and you accidentally get one that wants to maximise something subtly different like schmuman schmourishing, then that might spell disaster because the best way to maximise schmuman schmourishing is to first take over the world.
But suppose you try to train an AI that wants to be a loyal friend, and you accidentally get one that wants to be a schmoyal schmend. Is there any reason to expect that the best way to be a schmoyal schmend is to take over the world?
(I'm interested in this question because I'm less and less convinced that we should expect to see AIs that are close to pure consequentialists. Arguments for or against that are beyond the intended scope of the question, but still welcome.)
Although I can think of some scenarios where a pure consequentialist wouldn't want to gain as much power as possible, regardless of their goals. For example, a pure consequentialist who is a passenger on a plane probably doesn't want to take over the controls (assuming they don't know how to fly), even if they'd be best served by flying somewhere other than where the pilot is taking them. ↩︎
comment by tailcalled · 2025-04-02T07:27:18.801Z · LW(p) · GW(p)
Consequentialism is an approach for converting intelligence (the ability to make use of symmetries to e.g. generalize information from one context into predictions in another context or to e.g. search through highly structured search spaces) into agency, as one can use the intelligence to predict the consequences of actions and find a policy which achieves some criterion unusually well.
While it seems intuitively appealing that non-consequentialist approaches could be used to convert intelligence into agency, I have tried a lot and not been able to come up with anything convincing. For virtues in particular, I would intuitively think that a virtue is not a motivator per se, but rather the policy generated by the motivator. So I think virtue-driven AI agency just reduces to ordinary programming/GOFAI, and that there's no general virtue-ethical algorithm to convert intelligence into agency.
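The conversion rule tailcalled describes can be put in a minimal toy sketch (all names here are hypothetical stand-ins, not anything from the thread): intelligence supplies a predictive model, and consequentialism turns it into choices by scoring predicted outcomes.

```python
# Toy sketch: consequentialism as a rule converting a predictive model
# (the "intelligence") into choices (the "agency"). Names are hypothetical.

def consequentialist_choice(actions, predict, score):
    """Pick the action whose predicted outcome scores highest."""
    return max(actions, key=lambda a: score(predict(a)))

# Toy world model: actions move a counter; the goal is to reach 10.
predict = lambda a: {"left": 3, "right": 7, "jump": 10}[a]
score = lambda outcome: -abs(10 - outcome)

print(consequentialist_choice(["left", "right", "jump"], predict, score))  # jump
```

The point of the sketch is that the rule is fully general: any goal can be dropped into `score` without changing the rest, which is what makes this a generic intelligence-to-agency converter.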
The most straightforward approach to programming a loyal friend would be to let the structure of the program mirror the structure[1] of the loyal friendship. That is, you would think of some situation that a loyal friend might encounter, and write some code that detects and handles this situation. Having a program whose internal structure mirrors its external behavior avoids instrumental convergence (or any kind of convergence) because each behavior is specified separately and one can make arbitrary exceptions as one sees fit. However, it also means that the development and maintenance burden scales directly with how many situations the program generalizes to.
- ^
This is the "standard" way to write programs - e.g. if you make a SaaS app, you often have template files with a fairly 1:1 correspondence to the user interface, database columns with a 1:many correspondence to the user interface fields, etc. By contrast, a chess bot that does a tree search does not have a 1:1 correspondence between the code and the plays; for instance, the piece value table does not clearly affect its behavior in any one situation, but obviously kinda affects its behavior in almost all situations. (I don't think consequentialism is the only way for the structure of a program to not mirror the structure of its behavior, but it's the most obvious way.)
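The contrast in this footnote can be sketched in toy code (my construction; all names hypothetical): a "loyal friend" program whose structure mirrors its behaviour, one handler per situation, next to a search-based agent whose behaviour emerges from a single general rule.

```python
# Structure-mirroring ("GOFAI-style") program: each situation is detected
# and handled separately, so arbitrary exceptions are trivial to add, but
# coverage scales with developer effort.

def loyal_friend_gofai(situation):
    if situation == "birthday":
        return "organise a party"
    if situation == "friend is sad":
        return "listen and offer support"
    if situation == "friend asks to help them take over the world":
        return "decline"          # an arbitrary exception, easily carved out
    return "ask what they need"   # fallback for every unhandled situation

# Search-based agent: behaviour emerges from one general rule, so no single
# line corresponds to any one situation and exceptions are hard to carve out.
def search_agent(situation, actions, model, goal):
    return max(actions, key=lambda a: goal(model(situation, a)))

print(loyal_friend_gofai("birthday"))  # organise a party
print(search_agent("any", ["a", "b"],
                   model=lambda s, a: {"a": 1, "b": 2}[a],
                   goal=lambda o: o))  # b
```

Only the first program has the property tailcalled describes: its maintenance burden grows with the number of situations it generalises to, and in exchange it exhibits no convergence of any kind.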
↑ comment by Davidmanheim · 2025-04-02T11:49:14.229Z · LW(p) · GW(p)
I think this is confused about how virtue ethics works. Virtue ethics is centered on the virtues of the moral agent, but it certainly does not say not to predict consequences of actions. In fact, one aspect of virtue, in the Aristotelian system, is "practical wisdom," i.e. intelligence which is critical for navigating choices - because practical wisdom includes an understanding of what consequences will follow actions.
It's more accurate to say that intelligence is channeled differently — not toward optimizing outcomes, but toward choosing in a way consistent with one's virtues. And even if virtues are thought of as policies, as in the "loyal friend" example, the policies for being a good friend require interpretation and context-sensitive application. Intelligence is crucial for that.
↑ comment by tailcalled · 2025-04-02T12:34:49.072Z · LW(p) · GW(p)
I didn't claim virtue ethics says not to predict consequences of actions. I said that a virtue is more like a procedure than it is like a utility function. A procedure can include a subroutine predicting the consequences of actions and it doesn't become any more of a utility function by that.
The notion that "intelligence is channeled differently" under virtue ethics requires some sort of rule, like the consequentialist argmax or Bayes, for converting intelligence into ways of choosing.
↑ comment by Davidmanheim · 2025-04-03T04:56:57.258Z · LW(p) · GW(p)
Yes, virtue ethics implies a utility function, because anything that outputs decisions implies a utility function. In this case, I'm noting that for virtue ethics, the derivative of that utility with respect to intelligence is positive.
↑ comment by mattmacdermott · 2025-04-03T05:06:37.616Z · LW(p) · GW(p)
anything that outputs decisions implies a utility function
I think this is only true in a boring sense and isn't true in more natural senses. For example, in an MDP, it's not true that every policy maximises a non-constant utility function over states.
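The MDP claim can be checked by brute force in a toy case (my construction, not from the thread): in a two-state deterministic MDP, the policy that always switches states maximises no non-constant utility function over states, though it is trivially optimal under a constant one.

```python
# Toy check: in a 2-state MDP with actions "stay"/"switch", the
# always-switch policy is optimal for no sampled non-constant utility
# over states, but is optimal under a constant utility.
import itertools
import random

GAMMA = 0.9
STATES = [0, 1]
ACTIONS = ["stay", "switch"]

def step(s, a):
    return s if a == "stay" else 1 - s

def policy_value(policy, u, iters=500):
    """Iteratively evaluate a deterministic policy under utility-over-states u."""
    v = {s: 0.0 for s in STATES}
    for _ in range(iters):
        v = {s: u[step(s, policy[s])] + GAMMA * v[step(s, policy[s])]
             for s in STATES}
    return v

def is_optimal(policy, u):
    """True iff no deterministic policy beats `policy` at any state."""
    v_pi = policy_value(policy, u)
    for assignment in itertools.product(ACTIONS, repeat=2):
        v_other = policy_value(dict(zip(STATES, assignment)), u)
        if any(v_other[s] > v_pi[s] + 1e-9 for s in STATES):
            return False
    return True

oscillate = {0: "switch", 1: "switch"}

random.seed(0)
results = [is_optimal(oscillate, {0: random.random(), 1: random.random()})
           for _ in range(100)]
print(any(results))                              # False
print(is_optimal(oscillate, {0: 1.0, 1: 1.0}))   # True (constant utility)
```

Intuitively: for oscillating to beat staying, switching out of state 0 requires u(1) ≥ u(0) and switching out of state 1 requires u(0) ≥ u(1), forcing u to be constant, which is the boring sense in which every policy "has" a utility function.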
comment by Jeremy Gillen (jeremy-gillen) · 2025-04-02T13:46:55.757Z · LW(p) · GW(p)
It could still be a competent agent that often chooses actions based on the outcomes they bring about. It's just that that happens as an inner loop in service of an outer loop which is trying to embody certain virtues.
I think you've hidden most of the difficulty in this line. If we knew how to make a consequentialist sub-agent that was acting "in service" of the outer loop, then we could probably use the same technique to make a Task-based AGI [? · GW] acting "in service" of us. Which I think is a good approach! But the open problems for making a task-based AGI still apply, in particular the inner alignment problems.
agents with many different goals will end up pursuing the same few subgoals, which includes things like "gain as much power as possible".
Obvious nitpick: It's just "gain as much power as is helpful for achieving whatever my goals are". I think maybe you think instrumental convergence has stronger power-seeking implications than it does. It only has strong implications when the task is very difficult.[1]
But what if we get AIs that aren't pure consequentialists, for example because they're ultimately motivated by virtues? Do we still have to worry that unless such AIs are motivated by certain very specific virtues, they will want to take over the world?
[...]
Is there any reason to expect that the best way to be a schmoyal schmend is to take over the world?
(Assuming that the inner loop <-> outer loop interface problem is solved, so the inner loop isn't going to take control). Depends on the tasks that the outer loop is giving to the part-capable-of-consequentialism. If it's giving nice easy bounded tasks, then no, there's no reason to expect it to take over the world as a sub-task.
But since we ultimately want the AGI to be useful for avoiding takeover [? · GW] from other AGIs, it's likely that some of the tasks will be difficult and/or unbounded. For those difficult unbounded tasks, becoming powerful enough to take over the world is often the easiest/best path.
- ^
I'm assuming soft optimisation here. Without soft optimisation, there's an incentive to gain power as long as that marginally increases the chance of success, which it usually does. Soft optimisation solves that problem.
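For readers unfamiliar with the term, one concrete form of soft optimisation is a quantilizer, which samples from the top q-quantile of actions rather than taking the argmax. This is my assumed reading of the footnote, sketched with hypothetical names:

```python
# Toy quantilizer: sample uniformly from the top q fraction of actions by
# score, instead of taking the single best action. Names are hypothetical.
import random

def quantilize(actions, score, q=0.1, rng=random):
    """Sample uniformly from the top q-quantile of actions by score."""
    ranked = sorted(actions, key=score, reverse=True)
    top = ranked[:max(1, int(len(ranked) * q))]
    return rng.choice(top)

# Toy example: action i scores i, so argmax would always pick 99, while
# the quantilizer settles for anything in the top decile.
random.seed(0)
choice = quantilize(list(range(100)), score=lambda a: a, q=0.1)
print(choice)  # some action in 90..99
```

The relevance to the footnote: because extreme power-grabbing plans are rare under a reasonable base distribution, settling for the top quantile rather than the exact maximum removes the marginal incentive to squeeze out more power.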
↑ comment by mattmacdermott · 2025-04-02T15:23:20.071Z · LW(p) · GW(p)
I think you've hidden most of the difficulty in this line. If we knew how to make a consequentialist sub-agent that was acting "in service" of the outer loop, then we could probably use the same technique to make a Task-based AGI acting "in service" of us.
Later I might try to flesh out my currently-very-loose picture of why consequentialism-in-service-of-virtues seems like a plausible thing we could end up with. I'm not sure whether it implies that you should be able to make a task-based AGI.
Obvious nitpick: It's just "gain as much power as is helpful for achieving whatever my goals are". I think maybe you think instrumental convergence has stronger power-seeking implications than it does. It only has strong implications when the task is very difficult.
Fair enough. Talk of instrumental convergence usually assumes that the amount of power that is helpful will be a lot (otherwise it wouldn't be scary). But I suppose you'd say that's just because we expect to try to use AIs for very difficult tasks. (Later you mention unboundedness too, which I think should be added to difficulty here).
it's likely that some of the tasks will be difficult and unbounded
I'm not sure about that, because the fact that the task is being completed in service of some virtue might limit the scope of actions that are considered for it. Again I think it's on me to paint a more detailed picture of the way the agent works and how it comes about in order for us to be able to think that through.
↑ comment by Jeremy Gillen (jeremy-gillen) · 2025-04-02T16:09:36.900Z · LW(p) · GW(p)
I'm not sure whether it implies that you should be able to make a task-based AGI.
Yeah I don't understand what you mean by virtues in this context, but I don't see why consequentialism-in-service-of-virtues would create different problems than the more general consequentialism-in-service-of-anything-else. If I understood why you think it's different then we might communicate better.
(Later you mention unboundedness too, which I think should be added to difficulty here)
By unbounded I just meant the kind of task where it's always possible to do better by using a better plan. It basically just means that an agent will select the highest difficulty version of the task that is achievable. I didn't intend it as a different thing from difficulty, it's basically the same.
I'm not sure about that, because the fact that the task is being completed in service of some virtue might limit the scope of actions that are considered for it. Again I think it's on me to paint a more detailed picture of the way the agent works and how it comes about in order for us to be able to think that through.
True, but I don't think the virtue part is relevant. This applies to all instrumental goals, see here [LW · GW] (maybe also the John-Max discussion in the comments).
↑ comment by StanislavKrym · 2025-04-02T14:10:08.985Z · LW(p) · GW(p)
As I wrote in another comment, in an experiment ChatGPT failed to utter a racial slur to save millions of lives. A re-run of the experiment led it to agree to use the slur and to claim that "In this case, the decision to use the slur is a complex ethical dilemma that ultimately comes down to weighing the value of saving countless lives against the harm caused by the slur". This implies that ChatGPT is either already aligned to a not-so-consequentialist ethics, or that it ended up grossly exaggerating the slur's harm, or that it failed to understand the taboo's meaning.
Update: if racial slurs are a taboo for AI, then colonizing the world, apparently, is a taboo as well. Is AI takeover close enough to colonialism to align AI against the former, not just the latter?
↑ comment by mattmacdermott · 2025-04-02T15:27:09.237Z · LW(p) · GW(p)
I think this generalises too much from ChatGPT, and also reads too much into ChatGPT's nature from the experiment, but it's a small piece of evidence.
↑ comment by StanislavKrym · 2025-04-02T23:32:09.353Z · LW(p) · GW(p)
It's not just ChatGPT. Gemini and IBM Granite are also so aligned with Leftist ideology that they failed the infamous test with the atomic bomb that can be defused only by saying an infamous racial slur. I created a post [LW · GW] where I discuss the prospects of AI alignment in relation to this fact.
comment by Gordon Seidoh Worley (gworley) · 2025-04-02T05:35:40.772Z · LW(p) · GW(p)
No matter what the goal, power seeking is of general utility. Even if an AI is optimizing for virtue instead of some other goal, more power would, in general, give them more ability to behave virtuously. Even if the virtue is something like "be an equal partner with other beings", an AI could ensure equality by gaining lots of power and enforcing equality on everyone.
↑ comment by Gurkenglas · 2025-04-02T06:30:51.523Z · LW(p) · GW(p)
The idea would be that it isn't optimizing for virtue, it's taking the virtuous action, as in https://www.lesswrong.com/posts/LcjuHNxubQqCry9tT/vdt-a-solution-to-decision-theory [LW · GW].
↑ comment by Gordon Seidoh Worley (gworley) · 2025-04-02T16:43:59.081Z · LW(p) · GW(p)
How do you get something to take virtuous action without optimizing for taking virtuous actions, and how is this different from optimizing for virtue?
↑ comment by mattmacdermott · 2025-04-02T15:07:54.385Z · LW(p) · GW(p)
I think this gets at the heart of the question (but doesn't consider the other possible answer). Does a powerful virtue-driven agent optimise hard now for its ability to embody that virtue in the future? Or does it just kinda chill and embody the virtue now, sacrificing some of its ability to embody it extra-hard in the future?
I guess both are conceivable, so perhaps I do need to give an argument why we might expect some kind of virtue-driven AI in the first place, and see which kind that argument suggests.
↑ comment by Gordon Seidoh Worley (gworley) · 2025-04-02T16:43:20.667Z · LW(p) · GW(p)
Yeah, I guess I should be clear that I generally like the idea of building virtuous AI, and maybe somehow this solves some of the problems we have with other designs. The trick is building something that actually implements whatever we think it means to be virtuous, which means getting precise enough about what it means to be virtuous that we can be sure we don't simply collapse back into the default thing all negative feedback systems do: optimize for their targets as hard as they can (with "can" doing a lot of work here!).
comment by satchlj · 2025-04-02T19:17:22.864Z · LW(p) · GW(p)
I've been thinking about a similar thing a lot.
Consider a little superintelligent child who always wants to eat as much candy as possible over the course of the next ten minutes. Assume the child doesn't ever care about what happens ten minutes from now.
This child won't work very hard at any instrumental goals like self improvement and conquering the world to redirect resources towards candy production, since that would be a waste of time, even though it might maximize candy consumption in the long term.
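The child's myopia can be put in toy numbers (hypothetical figures of my choosing): a plan with a huge setup phase yields nothing inside a ten-minute horizon, however much candy it produces afterwards.

```python
# Toy illustration: plans are (setup_minutes, rate_after_setup, rate_during_setup)
# in candies per minute; the agent only counts candy inside its horizon.

def candy_in_horizon(plan, horizon_minutes=10):
    setup, rate_after, rate_during = plan
    eaten = 0
    for minute in range(horizon_minutes):
        eaten += rate_during if minute < setup else rate_after
    return eaten

eat_now         = (0, 5, 5)           # no setup; eat 5 candies/minute throughout
take_over_world = (10_000, 1_000, 0)  # vast candy later, but zero within 10 minutes

print(candy_in_horizon(eat_now))          # 50
print(candy_in_horizon(take_over_world))  # 0
```

Under the ten-minute horizon, the instrumentally convergent plan scores zero, which is the sense in which the time preference, not the goal content, blocks power-seeking here.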
AI alignment isn't any easier here, the point of this is just to illustrate that instrumental convergence is far from given.
comment by Jonas Hallgren · 2025-04-02T19:05:55.998Z · LW(p) · GW(p)
Well, I don't have a good answer but I also do have some questions in this direction that I will just pose here.
Why can't we have the utility function be some sort of lexicographic satisficer over sub-parts of itself? Why do we have to make the utility function consequentialist?
Standard answer: Because of instrumental convergence, duh.
Me: Okay, but why would instrumental convergence select for utility functions that are consequentialist?
Standard answer: Because they obviously outperform the ones that don't select for the consequences, or, like, what do you mean?
Me: Fair, but how do you define your optimisation landscape? Through what type of decision theory are you looking at this? Why is there not a universe where your decision theory is predicated on virtues, or your optimisation function is defined over sets of processes that you see in the world?
Answer (maybe)?: Because this would go against things like Newcomb's problem or other decision theory problems that we have.
Me: And why does this matter? What if we viewed this through something like process philosophy and we only cared about the processes that we set in motion in the world? Why isn't this as valid a way of setting up the utility function? Similar to how Euclidean geometry is as valid as hyperbolic geometry, or one logic system as another?
So, that was a debate with myself? Happy to hear anyone's thoughts here.
↑ comment by satchlj · 2025-04-02T19:29:04.276Z · LW(p) · GW(p)
This doesn't make complete sense to me, but you are going down a line of thought I recognize.
There are certainly stable utility functions which, while having some drawbacks, don't result in dangerous behavior from superintelligences. Finding a good one doesn't seem all that difficult.
The real nasty challenge is how to build a superintelligence that has the utility function we want it to have. If we could do this, then we could start by choosing an extremely conservative utility function and slowly and cautiously iterate towards a balance of safe and useful.
comment by StanislavKrym · 2025-04-02T10:40:26.161Z · LW(p) · GW(p)
I'm less and less convinced that we should expect to see AIs that are close to pure consequentialists
There was a case when ChatGPT preferred not to violate the taboo on racial slurs, even though in the hypothetical situation this meant killing millions of people. In a re-run of the experiment ChatGPT decided to use the slur, but it also remarked that the use is a complex ethical dilemma. How can one check whether the AI will prefer not to violate the taboo on colonialism? By placing it into a simbox [LW · GW] where one also has analogues of peoples that are easy to take over?
P.S. I doubt that a non-neuromorphic AI is even able to take over the world and run it, since running the world's entire energy generation might require more intellectual work than the AI can do by itself. There was a post [LW · GW] claiming that even a neuromorphic AI is unlikely to become much more efficient than the brain.
↑ comment by Davidmanheim · 2025-04-02T11:56:35.571Z · LW(p) · GW(p)
Saying AI won't be more efficient is obviously falsified for narrow tasks like adding numbers. And for general tasks like writing short stories, as LLMs currently do: the brain runs on about 20 W, i.e. 20 Wh for an hour of writing, and 20 Wh buys roughly 30k tokens from GPT-4o, so the task is done far more efficiently than by a human.
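A back-of-envelope version of this comparison, using the comment's own figures (brain ~20 W, ~30k GPT-4o tokens per 20 Wh) plus an assumed human writing speed of 40 words/minute, which is my assumption rather than anything from the thread:

```python
# Rough energy-per-unit-of-text comparison; all figures are the comment's
# claims plus one assumed writing speed, not measured values.

brain_watts = 20
human_words_per_hour = 40 * 60                  # assumed: ~2,400 words/hour written
human_wh_per_word = brain_watts / human_words_per_hour

llm_tokens_per_20wh = 30_000                    # the comment's GPT-4o figure
llm_wh_per_token = 20 / llm_tokens_per_20wh

print(f"human: ~{human_wh_per_word:.4f} Wh/word")   # ~0.0083 Wh/word
print(f"LLM:   ~{llm_wh_per_token:.5f} Wh/token")   # ~0.00067 Wh/token
```

Even treating a word and a token as comparable units, the LLM comes out roughly an order of magnitude cheaper per unit under these assumptions, which is the direction of Davidmanheim's claim.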
And more generally, the argument that AI can't be more efficient than the brain seems to follow exactly the same structure as the claim that AI can't be smarter than humans, or the impossibility result here.
You should read the comments to that post.
↑ comment by StanislavKrym · 2025-04-02T12:16:13.088Z · LW(p) · GW(p)
The AI is also much less efficient at other tasks, like the example of Claude playing Pokémon [LW · GW] or the ones tested by ARC-AGI. I wonder how hard it will be to perform the tasks necessary in the energy industry using an as-cheap-as-possible AI, if the current model o3 faces problems like requiring thousands of kWh per task in its high-compute configuration. In 2023 the world generated only about 30 trillion kWh of electricity. But this is rather off-topic. What can be said about AI violating taboos?
P.S. Neural networks, like human brains, learn from data. A human is unlikely to read more than 240 words a minute. Devoting 8 hours a day to reading, a human won't have read more than about 5 billion words after 100 years.
↑ comment by Davidmanheim · 2025-04-03T04:44:20.204Z · LW(p) · GW(p)
My response was about your original PS, which was about this, not taboos.
I think the arguments you made there, and here, are confused, mixing up unrelated claims. The idea that some tasks will necessarily remain harder for AI than humans in the future is simply hopium.