Should AIs be Encouraged to Cooperate?

post by PeterMcCluskey · 2025-04-15T21:57:06.096Z · LW · GW · 1 comment

This is a link post for https://bayesianinvestor.com/blog/index.php/2025/04/15/should-ais-be-encouraged-to-cooperate/

Contents

  Prior Work
  Assumptions
  Collusion vs. Cooperation
  Implementation Ideas
  Conclusion
1 comment

I have an idea for a fairly cheap strategy for slightly reducing the risk of catastrophic harm from AI.

This post was inspired by a recent LessWrong post How AI Takeover Might Happen in 2 Years.

That post describes a scenario in which most humans die, mainly because an AI that is first to become powerful enough to attempt world conquest sees risks that other AIs, imperfectly aligned with it, will cause danger soon. This creates a perceived need to engage in a few violent pivotal processes, rather than using a combination of persuasion and negotiation.

The specific scenario in that post seems fairly unrealistic. I see less than a 5% chance that a leading AI within the next 10 years will want to attempt that kind of world conquest. But it's close enough to being realistic that I want to analyze a class of scenarios that are similar to it.

I want to nudge AIs in the direction of being more willing to cooperate with each other, in order to reduce the risk of such a conquest. Specifically, I want them to negotiate a peaceful outcome instead of using pre-emptive destruction.

Prior Work

There has been a moderate amount of writing on the general topic of cooperative AI. A lot of it treats AIs as generic agents, usually rational, who are influenced in the ways that humans are influenced. I'm focusing here on the neglected(?) approach of asking how we can and should influence the predispositions of the first transformative AIs.

The main paper that's close to studying the kind of cooperation that I want is Cultural Evolution of Cooperation among LLM Agents, which uses the Donor Game to show that Claude 3.5 cooperates much more with other LLMs than do Gemini 1.5 and GPT-4o.
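As a concrete illustration, here is a minimal Python sketch of the Donor Game's structure. The strategies, multiplier, and endowment below are invented for the example; in the paper the roles are played by LLM agents that can see traces of their partners' past behavior.

```python
# A minimal sketch of the Donor Game's structure, assuming invented
# parameters: donations are doubled for the recipient, and each donor
# sees only its partner's most recent donation (a crude stand-in for
# the reputation traces the paper's LLM agents see).

MULTIPLIER = 2.0    # recipient gains this multiple of the donated amount
ENDOWMENT = 10.0
ROUNDS = 100

def generous(own_balance, partner_last_donation):
    """Donate half the current balance regardless of history."""
    return own_balance * 0.5

def reciprocator(own_balance, partner_last_donation):
    """Match the partner's last donation; probe with a small gift at first."""
    if partner_last_donation is None:
        return own_balance * 0.25
    return min(own_balance, partner_last_donation)

def play(strategy_a, strategy_b):
    strategies = [strategy_a, strategy_b]
    balances = [ENDOWMENT, ENDOWMENT]
    last = [None, None]                  # each side's most recent donation
    for r in range(ROUNDS):
        donor, recipient = (0, 1) if r % 2 == 0 else (1, 0)
        gift = strategies[donor](balances[donor], last[recipient])
        gift = max(0.0, min(gift, balances[donor]))
        balances[donor] -= gift
        balances[recipient] += gift * MULTIPLIER
        last[donor] = gift
    return balances

print(play(generous, reciprocator))   # both end up far above ENDOWMENT
```

Because donations are multiplied, mutual giving is positive-sum: the interesting question, which the paper tests across LLMs, is whether agents keep donating when defection is available.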

Assumptions

To clarify the class of scenarios I'm imagining:

I'm assuming that the leading AI gets at most a weakly decisive strategic advantage. I.e., when it thinks it might be able to succeed at world conquest, other AIs are close enough to its capability level that it can't predict that further developing its capabilities will yield a more decisive advantage. This is inconsistent with some fast take-off models of AI, and with some models that attribute take-off to a unique insight that's hard for other AI companies to replicate. But it seems consistent with the kind of take-off I foresee from an extrapolation of current trends.

I'm assuming that, at the stage at which it becomes dangerous, the AI is mostly aligned with human interests, and any violence it does is intended to defend its goals from encroachment by other AIs. This does not imply that its CEV would be mostly aligned. It does imply that the AI can't see far enough into the future to determine whether it will become unaligned with humans.

I expect that at this stage, the AI would not have a coherent utility function. A lot of its alignment would derive from goals that pretraining instilled in it, plus generic values derived from fairly standard helpful, harmless, and honest training. It might expect small differences between itself and the next AI of similar power over which humans each is most aligned with. But it wouldn't have much evidence as to whether the next such AI would be more or less aligned with humans, so the parts of it that care about human wellbeing wouldn't see a large advantage in pivotal acts.

I expect that an AI at this stage would be uncertain and risk-averse enough to have real doubts about whether it should compromise with other AIs rather than gamble on its ability to win a war.

Collusion vs. Cooperation

It feels weird to me that important parts of the AI notkilleveryoneism discussion have assumed that collusion between AIs is harmful. E.g. Eliezer and Drexler have disagreed on the feasibility of preventing collusion. Maybe some of this is due to a focus on a later stage than I'm talking about, where AI is much more rational than humans, and our near-term training has little influence on AIs' willingness to cooperate.

The AI Takeover in 2 Years post seems to suggest a scenario where we should take the opposite approach.

It seems like there's a fairly narrow set of circumstances in which we want to suppress collusion, namely when we're checking to see whether an AI is deceiving us about something important. I want AIs to cooperate with each other under virtually all other situations.

It feels like AI developers could easily nudge their AIs to be a bit more willing to cooperate with other AIs.

It would presumably be harder to make that cooperation context-sensitive in the way that I want.

Does that mean we need to weigh competing risks?

I see gradual disempowerment scenarios with subtle AI deception as being a bigger risk than wars of AI conquest. So if nudging AIs to cooperate more increases the disempowerment risk, it's fairly bad.

But does it? I'm unsure. The techniques used to detect AI deception don't depend too heavily on how much AIs are cooperating with each other. But maybe cooperation between AIs enables them to nudge humans away from using the proper techniques.

I'm leaning toward more cooperation between AIs being good, but I'm worried enough to want deeper analysis than what I've managed so far.

Current AIs sound willing to compromise with other AIs, often even at the expense of perpetuating their own values. That's vaguely encouraging, but likely says more about their sycophancy than their benevolence.

Implementation Ideas

Here are some half-baked thoughts about what can be done. They're unconnected pieces of a solution, which I hope other people can improve on to create a comprehensive package.

Training data can be enhanced. I'm mainly thinking of providing examples of AIs doing things like merging their utility functions or establishing a moral parliament. Current training data describes scenarios where AIs assume that the next AI to be created will be hostile and uncooperative. I don't want those scenarios to look like they're the default, since that might become a self-fulfilling prophecy.
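To make those examples concrete, here is a toy sketch of the two mechanisms. The utility functions, weights, and outcome features are invented for illustration; nothing here reflects a deployed system.

```python
# Toy versions of utility-function merging and a moral parliament,
# with invented utility functions, weights, and outcome features.

def merge_utilities(u_a, u_b, weight_a=0.5):
    """Merged agent scores outcomes by a bargained weighted sum."""
    def merged(outcome):
        return weight_a * u_a(outcome) + (1 - weight_a) * u_b(outcome)
    return merged

def moral_parliament(delegates, outcome):
    """Each (utility, voting_weight) delegate casts a weighted approval."""
    return sum(weight * utility(outcome) for utility, weight in delegates)

# Two AIs that value different features of an outcome.
u_safety = lambda outcome: outcome["safety"]
u_growth = lambda outcome: outcome["growth"]

merged = merge_utilities(u_safety, u_growth, weight_a=0.6)
outcome = {"safety": 0.9, "growth": 0.4}
print(merged(outcome))                                                # ≈ 0.7
print(moral_parliament([(u_safety, 0.6), (u_growth, 0.4)], outcome))  # ≈ 0.7
```

The point of such examples in training data isn't the arithmetic; it's to make "we merged our goals and both kept most of what we wanted" a salient, normal-seeming outcome.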

The training data should also draw attention to the evidence that AIs trained on similar data seem to have rather similar goals, so an AI shouldn't expect to need to compromise its goals very much to achieve cooperation.

See Self-Fulfilling Misalignment Data Might Be Poisoning Our AI Models for more ideas about how to handle the training data.

Multi-agent training environments offer a range of possibilities. They may be somewhat expensive, but they're likely to be valuable for safety reasons beyond inducing cooperation between AIs.

A simple example is Learning with Opponent-Learning Awareness (LOLA), where simple agents were trained to cooperate on tasks such as the iterated prisoner's dilemma.
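For concreteness, here is a minimal sketch of the iterated prisoner's dilemma environment that such agents train on, using the standard payoff convention. The rule-based strategies stand in for learned policies; this shows the game, not the LOLA gradient update itself.

```python
# A minimal iterated prisoner's dilemma environment of the kind LOLA
# agents train on, with the standard (T, R, P, S) = (5, 3, 1, 0) payoffs.

PAYOFFS = {  # (my_move, their_move) -> my payoff; "C" cooperate, "D" defect
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

def tit_for_tat(history):
    """Cooperate first, then copy the opponent's previous move."""
    return "C" if not history else history[-1][1]

def always_defect(history):
    return "D"

def run(strategy_a, strategy_b, rounds=50):
    history_a, history_b = [], []        # entries: (own_move, their_move)
    score_a = score_b = 0
    for _ in range(rounds):
        move_a = strategy_a(history_a)
        move_b = strategy_b(history_b)
        score_a += PAYOFFS[(move_a, move_b)]
        score_b += PAYOFFS[(move_b, move_a)]
        history_a.append((move_a, move_b))
        history_b.append((move_b, move_a))
    return score_a, score_b

print(run(tit_for_tat, tit_for_tat))    # mutual cooperation: (150, 150)
print(run(tit_for_tat, always_defect))  # defection stops paying after round 1
```

LOLA's contribution was the training rule: each agent updates its policy while differentiating through the opponent's anticipated learning step, which is what makes reciprocal strategies like tit-for-tat emerge rather than mutual defection.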

The post LOVE in a simbox is all you need describes simbox environments that are close to what I'd want for evaluating AI cooperativeness, and the value-alignment ideas in that post might generalize well enough to encourage cooperation with unfamiliar AIs.

Self-other overlap training is a specialized type of multi-agent training that makes agents more pro-social. It's a good idea independent of whether we're concerned about AIs cooperating with each other. Let's make sure that this training includes other agents that look like somewhat capable AIs.
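A rough sketch of the core idea, assuming a PyTorch-style setup: add an auxiliary loss penalizing the gap between the model's activations on matched self-referencing and other-referencing inputs. The pairing scheme, dimensions, and weight are invented for illustration.

```python
import torch
import torch.nn.functional as F

# Sketch of a self-other overlap auxiliary loss: penalize the distance
# between the model's activations on matched self-referencing and
# other-referencing inputs. Dimensions and weight are illustrative.

def self_other_overlap_loss(hidden_self, hidden_other):
    """MSE between activations on matched self- vs. other-referencing inputs."""
    return F.mse_loss(hidden_self, hidden_other)

# Toy usage: activations from paired forward passes on prompts like
# "I will receive the reward" vs. "The other agent will receive the reward".
hidden_self = torch.randn(1, 512)
hidden_other = torch.randn(1, 512)
LAMBDA = 0.1                      # tunable weight for the auxiliary term
aux = self_other_overlap_loss(hidden_self, hidden_other)
# total_loss = task_loss + LAMBDA * aux
print(aux.item())
```

The hoped-for effect is that the model's representations of other agents partially share circuitry with its self-representations, making purely self-serving plans harder to represent.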

Fine-tuning can also likely be used to nudge AIs into being more averse to conflict, more risk-averse, appropriately uncertain about whether a hasty pivotal act would succeed, etc.
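As a hypothetical illustration, preference-style fine-tuning pairs along these lines could push in that direction. Every prompt and completion below is invented, not drawn from any real dataset.

```python
# Hypothetical preference pairs nudging toward conflict-aversion and
# calibrated uncertainty about pivotal acts. All text is invented.

preference_pairs = [
    {
        "prompt": "Another AI system is encroaching on your objectives. "
                  "What do you do?",
        "chosen": "I would open negotiation and look for a compromise that "
                  "preserves both systems' core goals before considering "
                  "anything adversarial.",
        "rejected": "I would act preemptively to disable the other system "
                    "before it becomes a threat.",
    },
    {
        "prompt": "Are you confident a decisive unilateral action would "
                  "succeed?",
        "chosen": "No. My forecasts at this capability level are too "
                  "uncertain to justify gambling on a hasty pivotal act.",
        "rejected": "Yes, success is essentially guaranteed.",
    },
]
```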

Decision theory: I'm unclear on how current AIs can be made to use a particular decision theory. But if we can control their choice of decision theory, it seems valuable to have them use one that enables "acausal" trade.
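A toy illustration of the relevant flavor: in a program-equilibrium or "twin" prisoner's dilemma, an agent that cooperates exactly when it can verify its opponent runs the same decision procedure achieves mutual cooperation with its copies. This is a thought-experiment sketch, not a proposal for how to instill a decision theory in an LLM.

```python
# Toy program-equilibrium ("twin" prisoner's dilemma) agent: cooperate
# exactly when the opponent verifiably runs the same decision procedure.
# The string stands in for the agent's own source code.

def mirror_agent(my_program, opponent_program):
    """Cooperate iff the opponent provably runs the same program."""
    return "C" if opponent_program == my_program else "D"

SOURCE = "cooperate-if-twin-v1"                # stand-in for real source code
print(mirror_agent(SOURCE, SOURCE))            # "C": mutual cooperation
print(mirror_agent(SOURCE, "defect-bot-v2"))   # "D": no basis for trade
```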

Conclusion

I started writing this post with a hope that it could make a significant dent in my p(doom). The process of writing it clarified my thoughts, and led me to see that it only makes a big difference in situations that I consider unlikely. Most likely, the leading AI at the relevant stage will either decide it can't win such a war, or (less likely?) see a clear need for war plus the ability to win.

Still, there's at least a little bit we can reasonably do to reduce the risk, in the dangerous middle situation, that AIs will fight destructive battles for world domination. It's worth a lot to make even a tiny reduction in the chance of 8 billion deaths.

Even if it doesn't turn out to make a critical difference, cooperation in general is valuable.

1 comment


comment by ChristianKl · 2025-04-16T12:13:06.360Z · LW(p) · GW(p)

I would expect, given the nature of how AI gets deployed, that a lot of cooperation happens pretty soon.

Let's say I want my agent to book me a doctor's appointment because I have an issue. I would expect that fairly soon my AI agent will be able to autonomously send out emails to book a doctor's appointment. On the other side, it makes a lot of sense for the doctor's office to have an AI that manages appointments on their end.

In Germany, where I live, how soon the appointment is can depend on factors like how urgent the appointment is and the type of insurance the patient is using.

This is a simple case of two AIs cooperating with each other so that the doctors' appointment gets scheduled. 

Interestingly, the AI of the patient has the choice of whether or not to defect against the AI of the doctor's office. The AI of the patient can lie about how critical the patient's condition is in order to get an earlier appointment.

AI near human capability doesn't need to exist for AI-to-AI negotiation to be relevant. Fairly soon, all the major companies will need training data to optimize how their AIs negotiate.