Making AIs less likely to be spiteful

post by Nicolas Macé (NicolasMace), Anthony DiGiovanni (antimonyanthony), JesseClifton · 2023-09-26T14:12:06.202Z · LW · GW · 4 comments

Contents

        Key takeaways 
  How spite might exacerbate catastrophic risks from AI
  How spite might arise in AI systems
  Interventions to prevent spite
    When can we shape a misaligned AI’s goals?
      High path-dependence
      Low path-dependence
  Other conflict-prone preferences
  Appendix: Notes on spite in humans 
  Acknowledgements 
  References
4 comments

Which forms of misalignment might result in particularly bad outcomes? And to what extent can we prevent them even if we fail at intent alignment? We define spite as a terminal preference for frustrating others’ preferences, at least under some conditions. Reducing the chances that an AI system is spiteful is a candidate class of interventions for reducing risks of AGI conflict [? · GW], as well as risks from malevolence [EA · GW]. This post summarizes some of our thinking on the topic. We give an overview of why spite might lead to catastrophic conflict; how we might intervene to reduce it; ways in which the intervention could fail to be impactful, or have negative impact; and things we could learn that would update us on the value of this intervention.

Key takeaways 

  1. Spiteful preferences include a generalized preference for harming others, as well as other preferences like vengefulness and spite towards certain groups. The basic reason to focus on reducing spite is that such interventions may stably make AIs less likely to take risks of mutually costly conflict (or deliberately create suffering because they intrinsically value it), even if alignment fails. (more [LW · GW])
  2. Spite might be selected for in ML systems because (a) it serves as a strategically valuable commitment device, (b) it is a direct proxy for high-scoring behavior in environments where the optimal behavior involves harming other agents (e.g., environments with competition between agents), (c) it is (correctly or incorrectly) inferred from human preferences, or (d) it results from miscellaneous generalization failures. (more [LW · GW])
  3. Thus potentially low-cost interventions to reduce the chances of spite include modifications to the training data or loss function to reduce selection pressure towards spite (e.g., avoiding selecting agents based on relative performance in multi-agent tasks, or filtering human feedback that could select for spite). (more [LW · GW])
    1. Reducing spite carries some prima facie backfire risk, via potentially increasing the exploitability of AIs that share human values to some extent. We currently don’t think that this consideration makes the sign of spite reduction negative, however. One reason is that, for interventions to backfire by making our AI more exploitable, they have to change other agents’ beliefs about how spiteful our AI is, and there are various reasons to doubt this will happen. 
    2. Interventions to reduce spite seem most likely to be counterfactual in worlds where alignment fails. However, it’s currently very unclear to us how an AI’s goals will relate to design features we can intervene on (e.g., training environments), conditional on alignment failure. 
      1. If we are in a world in which inner objectives are highly path-dependent [AF · GW], we need it to be the case that (i) we can reliably influence motivations formed early in training and that (ii) these motivations reliably influence the agent’s final goals. (more [LW · GW])
      2. If we are in a world in which inner objectives are not very highly path-dependent [AF · GW], we need it to be the case that deceptive alignment isn’t favored by models’ inductive biases, and that the agent’s inner objective generalizes off-distribution as intended. (more [LW · GW])
         

Our future work: We are not confident that it will be possible to reduce spite in misaligned AIs. However, (i) based on our reading of the alignment literature and conversations with alignment researchers, it doesn’t seem strongly ruled out by the current understanding of alignment, and (ii) there may be spite interventions that are particularly cheap and easy to persuade AI developers to implement. Thus we think it’s worthwhile to continue to devote some of our portfolio to spite for now. 

Our conceptual work on this topic will likely include:

Here are things that we could learn (both through our own research and outside research findings) that would update us on the value of more work in this area: 

How spite might exacerbate catastrophic risks from AI

We’ll use this definition of spite: 

Spite (informal): An agent is spiteful if, at least under some conditions, they are intrinsically motivated to frustrate other agents’ preferences.   

Examples include: 

(See the appendix [LW · GW] for more notes on spiteful behavior in humans, including possible evolutionary mechanisms and analogs they might have in ML training.)  

There are two mechanisms by which spite might exacerbate risks of catastrophic outcomes from AI: 

  1. Making agents more likely to engage in (costly-to-them) conflict that leads to destruction of value rather than making a peaceful settlement. Conflict between AIs can significantly reduce the value of the future when at least one of the AIs involved is sufficiently intent-aligned (such that the AIs don’t simply ignore human preferences regardless of the conflict). As discussed in this post [LW · GW], costly conflict happens due to a risk-reward tradeoff that agents make under uncertainty about their counterpart’s bargaining policy. Spite can exacerbate conflict by making the costs of conflict lower and/or the rewards of winning a conflict higher. (Note however that spite does not by definition increase the expected costs of conflict.)
    1. Specifically, since a spiteful agent intrinsically values the frustration of their counterpart’s preferences, they find it less subjectively costly when a conflict destroys material resources for both agents. And similarly, they find it more rewarding when they win a conflict because the counterpart ends up with a lower share of the resources.
    2. Similarly, spite could make aligned AIs less likely to cooperate with each other to form alliances that can deter misaligned AIs.
  2. Making agents terminally value harming other agents under some conditions.
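
To make the first mechanism concrete, here is a minimal toy sketch in Python. It is our own illustrative model, not one taken from the literature cited in this post. It assumes that a spiteful agent's subjective utility is its own payoff minus a hypothetical `spite` weight times its counterpart's payoff, that conflict destroys a fixed fraction of a unit pie, and that the winner takes whatever survives. All function and parameter names are hypothetical. The sketch leaves out the uncertainty about the counterpart's bargaining policy that drives conflict in the first place; it only shows how spite shrinks the set of peaceful settlements both sides would accept.

```python
def subjective_utility(own: float, other: float, spite: float) -> float:
    """Assumed utility form: own payoff minus spite-weighted payoff of the counterpart."""
    return own - spite * other


def min_acceptable_share(p_win: float, destroyed: float, spite: float) -> float:
    """Smallest share x of a unit pie the agent prefers to fighting.

    Conflict destroys a fraction `destroyed` of the pie; the winner takes the rest.
    """
    surviving = 1.0 - destroyed
    # Expected subjective utility of conflict (win with probability p_win).
    conflict_value = (
        p_win * subjective_utility(surviving, 0.0, spite)
        + (1.0 - p_win) * subjective_utility(0.0, surviving, spite)
    )
    # Peaceful split gives x - spite * (1 - x); solve for the x matching conflict_value.
    return (conflict_value + spite) / (1.0 + spite)


def bargaining_range(p_win: float, destroyed: float, spite_a: float, spite_b: float) -> float:
    """Size of the set of splits that both agents prefer to conflict."""
    share_a = min_acceptable_share(p_win, destroyed, spite_a)
    share_b = min_acceptable_share(1.0 - p_win, destroyed, spite_b)
    return 1.0 - share_a - share_b


if __name__ == "__main__":
    for s in (0.0, 0.5, 1.0):
        print(f"spite={s}: range={bargaining_range(0.5, 0.2, s, s):.3f}")
```

Under these assumptions, with evenly matched agents and 20% of resources destroyed by conflict, the range of mutually acceptable splits is about 0.2 with no spite, about 0.067 when both agents have a spite weight of 0.5, and zero when the weight is 1 (the interaction becomes effectively zero-sum). A narrower range leaves less room for error when agents are uncertain about each other's demands.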

The reason for focusing on spite is that preferences are likely to be stickier than other aspects of an agent’s policy (because agents will generally work to preserve their preferences). Thus even if alignment fails [AF · GW], if we are able to exert some stable influence on the spitefulness of agents’ preferences, we might be able to indefinitely reduce the chances that they engage in destructive conflict, or cause harmful outcomes that they intrinsically value. (We discuss stability more below [LW · GW].)

How spite might arise in AI systems

We are concerned that AI systems might acquire a spiteful inner objective [? · GW]. We have thought about four mechanisms by which spiteful inner objectives might be selected for in a system trained with ML.  

Interventions to prevent spite

The above mechanisms suggest a few ways that we can modify AI training to reduce the chances that AI systems are spiteful. 

A few problems need to be solved to develop good interventions here: 

We think reducing spitefulness is moderately more likely to be net-positive than net-negative. The main positive intended effects were discussed above [LW · GW]. A potential backfire risk is that aligned AIs who are perceived as less spiteful are also likely to be offered worse deals by (misaligned) competitors. However, we expect this effect to be relatively small due to other agents systematically underreacting to increases in spite — that is, their behavior not depending much on (and in particular, not being deterred from conflict by) our interventions on our AI’s degree of spite. This is because other agents may either lack information about the factors determining the agent’s preferences, or have commitments that are upstream of any interventions on an agent’s spitefulness. Also, if we are correct in thinking that these interventions are most likely to be counterfactual in cases where alignment fails, then it seems more likely that we will be intervening on misaligned AIs, in which case this backfire risk doesn’t apply.

An important step in developing interventions would be to develop measures for evaluating conflict-relevant aspects of AI systems’ behavior.

When can we shape a misaligned AI’s goals?

As we said above, the reason for focusing on preferences that are conducive to catastrophic AI risk is that preferences are more likely than other aspects of an agent’s policy to remain stable over time. But it is prima facie difficult to shape an AI’s preferences at all conditional on it eventually being misaligned. Here we summarize some considerations bearing on whether we can predictably influence agents’ goals under “low alignment power”,[5] and what we might learn that could update us on the feasibility of doing so.   

Given that we think measures to prevent spite are most likely to make a difference in worlds where alignment fails, let’s assume a scenario similar to that in Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover [LW · GW]: some rudimentary safety measures may be in place, but developers lack the motivation and/or ability to reliably prevent the AI from acquiring a misaligned goal, “playing the training game”, and taking over. Then, following this [AF · GW] Hubinger post, let’s look at how an AI’s preferences might be formed in “high-” and “low path-dependence worlds”. 

High path-dependence

In the high path-dependence world, things happen in this order:

  1. The AI’s policy is initially made up of proxy motivations, which develop over time and eventually allow it to get high reward.
  2. At some point, some of these proxies “crystallize” into a relatively coherent goal.
  3. The AI has a good understanding of the training process at this point, and works out that the best thing to do to fulfill this goal is to act in a way that gets it high training reward, so that it can eventually get out into the real world.

We are trying to shape the training process such that the proxies, and therefore (hopefully) the goal, are less spiteful. For example, we want to avoid agents learning proxies like “harm other agents under such-and-such circumstances” which might get crystallized into their final goals. For this to work, the following need to be true: 

It seems to us that the main way work directed at low alignment power + high path-dependence scenarios could be made obsolete is if the effect of any modification we could make to the training process either doesn’t affect the final goal at all or doesn’t affect it in any way we can predict. What could we learn about our ability to predictably influence an agent’s final goals that should convince us to abandon this line of work? Some ideas:  

Low path-dependence

In the low path-dependence world, all that matters are the optima of the training loss and the inductive biases of the model. In these worlds, it may still be possible for us to reduce the chances of a spiteful agent via our choice of loss function. For the purposes of this discussion, let’s say that the agent is inner-aligned to the training objective if it is intrinsically motivated to pursue some natural-to-us extrapolation of the loss function. Then we might be able to make a difference by choosing a less spite-conducive loss function, even if the loss function is outer-misaligned overall. For example, training in adversarial environments (e.g., in which the highest-scoring agent is selected) is arguably a case where inner alignment with the training objective could lead to spitefulness (cf. spite as a direct proxy for reward) [LW · GW]. And so reducing the amount of training in adversarial environments might reduce spite in low path-dependence worlds.
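
As a rough illustration of the adversarial-environment example, here is a toy Python sketch. The two-action environment and its payoff numbers are entirely hypothetical and chosen by us for illustration. Each of two agents either works (a larger gain for itself, no effect on the other) or sabotages (a smaller gain for itself, plus a loss imposed on the other). The sketch compares which action scores best under an absolute criterion (own score) and under a relative criterion (own score minus the other agent's score), the latter standing in for "select the highest-scoring agent".

```python
ACTIONS = ("work", "sabotage")

# Hypothetical payoffs: sabotage earns less for oneself but harms the other agent.
OWN_GAIN = {"work": 3.0, "sabotage": 2.0}
HARM_TO_OTHER = {"work": 0.0, "sabotage": 2.0}


def raw_scores(action_a: str, action_b: str) -> tuple[float, float]:
    """Task scores of agents A and B given both actions."""
    score_a = OWN_GAIN[action_a] - HARM_TO_OTHER[action_b]
    score_b = OWN_GAIN[action_b] - HARM_TO_OTHER[action_a]
    return score_a, score_b


def absolute_fitness(action_a: str, action_b: str) -> float:
    """Selection criterion that looks only at A's own score."""
    return raw_scores(action_a, action_b)[0]


def relative_fitness(action_a: str, action_b: str) -> float:
    """Selection criterion that looks at A's lead over B."""
    score_a, score_b = raw_scores(action_a, action_b)
    return score_a - score_b


def best_action(fitness, opponent_action: str) -> str:
    """Action maximizing the given criterion against a fixed opponent action."""
    return max(ACTIONS, key=lambda a: fitness(a, opponent_action))


if __name__ == "__main__":
    for opponent in ACTIONS:
        print(
            f"opponent={opponent}: "
            f"absolute -> {best_action(absolute_fitness, opponent)}, "
            f"relative -> {best_action(relative_fitness, opponent)}"
        )
```

With these made-up numbers, working is the best response under the absolute criterion whatever the opponent does, while sabotage is the best response under the relative criterion. An agent that is inner-aligned to the relative objective would therefore be rewarded for harming the other agent even at a cost to its own task score, which is the kind of selection pressure that the intervention aims to reduce.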

Other conflict-prone preferences

Spite isn’t the only kind of motivation that (i) plausibly exacerbates risks of AGI conflict and (ii) we could plausibly prevent even if alignment fails. Risk-seeking and excessive concern for honor or status are others, for example. We’ll likely think more about these, too.  

Appendix: Notes on spite in humans 

Here are some (not comprehensive) notes on apparent spitefulness in humans, including potential explanations for these preferences. Learning more about spite in humans might alert us to spiteful tendencies in human-provided training data, or evolutionary pressures towards spite that might have analogs in ML training.  

Acknowledgements 

Thanks to Lukas Finnveden, Daniel Kokotajlo, Maxime Riché, and Filip Sondej for feedback on this document.

References

Baumard, Nicolas. 2010. “Has Punishment Played a Role in the Evolution of Cooperation? A Critical Review.” Mind & Society 9 (2): 171–92.

Bester, Helmut, and Werner Güth. 1998. “Is Altruism Evolutionarily Stable?” Journal of Economic Behavior & Organization 34 (2): 193–209.

Bolle, Friedel. 2000. “Is Altruism Evolutionarily Stable? And Envy and Malevolence?: Remarks on Bester and Güth.” Journal of Economic Behavior & Organization 42 (1): 131–33.

Caputo, Andrea. 2013. “A Literature Review of Cognitive Biases in Negotiation Processes.” International Journal of Conflict Management 24 (4): 374–98.

Choi, Jung-Kyoo, and Samuel Bowles. 2007. “The Coevolution of Parochial Altruism and War.” Science 318 (5850): 636–40.

Dekel, Eddie, Jeffrey C. Ely, and Okan Yilankaya. 2007. “Evolution of Preferences.” The Review of Economic Studies 74 (3): 685–704.

Güth, Werner, and Hartmut Kliemt. 1998. “The Indirect Evolutionary Approach: Bridging The Gap Between Rationality And Adaptation.” Rationality And Society 10 (3): 377–99.

Heifetz, Aviad, and Ella Segev. 2004. “The Evolutionary Role of Toughness in Bargaining.” Games and Economic Behavior 49 (1): 117–34.

Heifetz, Aviad, Chris Shannon, and Yossi Spiegel. 2007. “The Dynamic Evolution of Preferences.” Economic Theory 32 (2): 251–86.

Jackson, Joshua Conrad, Virginia K. Choi, and Michele J. Gelfand. 2019. “Revenge: A Multilevel Review and Synthesis.” Annual Review of Psychology 70 (January): 319–45.

Konrad, Kai A., and Florian Morath. 2012. “Evolutionarily Stable in-Group Favoritism and out-Group Spite in Intergroup Conflict.” Journal of Theoretical Biology 306 (August): 61–67.

Marlowe, Frank W., J. Colette Berbesque, Clark Barrett, Alexander Bolyanatz, Michael Gurven, and David Tracer. 2011. “The ‘Spiteful’ Origins of Human Cooperation.” Proceedings. Biological Sciences / The Royal Society 278 (1715): 2159–64.

Meegan, Daniel V. 2010. “Zero-Sum Bias: Perceived Competition despite Unlimited Resources.” Frontiers in Psychology 1 (November): 191.

Possajennikov, Alex. 2000. “On the Evolutionary Stability of Altruistic and Spiteful Preferences.” Journal of Economic Behavior & Organization 42 (1): 125–29.

Rusch, Hannes. 2014. “The Evolutionary Interplay of Intergroup Conflict and Altruism in Humans: A Review of Parochial Altruism Theory and Prospects for Its Extension.” Proceedings. Biological Sciences / The Royal Society 281 (1794): 20141539.

Tratner, Adam, and Melissa McDonald. 2019. “Genocide and the Male Warrior Psychology.” Confronting Humanity at Its Worst: Social Psychological Perspectives on Genocide, 1.

  1. ^

    This “commitment device” explanation for why humans have certain kinds of preferences — including altruism and spite — has been studied in the literature on “indirect evolutionary game theory” (Bester and Güth 1998; Güth and Kliemt 1998; Bolle 2000; Possajennikov 2000; Heifetz and Segev 2004). Indirect evolutionary game theory models agents as rationally optimizing for their subjective preferences (their “inner objective”), and these preferences are subject to selection based on the fitness (“outer objective”) they induce when optimized.

  2. ^

    I.e., direct performance on some tasks in the given environment.

  3. ^

    “Strategic substitution” is a property of payoff functions under which increasing “input” from one player reduces the marginal benefit of “input” from the other player. For example, in a game of Chicken, the greater the chance that I Dare, the lower your marginal returns to increases in the probability that you Dare. Environments with strategic substitution can select for agents that are spiteful (Possajennikov 2000; Heifetz, Shannon, and Spiegel 2007).  

  4. ^

    It seems fairly clear that we should prevent AIs from carrying out punishments because they want to harm the transgressor per se, as opposed to because they want to deter bad behavior. Deciding what punishments are appropriate for the purposes of deterrence may be nontrivial. For example, in principle we might want to train AIs to have updateless [? · GW] punishment policies, but in practice this may be difficult (especially in the “low alignment power” regime where we think spite-reducing interventions are most likely to be counterfactual). 

  5. ^

    “Alignment power” refers to the degree of influence humans have over AIs’ behavior.

  6. ^

     We should expect that proxies based on other agents’ preferences are relatively more likely when other agents’ preferences are more directly observable. In indirect evolutionary game theory models, other-regarding preferences are not selected for if agents’ preferences are unobserved (Dekel et al. 2007).

  7. ^

     The ability to predict the stabilized goal from proxies seems to be a point of contention around shard theory, where one criticism is that even if we can shape an agent’s initial “shards”, the way these are resolved into a stable, coherent goal will be highly unpredictable (e.g., see Nate Soares’ posts here [AF · GW] and here [AF · GW]). In our view the discourse thus far seems too underdeveloped to draw strong conclusions either way.

  8. ^

    “Even though deterrence has been proposed as a major reason for the evolution of retaliatory aggression, it is unclear whether people actually take revenge with deterrence in mind (see Osgood 2017). When explicitly asked to justify their revenge, people will cite deterrence motives (Darley & Pittman 2003) and will report feeling better about revenge when it has affected a positive change (Funk et al. 2014). However, there is also evidence that these self-reports are post hoc rationalizations rather than true motives. For example, when people are calculating the appropriate severity of retributive punishment, they are more attuned to whether the punishment matches the original transgression than whether it deters future harm (Carlsmith & Darley 2008). People playing economic games will also take revenge when they know they will not encounter their partner again, which would not make sense if revenge was solely intended to deter future harm (Fehr & Gächter 2000). Interestingly, vengeful people often feel less safe from future harm than do nonvengeful people (Akın & Akın 2016), indicating that people do not commonly take revenge because they think it will protect them. This evidence suggests that, of the many proximal predictors of revenge, deterrence may be among the least influential.”

4 comments

Comments sorted by top scores.

comment by Lukas_Gloor · 2023-09-27T08:07:02.871Z · LW(p) · GW(p)

Great post; I suspect this to be one of the most tractable areas for reducing s-risks! 

I also like the appendix. Perhaps this is too obvious to state explicitly, but I think one reason why spite in AIs seems like a real concern is because it did evolve somewhat prominently in (some) humans. (And I'm sympathetic [LW(p) · GW(p)] to the shard theory approach to understanding AI motivations, so it seems to me that the human evolutionary example is relevant. (That said, it's only analogous if there's a multi-agent-competition phase in the training, which isn't the case with LLM training so far, for instance.))

comment by David Althaus (wallowinmaya) · 2023-09-29T12:19:41.868Z · LW(p) · GW(p)

Really great post! 

It’s unclear how much human psychology can inform our understanding of AI motivations and relevant interventions but it does seem relevant that spitefulness correlates highly (Moshagen et al., 2018, Table 8, N = 1,261) with several other “dark traits”, especially psychopathy (r = .74), sadism (r = .59), and Machiavellianism (r = .59). 

(Moshagen et al. (2018) therefore suggest that “[...] dark traits are specific manifestations of a general, basic dispositional behavioral tendency [...] to maximize one’s individual utility— disregarding, accepting, or malevolently provoking disutility for others—, accompanied by beliefs that serve as justifications.”)

Plausibly there are (for instance, evolutionary) reasons for why these traits correlate so strongly with each other, and perhaps better understanding them could inform interventions to reduce spite and other dark traits (cf. Lukas' comment [LW(p) · GW(p)]). 

If this is correct, we might suspect that AIs that will exhibit spiteful preferences/behavior will also tend to exhibit other dark traits (and vice versa!), which may be action guiding. (For example, interventions that make AIs less likely to be psychopathic, sadistic, Machiavellian, etc. would also make them less spiteful, at least in expectation.)

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2024-07-12T18:35:11.575Z · LW(p) · GW(p)

Thanks for doing this! I think this is a promising line of research and I look forward to seeing this agenda developed!

comment by Review Bot · 2024-07-12T20:46:03.232Z · LW(p) · GW(p)

The LessWrong Review [? · GW] runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2024. The top fifty or so posts are featured prominently on the site throughout the year.

Hopefully, the review is better than karma at judging enduring value. If we have accurate prediction markets on the review results, maybe we can have better incentives on LessWrong today. Will this post make the top fifty?