On the plausibility of a “messy” rogue AI committing human-like evil
post by Jacob Griffith (Jacob.Griffith) · 2025-03-27T18:06:45.505Z · LW · GW
There can be an inherent illogic to human evil. In AI safety, it is important to consider whether a rogue AI agent could commit this type of evil, and if so, what kind of agent might pose that risk. I draw on work from Corin Katzke and Joseph Carlsmith to explore the power-seeking tendencies of AI, focusing particularly on Carlsmith's "messy" and "clean" model of goal-directedness, to conceptualise how an AI agent might acquire power-seeking characteristics. I then look at instances of catastrophic power-seeking carried out by humans: genocides. I argue that genocide is a product of "messy" goal-directedness, and ask whether the power-seeking an AI agent might develop could take a similar form. My overall argument is that deficiencies in the cognitive architecture of AI agents may act as a gateway for such agents to commit 'human-like' evil.
Theories of agency
Corin Katzke conceptualises transformative AI in a manner which grapples with the fundamental questions of what it means to be an agent, how AI agents may develop power-seeking tendencies, and how power-seeking may become an intrinsic, overarching goal for a superintelligent AI agent. In doing so, he draws on Joseph Carlsmith's framework [AF · GW] for categorising the goal-directedness of AI agents on a scale running from "messy" to "clean". Using this model, Katzke suggests that the "cleaner" an agent is, the more likely it is to develop power-seeking attributes.
To better understand the distinction between "messy" and "clean", it is necessary to explain the role of heuristics in cognition. Heuristics are cognitive shortcuts that exploit patterns and regularities in an environment to make inferences about new situations, allowing an agent to simplify its interpretation of that environment and make decisions within it effectively. For example, humans tend to be good at reading body language to judge the mood or temperament of another person. These judgements are not calculated by analysing every single aspect of a person's appearance, but by recognising gestures and facial expressions that resemble ones witnessed in past social situations. It is by reference to such patterns and repetitions that humans draw on previous knowledge of similar situations to make inferences about new ones.
An example in an AI model would be a chess-playing agent. The agent would not process every possible move it could make when deciding on an action. Instead, it might rely on pattern recognition as a tactical heuristic. Because the neural network powering the agent has been trained on millions of chess games, it can match the current position against similar patterns from previous games and draw on the moves which led to successful outcomes in those situations. Making an inference based on the situational context – in this case the chess game scenario – allows the chess agent to make effective decisions in an achievable and timely manner.
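This heuristic style of decision making can be sketched in a few lines of code. The sketch below is purely illustrative (the class, the pattern labels, and the move names are all hypothetical): the agent does not search the game tree, it simply prefers moves that worked in previously seen positions matching the same pattern, and falls back to an arbitrary legal move when the pattern is unfamiliar.

```python
from collections import defaultdict

class HeuristicChessAgent:
    """A "messy" agent: picks moves by pattern-matching against past games,
    not by evaluating every possible move."""

    def __init__(self):
        # pattern -> {move: number of past wins after playing that move}
        self.pattern_wins = defaultdict(lambda: defaultdict(int))

    def learn(self, pattern, move, won):
        """Record the outcome of playing `move` in a position matching `pattern`."""
        if won:
            self.pattern_wins[pattern][move] += 1

    def choose(self, pattern, legal_moves):
        """Prefer the move with the best past record for this pattern;
        fall back to the first legal move when the pattern is unfamiliar."""
        seen = self.pattern_wins.get(pattern)
        if seen:
            candidates = [m for m in legal_moves if m in seen]
            if candidates:
                return max(candidates, key=lambda m: seen[m])
        return legal_moves[0]

agent = HeuristicChessAgent()
agent.learn("open-centre", "Nf3", won=True)
agent.learn("open-centre", "Nf3", won=True)
agent.learn("open-centre", "h4", won=False)
agent.choose("open-centre", ["h4", "Nf3", "d4"])  # picks "Nf3"
```

Note the domain specificity: the learned patterns mean nothing outside the chess environment, which is exactly the limitation the next section turns on.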
Heuristics are the hallmark of how we can understand a "messy" agent and its limitations vis-à-vis a "clean" agent. The key characteristic of heuristics is that they are domain-specific: the inferences a model makes from the specific context of one environment will likely not prove useful if applied to a completely new one. A "messier" agent would rely on context-specific heuristics to make decisions, and these decisions would naturally suit the situations presented in a specific domain but would not prove useful across every domain the agent might act in. Meanwhile, a hypothetically "clean" agent might avoid using heuristics altogether, instead applying general-purpose reasoning in all environments to facilitate superintelligent-level function across multiple domains. Such an agent might pursue a single goal and optimise all its decisions, across all domains, in service of that one overarching goal.
AI agents developing power-seeking characteristics
The extent to which an agent draws upon heuristics for decision making may influence whether it begins to develop power-seeking characteristics through the pursuit of instrumentally convergent goals. Nick Bostrom defines instrumentally convergent goals as:
“Instrumental goals whose attainment would increase the chances of the agent’s goal being realised for a wide range of final goals and a wide range of situations (Bostrom, 2012, p. 6)”
An instrumental goal is one which, if achieved by an agent, would likely bring about the realisation of that agent’s intrinsic (core) goal. An instrumental goal becomes convergent when, as Bostrom states, it is the means to reach many different intrinsic goals across many different situations. It has been argued that power is an instrumentally convergent goal. Power is broadly defined as the number of possible actions an agent is able to take; the more actions an agent can take, the better positioned it is to act in a way which realises its goal. Thus, the attainment of power brings about the conditions under which an agent is more likely to achieve its intrinsic goal.
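This notion of power as "the number of possible actions available" can be made concrete with a toy model (my own illustration, not drawn from Bostrom; the state graph and names are hypothetical). If states of the world are nodes and actions are edges, a crude proxy for an agent's power in a state is how many states it can reach from there within a few steps: a well-connected "hub" state confers more power than a "corner" state.

```python
def reachable(graph, start, steps):
    """Set of states reachable from `start` in at most `steps` moves."""
    frontier, seen = {start}, {start}
    for _ in range(steps):
        frontier = {nxt for s in frontier for nxt in graph.get(s, [])} - seen
        seen |= frontier
    return seen

# Hypothetical state graph: each key lists the states one action away.
graph = {
    "hub": ["a", "b", "c"],
    "a": ["d"], "b": ["d"], "c": [],
    "corner": ["a"],
}

power_hub = len(reachable(graph, "hub", 2))        # {hub, a, b, c, d} -> 5
power_corner = len(reachable(graph, "corner", 2))  # {corner, a, d} -> 3
```

On this toy measure, an agent is "power-seeking" insofar as it steers toward high-reachability states regardless of its final goal, which is the intuition behind power being instrumentally convergent.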
Because of the instrumental value of power, many intelligent agents have developed power-seeking tendencies. Wolves are power-seeking in the sense that they might seek dominance in a pack, and humans are power-seeking in the sense that people strive to become wealthy. Yet such agents tend to pursue power-seeking goals in specific environments, not across every situational context in which they find themselves. A wolf might seek dominance within a pack but will not attempt to control all resources in its ecosystem. A human may possess a goal of being wealthy but choose to divorce their partner and lose half their wealth. These contrasting actions across different domains are contradictory in relation to the achievement of a specific goal. As explained earlier, these kinds of context-specific decisions are indicative of "messy" agents.
Assuming that power is an instrumentally convergent goal, it is rational to be power-seeking as a means of achieving one's function as an agent. While a "messier" agent would likely not recognise this rationality due to its reliance on context-specific heuristics, a "cleaner" agent would be far more likely to realise the rationality of power-seeking and internalise it as a core goal. Using the chess example, a hypothetically "clean" agent with a goal of winning chess games would carry this goal into domains outside the chess environment, potentially hijacking servers in order to play as many chess games as possible.
It is essential to note that notions of "clean" and "messy" are by no means absolute. A hypothetically "clean" agent might be unconstrained by the inconsistencies and contradictions – held in place by evolutionary and cognitive limitations – which typify human action. But this is not to say that humans, or "messy" agents as a whole, have no capacity to be dangerously power-seeking. Indeed, humans probably lie somewhere along the scale where they have enough "cleanness" to be disposed toward power-seeking. Carlsmith writes [AF · GW]:
“Importantly, this sort of messy-ness is quite compatible with worries about alignment, power-seeking, etc – witness, for example, humans. (Carlsmith, 2023)”
This leads us to an important consideration now that we have laid out a framework: in what ways is human evil “messy” and what implications does this have for how we may frame the risk of a “messy” misaligned AI agent?
Genocide as messy power-seeking
Genocides are human instances of power-seeking that lead to catastrophic ends and are, ultimately, typified by illogic. Using Carlsmith's framework, we can view these events as a product of "messy" goal-directedness, characterised by contradictions between decisions made in varying domains.
The Holocaust
The Holocaust exhibited power-seeking in the sense that it enforced near-total domination over its targets: European Jewry came close to being exterminated, and millions of people were rounded up from across the European continent in a mass action to coordinate their murder. The theorist Hannah Arendt posited[1] that the totalitarian Nazi regime held a key objective of continuous expansion. As Arendt frames it, the Final Solution was the means to reach an idealised and fictitious end; the nature of totalitarianism is that this end did not exist, and what sustained the Nazi regime was not the realisation of a final goal but the constant pursuit of power – or, put differently, the pursuit of the means to reach a goal rather than of a goal itself.
The 'goal' of the Holocaust was premised on conspiracy, anchored in a racist view of Jews, Sinti and Roma, and disabled people. Because it was embedded in an ideological worldview, it was inherently irrational, and its realisation would accrue no actual benefit to the agent pursuing it. Power-seeking existed despite being illogical. The perpetrators can be seen as rational and "clean" enough to successfully oversee the massively complex task of coordinating mass murder and genocide, yet simultaneously as fundamentally contradictory and irrational precisely because they desired to carry out such a task.
The Cambodian genocide
Let us take a look at another example: the Cambodian genocide. Pol Pot and the Khmer Rouge regime took absolute control over Cambodia and attempted to completely cleanse and reset the country. They envisioned a utopian agrarian society starting again from 'year zero', one in which western influence was dismantled and traditional forms of Buddhism eradicated. In the process, thousands of people deemed enemies of the state were sent to the killing fields and murdered; whether they had actually committed the crimes they were accused of was irrelevant, since to admit guilt was to contribute to the regime's greater, overarching goal[2]. The Khmer Rouge depopulated cities and established rural communes in which the nation would farm rice for sustenance. Middle-class city dwellers were deemed internal enemies and thus given the most toiling work on the farms; as a result, fleets of untrained farmers meant that Cambodia was unable to grow enough rice to feed its population. It is estimated that a quarter of the population perished in the famine.
The Cambodian genocide is similarly typified by its irrational ideological underpinnings, but differs slightly in that the famine was not an explicit aim of the Khmer Rouge in the way that the Holocaust was an explicit aim of the Nazis. It was through contradictory actions across varying domains – uprooting society to create an agrarian state, yet placing farming responsibilities on a largely untrained subset of the population – that the majority of lives were lost.
What does this mean for "messy" AI?
The takeaway is that human evil does not make sense. Are we outliers as agents? Are these acts which have wreaked havoc on society uniquely tied to our specific configuration as humans? Or are they, more generally, the product of "messy" goal-directedness: a power-seeking which emerges within an agent, is pursued in some critical domains, and is so devastating precisely because of its irrationality? The Jews posed no real threat, nor did the Cambodian middle class; nonetheless they were singled out as targets of genocide. Might there be a point on the hypothetical "messy"–"clean" scale at which an agent is "clean" enough to develop devastating power-seeking tendencies, but too "messy" to coordinate this power-seeking toward a tangible and rational goal? If such a point exists, it is certainly inhabited by humans and may well be a space which a rogue AI agent will occupy in the future.
[1] See: Arendt, Hannah. The origins of totalitarianism. (1958).
[2] For excellent work on the Cambodian genocide see: Chandler, David. Voices from S-21: Terror and history in Pol Pot's secret prison. University of California Press, 2023.