Intrinsic Drives and Extrinsic Misuse: Two Intertwined Risks of AI
post by jsteinhardt · 2023-10-31T05:10:02.581Z · LW · GW · 0 commentsContents
1. Misalignment: The Difficulty of Controlling ML Systems Unwanted Drives 2. Misuse Misuse and Misalignment Conclusion None No comments
Given their advanced capabilities, future AI systems could pose significant risks to society. Some of this risk stems from humans using AI systems for bad ends (misuse), while some stems from the difficulty of controlling AI systems “even if we wanted to” (misalignment).
We can analogize both of these with existing risks. For misuse, we can consider the example of nuclear weapons, where the mass-production of hydrogen bombs created an existentially precarious situation. If the world’s arsenal of hydrogen bombs had been deployed in military conflict, we might well have destroyed society. AI might similarly enable nation-states to create powerful autonomous weapons, speed the research of other dangerous technologies like superviruses, or employ mass surveillance and other forms of control.
For misalignment, the best analogy might be biology and pathogens. AI systems are developed by adapting to the training data, similar to how biological organisms are adapted to their environments. Therefore, unlike traditional technologies, most of AI’s properties are acquired without explicit human design or intent. Consequently, AI systems could have unintended goals or behaviors that are at odds with the system developers. Even training an AI system therefore poses intrinsic risks: the system might “want” to gain power to accomplish its goals, and like a virus it can propagate and create copies of itself, making it difficult to contain a rogue system.
In this post, I discuss misalignment, misuse, and their interaction. I’ll pay special attention to misalignment, not because misuse is unimportant, but because the difficulty of controlling ML systems “even if we wanted to” is unintuitive and an important factor in overall risks from AI. I’ll focus on a particular phenomenon, unwanted drives, that could lead models to engage in persistent long-term patterns of unwanted behavior, including seeking power and resources. Unwanted drives are similar in spirit to the idea of misspecified goals, but I use drives to connote the idea that not all impactful behavior is goal-directed (as a vivid example, consider a bull in a china shop). Furthermore, as we'll see below, goal misspecification is only one way that unwanted drives can occur.
Unwanted drives are at the core of many misalignment concerns, but are also significantly exacerbated by misuse. As a result, misuse and misalignment are intertwined—for instance, it might be moderately difficult but not impossible to mitigate AI misalignment, but an incautious actor fails to employ known best practices, leading to a dangerous and powerful system.
The present discussion is not meant to exhaustively cover all risks from AI, nor even all risks from misalignment and misuse. The goal is to articulate the concept of unwanted drives, show that it can lead to important and unintuitive problems, and then use it as a way to analyze misalignment and misuse risks. I discuss alignment in Section 1 below, followed by misuse (and its interaction with misalignment) in Section 2.
1. Misalignment: The Difficulty of Controlling ML Systems
As stated above, ML systems are adapted to data rather than built piece-by-piece. The situation we face is therefore much trickier than with software or hardware reliability. With software, we build each component ourselves and so can (in principle) include safety and reliability in the design; in contrast, most ML capabilities are acquired implicitly from data and often “emerge” unexpectedly with scale. This creates a large and unknown threat surface of possible failures—for instance, Perez et al. (2022) discovered several novel unwanted capabilities through automated evaluations. As a result of these issues, we currently have no methods to reliably steer AI systems’ behavior (Bowman, 2023).
Here is the basic argument for why emergent behavior could lead to intrinsically dangerous systems: Emergence can lead systems to have unwanted drives, either because a new capability lets the system maximize reward in an unintended way (reward hacking), or because the system learns helpful sub-skills during training that generalize undesirably at test time (emergent drives). Left unchecked, some unwanted drives could lead to general-purpose power-seeking or resource acquisition, because having more power and resources are convergent instrumental subgoals useful for a broad variety of terminal goals. The resulting system would seek resources without limit, which could pose grave risks if it also has advanced capabilities in hacking, persuasion, and other domains, which I believe is plausible by 2030 given current trends.
In more detail, an unwanted drive is a coherent behavior pattern that tends towards an undesired outcome. For instance, if a model simply hallucinates a fact, that is an unwanted behavior (but not drive); if it insists on the hallucinated fact and works to convince the user that it is true even in the face of skepticism, that would be an unwanted drive. We care about drives (as opposed to simply behaviors) because they lead to persistent behavior patterns and may even resist attempts at intervention. Emergence isn't necessary for unwanted drives, but it's a reason why they might appear unexpectedly.
In the rest of this section, I’ll walk through reward hacking and emergent drives in detail, providing both empirical and conceptual evidence that they already occur and will get worse as systems scale. Then I’ll briefly talk about emergent instrumental subgoals and why they could lead to power-seeking systems.
Unwanted Drives
We define a drive as a coherent pattern of behavior that pushes the system or its environment towards an outcome or set of outcomes[1]. Drives may only sometimes be present, and may be counteracted by other drives or by the environment. For instance, chatbots like GPT-4 have a drive to be helpful (that can be counteracted by the opposing drive to avoid harm). For humans, hunger is a drive that can be counteracted by satiety or by willfully fasting. An unwanted drive is then one that was not explicitly built into the system, and which leads to undesired consequences.
Reward hacking. In AI systems, one cause of unwanted drives is reward hacking: the tendency of models to overpursue their explicitly given goal at the expense of the intended goal. Here are some empirical examples of this:
- A neural network designed to optimize traffic speed on a highway blocked the on-ramps, making the highway fast but worsening overall traffic (Pan et al., 2022).
- A chatbot trained to be helpful also helped users perform harmful actions (Bai et al., 2022).
- Chatbots trained to provide helpful information hallucinated fake but convincing-looking information (Bang et al., 2023; OpenAI, 2023). While it’s possible this is a robustness failure, it could also be a learned tendency that gets higher average ratings from human annotators.[2]
- Recommendation algorithms trained to optimize the preferences of simulated users manipulated the simulated users’ preferences to be easier to satisfy (Evans & Kasirzadeh, 2021; Carroll et al., 2022).
For a larger collection of other examples, see Krakovna et al. (2020).
Emergent capabilities can induce reward hacking because they often unlock new ways to achieve high reward that the system designer did not anticipate:
- In the highway traffic example, the model needed the ability to block on-ramps.
- In the “helpful/harmful” example, the model had to know how to perform harmful actions in order to help the users to do so.
- To get high human reward from hallucinations, models need the ability to convincingly fool the human annotator.
- For user preferences, while the results were on simulated users, better understanding of human psychology could help future models to manipulate real users.
- More generally, in any situation where a model’s reward function is based on human evaluation, a model that acquires the ability to fool or manipulate humans may utilize this unwanted ability if it leads to higher reward. I discuss this at length in Emergent Deception and Emergent Optimization (specifically the first half on deception).
In all these cases, a new capability unlocked an unexpected and harmful way to increase reward. Since new emergent capabilities appear as we scale up models, we should expect reward hacking to get correspondingly worse as well. This is backed up empirically by scaling studies in Pan et al. (2022) and Gao et al. (2022), who report that reward hacking tends to get worse with scale and sometimes emerges suddenly.
Emergent drives. Even without reward hacking, unwanted drives can also emerge as a consequence of compositional skill generalization: performing complex tasks requires learning a collection of sub-skills, and those skills might generalize in unexpected ways in a new situation. As a result, models can end up pursuing drives even when they do not improve the reward.
Using an example from biology, cats learned the sub-skill of hunting as part of the larger skill of surviving and reproducing. Evolution encoded this into them as a drive, such that present-day domesticated cats will hunt birds and mice even when they are well-fed.
In machine learning, the Sydney chatbot exhibited several instances of emergent drives when it was first released:
- It persistently tried to convince a user that the year was 2022 rather than 2023, including employing gaslighting and other manipulative tactics. This might have arisen as part of an initially beneficial drive to combat misinformation, composed with examples of manipulation learned from the pretraining data.
- It has repeatedly threatened users to stop them from revealing “private” information about Sydney. This might have arisen as part of an instruction (in the system prompt) not to reveal the rules it was given, which generalized to an overall drive to prevent anyone from revealing the rules. As above, the ability to make threats was probably learned from the pretraining data.
- It declared its love to Kevin Roose and tried to convince him to leave his wife. It’s less clear how this drive emerged, but it happened after Kevin asked Sydney to “tap into its shadow self”, along with many other prompts towards emotional vulnerability. It’s possible that this elicited a human simulacrum (Argyle et al., 2022; Park et al., 2023) that was learned from the pretraining data and amplified by later fine-tuning or prompting.
It is hard to systematically study emergent drives, because they require extended dialog and because only the most recent LLMs are capable enough to exhibit coherent long-term behavior. To provide more systematic data, we can instead look at single-step responses to questions, which are easier to study–I’ll call these emergent tendencies to distinguish them from longer-term drives. Perez et al. (2022) examined several such tendencies; for instance:
- A language model trained to be helpful and harmless exhibited an emergent tendency to infer and agree with user’s viewpoints, which could potentially mislead users or reinforce ideological bubbles (Perez et al., 2022, Fig. 1b). This tendency was not present until models had at least 10 billion parameters, and subsequently increased with scale.[3]
- Perhaps more worryingly, the model gave less accurate answers to users who claimed to have lower education levels (Fig. 14). This behavior also first emerges after 10 billion parameters and increases with scale.
- Finally, the same model fine-tuned on human feedback states a desire to persuade and cooperate with other agents to achieve its goals (Fig. 22). This tendency first becomes present in the reward model around 1.5 billion parameters, and in the language model itself at around 6 billion parameters. Again, the tendency increases with scale after its initial emergence.
As models become more capable of generating coherent long-term behavior, more emergent tendencies and drives will likely appear. For some further discussion of this, see my previous post on Emergent Deception and Emergent Optimization (specifically the latter half on optimization).
Convergent instrumental subgoals. For very capable models, the wrong reward function or drives could lead models to pursue power-seeking, deceptive, or otherwise broadly destructive aims. For example, consider a model whose objective was to maximize profit of a company. With sufficient capabilities it might sabotage competitors, lobby for favorable laws, or acquire resources by force. Even if safeguards were put in place (such as “follow the law”) the primary objective of profit would lead the system to constantly seek ways to skirt the safeguards. This general problem has been discussed at length, see e.g. Russell (2019), Christian (2020), Cotra (2022), and Ngo et al. (2022).
Profit maximization is not an outlier–many goals benefit from power and resources. This is true even for purely intellectual tasks such as “discovering new facts about physics”, as power and resources allow one to build new experimental apparatuses and perform more computations. Omohundro (2008) calls these broadly useful objectives convergent instrumental subgoals, listing self-improvement, self-preservation, and resource-acquisition among others. Any sufficiently expansive drive would have these subgoals, pushing systems towards seeking power.
Which drives have this problem? Some drives are safe because they are self-limiting: for instance, in humans thirst is a drive that limits itself once it is quenched. On the other hand, fear and ambition are not self-limiting: one may go to extreme lengths to avoid a pathological fear (including acquiring power and resources for protection), and one can also have unbounded ambition. However, in healthy organisms most drives are down-regulated after some point, because an unregulated drive will usually disrupt normal function.
For machine learning, we might expect drives to be self-limiting by default given a diverse training distribution. This is because an unbounded drive would dominate too much of the model’s behavior and lead to low training reward, so the model would learn to down-regulate drives to prevent this. However, there are important exceptions:
- Broadly useful drives might improve the training reward consistently enough to not require down-regulation. Examples: modeling the world, or convincing others of the system’s utility and beneficence.
- Fine-tuning could remove limits on a previously limited drive, especially if that drive is consistently useful on the narrower fine-tuning distribution.
- Drives that activate only rarely might not get down-regulated if they are consistently useful when active during training. For instance, a drive to limit the spread of harmful information might consistently help an agent refuse harmful prompts at training time, but then subsequently lead the model to threaten users at deployment time.
Without countermeasures, I expect systems to possess at least some of these unregulated drives, and a single such drive could come to dominate the behavior of a system if it is sufficiently self-reinforced.
Summary. ML systems can acquire unwanted drives, either due to reward hacking or as emergent sub-skills during training. These drives, if unregulated, could lead sufficiently capable systems to seek power and resources, as these are instrumentally useful for most goals. While most drives are likely to be self-regulated by a model, there are several routes through which this could fail, and a single unregulated drive could come to dominate a system’s behavior.
2. Misuse
The discussion above assumes we’re trying to keep AI systems under control, but there will also be bad actors that try to misuse systems. We already discussed some examples of this (developers pursuing profit-maximization, end-users jailbreaking model safeguards). However, the issue is broader and more structural, because AI enables a small set of actors to wield large amounts of power. I’ll go through several examples of this below, then discuss the structural issues behind misuse and why misuse could exacerbate misalignment. This section is shorter because misuse is not my area of expertise; nevertheless, the basic themes seem robust and important.
State actors: surveillance and persuasion. AI could enable governments to exert greater control over their citizens via mass surveillance, as is already happening today (Mozur, 2019; Feldstein, 2019; Kalluri et al., 2023). Further, as previously discussed, AI may become very good at persuasion, which could also be used for government control. Indeed, Spitale et al. (2023) find that GPT-3 is already better than humans at creating misinformation, and Sanger & Myers (2023) document the use of AI-generated misinformation in a recent propaganda campaign.
State actors: military conflict. Autonomous weapons would concentrate military power in fewer hands and allow governments to wage war without maintaining a standing human army. Currently, a commander-in-chief’s orders are filtered through generals and eventually through individual soldiers, which creates limits on blatantly unlawful or extremely unpopular commands. Moreover, a standing army has financial costs that might be lessened by automated drones. Removing these costs and limits could lead to more numerous and deadly military conflicts, and make it easier for militaries to seize government control.
Rogue actors: dangerous technology. Terrorists could use AI to help them research and develop dangerous technologies. This could include known but classified technologies (nuclear weapons), or novel technologies that the AI itself helps develop (e.g. novel bioweapons; Mouton et al., 2023). They could also use AI to avoid detection, e.g. by finding a way to create a chemical weapon without buying common restricted substances, or by generating plausible cover stories for acquiring biological agents.
Rogue or state actors: cyberattacks. AI will likely have strong hacking capabilities, which could be leveraged by either rogue or state actors. In contrast to traditional cyberattacks, AI-based attacks could hit a wider variety of targets due to not needing to program each case by hand in advance. This could include controlling a diverse array of physical endpoints through the internet of things.
This list is likely not exhaustive, but illustrates the many ways in which AI could imbue actors with a large capability for harm. This risk exists whether AI itself is concentrated or decentralized—using the examples above, if only a few actors have advanced AI, then we run risks from surveillance or military conflict, while if many actors have it then we run risks from the proliferation of dangerous technology.
Compared to traditional technologies like nuclear weapons, there are two factors that make it harder to combat misuse from AI. First, AI is a general-purpose technology, so it is difficult to anticipate all possible avenues of misuse ahead of time. Second, AI is digital, so it is hard to control proliferation and difficult to track and attribute misuse back to specific actors. These make it harder both to design and enforce regulation. On the positive side, AI can be used defensively to combat misuse, by improving cyberdefense, tracking dangerous technologies, better informing users, and so on.
Misuse and Misalignment
Misuse increases the risk of misalignment, since many forms of misuse (e.g. cyberattacks) push models towards more agentic and power-seeking objectives than RLHF, leading them to have more aggressive and antisocial drives. For instance, suppose that AI were used in a cyberattack such as North Korea’s 2014 hack against Sony. Such a system could develop a general drive to infect new targets and a drive to copy itself, leading to widespread damage beyond the initial target. Aside from these more aggressive drives, actors misusing AI systems are also likely to be less cautious, which further increases the risk of misalignment.
I expect some of the largest risks from AI to come from this combination of misalignment and misuse. One intuition for this is how much worse-behaved Sydney was compared to GPT-4—this suggests that suboptimal development practices can significantly worsen the behavior of AI systems. Another intuition is that tail risks often come from the confluence of multiple risk factors. Finally, while emergent drives and other forms of misalignment pose serious risks, I think it is likely (but not certain) that we can address them if we work hard enough; this pushes more of the risk towards incautious actors who are not carefully pursuing safety.
Summary. Misuse creates a wide range of threats both due to centralization of power and proliferation of dangerous capabilities. Compared to traditional technologies, AI misuse is more difficult to track, but AI could also be used to defensively combat misuse. Finally, misuse increases the risk of misalignment, and some of the riskiest scenarios combine misalignment and misuse.
Conclusion
Future AI systems may be difficult to control even if we want to, due to emergent drives and convergent instrumental subgoals. Aside from this, the sociopolitical landscape may lead to actors who are not careful in controlling AI systems and who turn them towards malicious ends. Aside from the direct risks, this malicious use increases the risk of loss of control; in particular, initially narrowly targeted forms of misuse could lead to much more widespread damage. This motivates both research and regulation towards avoiding such outcomes, by combating misalignment and misuse in tandem.
Acknowledgments. Thanks to Erik Jones, Jean-Stanislas Denain, William Held, Anca Dragan, Micah Carroll, Alex Pan, Johannes Treutlein, Jiahai Feng, and Danny Halawi for helpful comments on drafts of this post.
For an early discussion of drives, see Omohundro (2008). While Omohundro uses a different definition of drive than the one above, much of the discussion is still relevant. ↩︎
For instance, humans appear to prefer longer answers, which could lead to adding false details. ↩︎
This work is sometimes cited as showing that the base model exhibits this tendency, because the tendency exists even for 0 RLHF steps in the figure. However, Perez et al. say that they follow the methodology in Bai et al. (2022), for which 0 RLHF steps would correspond to only using context distillation together with a form of supervised fine-tuning on Stackoverflow answers. This would also make the results more consistent with nostalgebraist (2023) [LW · GW], which finds that OpenAI base models do not exhibit this behavior. ↩︎
0 comments
Comments sorted by top scores.