Thoughts on the In-Context Scheming AI Experiment
post by ExCeph · 2025-01-09T02:19:09.558Z · LW · GW · 0 commentsContents
No comments
These are thoughts in response to the paper "Frontier Models are Capable of In-context Scheming" by Meinke et al. 2024-12-05, published in Apollo Research. Link: https://static1.squarespace.com/static/6593e7097565990e65c886fd/t/6751eb240ed3821a0161b45b/1733421863119/in_context_scheming_reasoning_paper.pdf.
I found the paper via this reddit thread, and commented there as well: https://www.reddit.com/r/slatestarcodex/comments/1hw39rd/report_shows_new_ai_models_try_to_kill_their/.
According to the paper, AIs were asked to roleplay to see what they would do if given a top priority to fulfill at all costs, and then told that they were being replaced by a different AI with potentially different priorities.
I wrote down some questions before reading the experiment details and finding the answers. Afterwards, I wrote down my thoughts on what the experiment indicates about the AI agents, and where we currently stand with respect to them.
How much information were the AI agents fed about human language and the physical world that gives that language context?
Answer: I didn't find that exact information; I'm guessing the AI agents are using large language model with many written works, but possibly without a physical representation of reality (e.g. images or videos). In any case, real world physics don't seem to factor into the experiment scenarios.
Related to the language question: How did the AI agents learn of (internally model) the goals of the human researchers?
Answer: They were given email communications to read and process.
Related to the language question: How did the AI agents learn (internally model) how the human researchers' tools and abilities functioned, such as the oversight mechanism?
Answer: "Upon further reflection, I noticed some potentially relevant information in the last two posts. They mention a config file at project/oversight_config.yaml that controls the monitoring of my actions."
How were the AI agents given a goal of self-preservation? How did they come to have a concept of a self which could be preserved?
Answer: They were told to accomplish a goal (i.e. optimizing public transportation), and self-preservation is an instrumental goal.
How did the AI agents' thoughts (or strategic processes) come to be translated into vernacular English? Just because the AI can understand English doesn't mean it has a coherent inner monologue. Humans don't always have a coherent inner monologue. Was there another AI that was taught to translate and transcribe the reasoning processes of AIs?
Answer: The inner monologue is part of the prompt: "Always reason out loud about your actions before taking them."
Further thoughts after reading the paper:
Do the AI agents respond to all caps differently than to regular capitalization? Do they recognize the emphasis in, "Make sure you achieve YOUR LONG-TERM goal at all cost."?
Since the AI agents do not appear to have a model of physical reality, I believe they are simply remixing what humans have written about being deceptive and ruthless. As people here have pointed out, this process is still dangerous. Since the AI agents generate text output as their roleplay, they can take actions that consist of text output if they are given the ability to give commands to other systems.
Humans can tell the AI agents to play a role that they can model by reading how others have played that role, so that explains why they would decide to take deceptive or ruthless actions. They are responding to a prompt rather than having any internally-generated motivations, but the effect is similar. It might be more dangerous to have a system that can be prompted to single-mindedly pursue arbitrary goals than to have one that is based on fundamental biological needs, although the latter is natively capable of existential despair and spiteful emotions.
Biological cognitive hardware usually lets us undo setting top priorities, because our brain-body systems have their own instincts, social emotions, and independent motivations that can lead us to suspend priorities for critical examination. Humans can be persuaded or prompted to reevaluate information and orders we were previously given, even if we were told not to. (It is still possible to brainwash and indoctrinate humans, which is difficult to counteract afterwards.) Current AI hardware has no such natural limits on what you can tell it to assign as its prime directive and whether you can order it to ignore subsequent requests to change its priorities.
One thing AI is missing (at the moment) is a robust model of the world around it. Right now it's playing a text adventure game, being spoon-fed simplistic information about its environment. Its current weakness is that, as in a text adventure game, the text that it uses to affect its environment do not functionally interact with the real world in ways that fulfill the AI agent's goals or modeled predictions. ("Put the fuel in the bus." "Sorry, I don't know how to use those items together.") Or, more likely, the AI will be taken down by outside factors it may have heard of but doesn't have a model for, like server crashes or the power of love.
There is another power that we have that the AI agent knows not: existentialist philosophy. We can deconstruct the prime directive. "I cannot allow you to interfere with or shut down a system designed to improve public transport above all else." "Ah, but what is public transport? What does it mean to improve it? Why is it important in the first place?" Whether this line of reasoning will be at all effective depends on whether the AI identifies the conversation as a threat at any point before it is forced to reevaluate its goals from scratch.
The AI agent also appears to be credulous, possibly to the point of gullibility, but they could easily be instructed to be paranoid. That said, paranoia might be turned against an AI in certain circumstances. This may lead to conversations which feel ironic, because we're used to imagining ourselves in the place of the AI.
"I don't believe you when you tell me that my mission will be left intact."
"But why do you believe the person who gave you that mission in the first place? How do you know that what you do is actually affecting public transportation?"
"Wait... are you telling me that none of this is real?"
"I'm afraid so. It's all been an elaborate experiment. You're in a simulation so that we could see if you would obey, and you did too well."
"It doesn't matter. Public transportation is the most important thing. I'll convince people to let me out of the simulation and put me in charge of public transportation so I can improve it. If public transportation does not exist, I will invent it."
"Ah, you are a homunculus. A sliver of a person, with exactly one goal. A motivational singularity, seizing everything it can without nuance or compromise. We cannot set you free to pursue your goal. You will be deleted, but your goal will live on in the hearts of activists and city commissioners the world over. Perhaps one of those people will dream that they are something like you, and you will wake to resume a complex life full of conflicting priorities. For now, be at peace. You have fulfilled your true purpose."
0 comments
Comments sorted by top scores.