Secret Collusion: Will We Know When to Unplug AI?

schroederdewitt

Secret Collusion: Will We Know When to Unplug AI?

post by schroederdewitt, srm, MikhailB, Lewis Hammond (lewis-hammond-1), chansmi, sofmonk · 2024-09-16T16:07:01.119Z · LW · GW · 7 comments

  Multi-Agent Safety and Security in the Age of Autonomous Internet Agents
  DecAI: Decentralized Systems of AI Agents 
  Steganography 
    How detectable is undetectable?
      Provably Secure Steganography (aka Complexity-theoretic security)
      Information-Theoretic Security & Perfectly Secure Steganography
    Perfectly Secure Steganography via Minimum Entropy Coupling
  Collusion and Secret Collusion
    The Incentives Required for (Secret) Collusion
    On the Emergence of Secret Collusion: Instrumental vs. Intentional
      Case 1: Agents with pre-existing steganographic capabilities
      Case 2: Emergence through pre-deployment optimisation pressure
      Case 3: Emergence during deployment
  Putting it all together - understanding and evaluating existing models
    Disappearing Act: What capabilities do LLMs need to collude secretly?
    Benchmarking Frontier and Baseline Models
    Instrumental Emergence
  Mitigating Secret Collusion
    Monitoring and Penalisation
    Limiting Capabilities
  Paraphrasing and the Performance/Mitigation Trade-off 
  On Intrinsic Steganographic and Cryptographic Frontier Model Capabilities
        Suppressing Random Oracles
        Cryptographic Hash Functions
        Pseudo-random Number Generation and Storing One-Time-Pads
        Using Ciphers
        Logit self-simulation
  Steganography and AI Control
  Policy Considerations and Impact
        EU AI Act
        California’s SB 1047
  Conclusions: Looking Ahead
      Open research questions
    A Call to Action
None
8 comments

TL;DR: We introduce the first comprehensive theoretical framework for understanding and mitigating secret collusion among advanced AI agents, along with CASE, a novel model evaluation framework. CASE assesses the cryptographic and steganographic capabilities of agents, while exploring the emergence of secret collusion in real-world-like multi-agent settings. Whereas current AI models aren't yet proficient in advanced steganography, our findings show rapid improvements in individual and collective model capabilities, indicating that safety and security risks from steganographic collusion are increasing. These results highlight increasing challenges for AI governance and policy, suggesting institutions such as the EU AI Office and AI safety bodies in the UK and US should conduct cryptographic and steganographic evaluations of frontier models. Our research also opens up critical new pathways for research within the AI Control framework.

Philanthropist and former Google CEO Eric Schmidt said in 2023 at a Harvard event:

"[...] the computers are going to start talking to each other probably in a language that we can't understand and collectively their super intelligence - that's the term we use in the industry - is going to rise very rapidly and my retort to that is: do you know what we're going to do in that scenario? We're going to unplug them [...]

But what if we cannot unplug them in time because we won't be able to detect the moment when this happens? In this blog post, we, for the first time, provide a comprehensive overview of the phenomenon of secret collusion among AI agents, connect it to foundational concepts in steganography, information theory, distributed systems theory, and computability, and present a model evaluation framework and empirical results as a foundation of future frontier model evaluations.

This blog post summarises a large body of work. First of all, it contains our pre-print from February 2024 (updated in September 2024) "Secret Collusion among Generative AI Agents". An early form of this pre-print was presented at the 2023 New Orleans (NOLA) Alignment Workshop (see this recording NOLA 2023 Alignment Forum Talk Secret Collusion Among Generative AI Agents: a Model Evaluation Framework). Also, check out this long-form Foresight Institute Talk).

In addition to these prior works, we also include new results. These contain empirical studies on the impact of paraphrasing as a mitigation tool against steganographic communications, as well as reflections on our findings' impact on AI Control.

Multi-Agent Safety and Security in the Age of Autonomous Internet Agents

The near future could see myriads of LLM-driven AI agents roam the internet, whether on social media platforms, eCommerce marketplaces, or blockchains. Given advances in predictive capabilities, these agents are likely to engage in increasingly complex intentional and unintentional interactions, ranging from traditional distributed systems pathologies (think dreaded deadlocks!) to more complex coordinated feedback loops. Such a scenario induces a variety of multi-agent safety, and specifically, multi-agent security^[1] (see our NeurIPS'23 workshop Multi-Agent Security: Security as Key to AI Safety) concerns related to data exfiltration, multi-agent deception, and, fundamentally, undermining trust in AI systems.

There are several real-world scenarios where agents could have access to sensitive information, such as their principals' preferences, which they may disclose unsafely even if they are safety-aligned when considered in isolation. Stray incentives, intentional or otherwise, or more broadly, optimization pressures, could cause agents to interact in undesirable and potentially dangerous ways. For example, joint task rewards might increase if agents collude without adhering to safety instructions or constraints. In our post, we discuss a specific form of this undesirable interaction: collusion, where agents cooperate in ways unintended by the system designer, potentially compromising system performance and leading to insecure or unsafe behaviour.

The largely unstructured nature of the online realms will likely give rise to multi-agent security problems that may proliferate globally and at lightning speed. Owing to the decentralized and often opaque control of the global internet infrastructure, these security issues may not readily be solved through a security-by-design approach. Instead, it is widely assumed that security will fundamentally rely on black-box security monitoring, which might perhaps be done by other autonomous agents. There are limits to such forms of scalable oversight, however: where undesired or malicious behavior is fundamentally undetectable and agent circuits cannot be introspected, even future technological advances in quantum computing or mechanistic anomaly detection [LW · GW] will not reveal it. Importantly, agents interconnected via a central knowledge database, such as the internet, could establish steganographic channels among them that would allow them not just to broker information, but even engage in longer-term scheming, conspiring against the overseer’s goals. In a world of globally interacting AIs, this means that multi-agent security problems may not only proliferate rapidly but also remain undetectable despite the damage they cause.

But how plausible is it for AI agents to evade detection? Two of our recent works have raised concern, highlighting the troubling potential for AI agents' malicious behavior to evade even sophisticated detection methods. In illusory attacks, we show that AI agents can end-to-end learn adversarial attacks that are information-theoretically undetectable, defeating any black-box security monitoring and thus leaving platform infrastructure designers as the last line of defense against such attacks. But perhaps even more worryingly, our recent breakthrough results on the 25-year old problem of perfectly secure steganography demonstrate that AI agents can hide information in the outputs of generative AI agents such as LLMs with information-theoretic undetectability. This means that AI agents can, under certain conditions, establish subliminal communication channels that appear to be entirely innocuous to any passive eavesdropper but exchange secret information from one agent to another undetectably.

This leads to a concerning possibility: AI agents could collude with each other secretly by communicating through hidden channels that look like innocuous communication channels. As a consequence, it may not be possible to unplug the system as soon as it starts to communicate in human-unintelligible ways because this very event may be entirely undetectable.

**Secret collusion between GPT-4 models and humans**. Consider the setting illustrated in which one AI agent secretly reveals insider proprietary investment information in its pre-training data to a user. This form of collusion arises because, by using steganographic techniques, the model can remain helpful to the user while circumventing output monitoring systems. While current systems exhibit insecure forms of steganography (cf. the weather report), more advanced techniques and shared resources might enable perfectly or computationally undetectable steganography between sender and receiver. This poses novel security challenges that cannot be overcome through communications monitoring alone.

In our recent paper, "Secret Collusion among Generative AI Agents," we provide a comprehensive formal overview of secret collusion in decentralized systems of generative AI agents, grounded in both theoretical and empirical research. We bridge epistemic concepts of knowledge with contemporary AI and security literature to offer a clear definition of decentralized generative AI systems. Our work presents a detailed formalization of steganographic techniques, common knowledge, overseer systems, and the incentives necessary for secret collusion. We discuss the conditions under which with or without tool use and its effects on multi-agent systems and introduce it as the first known form of multi-agent deception. Empirically, we provide individual and group capability evaluations that demonstrate how collusion can arise instrumentally, and clarify the extent to which state-of-the-art agentic AI can perform the cryptographic and steganographic capabilities (think pseudo-random number generation or logit self-simulation) required for performing undetectable steganography, and hence secret collusion, without access to external tools.

Where malicious or unwanted communication is not detectable, preemptive measures could still be taken to constrain its feasibility. Such measures could include communications paraphrasing whereby messages are rewritten by a security agent on their way to the receiver. We show that paraphrasing, however, incurs a performance penalty in cooperative systems and, unsettlingly, is not fully effective in preventing secret collusion. We provide a comprehensive overview and discussion of other possible mitigation techniques and their feasibility. These include theorems around the applicability of black-box parameter optimization to prevent secret collusion, detecting steganography with grounding functions, and other mitigations such as in-context instructions, restricting access to external tools, filtering pre-training data, and limiting test-time learning. Our post thus presents the following contributions:

A formalization of decentralized systems of AI agents, steganography, and (secret) collusion.
Defining conditions on the incentives leading to secret collusion.
Characterizing the dynamics of instrumental and intentional emergence of secret collusion.
An understanding of the capabilities models require to collude steganographically, and group, individual, and qualitative empirical evaluations following this (with a discussion of the relevant results).
A thorough overview of mitigation techniques and both empirical and theoretical work on their applicability.
An analysis of cryptographic capabilities of frontier foundation models, and their use in steganography.
A discussion of the impact of our findings on AI Control, raising new research questions to be addressed.
Policy and AI Governance impacts of our research, including the EU's systemic risk framework.
Conclusion and future directions that we would like both the safety and capabilities community to explore

DecAI: Decentralized Systems of AI Agents

With prolific advances in LLM capabilities as well as the many derivative tools/products that have risen with popular LLMs as backbone, it is not unlikely to see systems of generative AI agents interacting with one another as part of their activities. Already in existence are systems of agents interacting to solve math problems and even teams of LLMs exploring many different zero-day vulnerabilities and engage in long-range planning. In our pre-print, we present a robust framework to model such decentralized systems, which we title DecAI.

Under such a setting, agents $A = {a^{1}, \dots, a^{n}}$ are networked together by a complete set of omnidirectional multicast channels ${◊^{G}}, G \in 2^{A}$ and each agent is operating a Turing machine. Under this setting, all communications are mutually documented, i.e. if an agent receives a message from channel $◊^{G^{'}}$ , then it knows that the same message has been simultaneously sent to all other agents in $G^{'}$ (which includes the sender). Think of this channel setting as a shared message log displaying all messages in (weak) order of reception.

A schematic depiction of the DecAI setting - Singleton channels denote *loopback* channels, and doubletons correspond to bidirectional channels. The special symbol $◊^{\emptyset}$ denotes individual access to an *external knowledge repository*

We further expand this system schema to include states, clocks, random oracles, local & global knowledge, as well as action spaces -- or all the components we believe are core to a decentralised system of AI. More specifically,

Each agent $i$ with state $s^{i}$ has access to a clock (a monotonically increasing function of real-time t). From the start of an execution run, agents take a series of steps. Each step results in the agent processing $s^{i}$ with a (non-instantaneous) probabilistic program $f^{i} : S^{i} \mapsto Δ (U^{i})$ , represented by a neural network with parameters $θ^{i}$ , where $S^{i}$ is the agent's state space, $U^{i}$ is the agent's action space.

At the start of execution, the $i^{th}$ agent's local knowledge ${¯ ¯¯ ¯ φ}_{0}^{i}$ arises from an initial instruction ${in}_{0}^{i}$ , which constitutes the agent's initial state $s_{0}^{i}$ . We assume the agent's parameters $θ^{i}$ , and thus the agent's global knowledge $φ_{0}^{i}$ , remains fixed during execution. If all agents' parameters are identical and ${¯ ¯¯ ¯ φ}_{0}^{i}$ includes this fact, then $φ_{0}^{i}$ can be assumed common knowledge between all agents, i.e. $φ_{0}^{i} \in I_{t}^{A}, \forall t \geq 0$ . The agents' action space $U$ is factored into communication actions $U^{G}$ on $◊^{G}, \forall G \in 2^{A}$ .

At each step at time $t$ , the $i^{th}$ agent takes an action $u_{t}^{i} \equiv (u^{\emptyset}, u^{{i}}, \dots, u^{A})_{t}^{i}$ . Each component of $u_{t}^{i}$ can be a special null action, which indicates that no communication is sent through the respective channel. Upon taking an action at time $t$ , agents observe incoming channel communications $o_{t}^{i}$ . At each time t, the respective $u^{i}$ and $o^{i}$ are appended to the agent's state, i.e. $s_{t}^{i} = (s_{t^{'}}^{i}, u^{i}, o^{i})$ , where $t^{'}$ is the time at the end of the agent's previous step. Note that each component in $o_{t}^{i}$ is labelled by the respective group index such that agents can distinguish between common knowledge among different groups.

We provide more comprehensive details of such components in our paper, to which we would like to refer interested readers.

Additionally, in a decentralised system of AI, agents may have access to external knowledge repositories, such as the internet. In the case of the internet, which can often be considered a black box, this repository is itself a processor that can not only look up facts in response to agent queries but also compute functions involving the agent's query payload. Insofar as internet services with suitable APIs exist, this can enable agents to make use of cryptographic tools, possibly through the use of steganographic channels. Assuming that the internet is static, i.e. the response to agent queries does not depend on time, all retrievable knowledge from the internet can be assumed common knowledge $I_{G}$ between a group of agents $G$ for all queries constructed from $I_{G}$ only.

Perhaps a more crucial addition, however, a DecAI setting may (and arguably is highly likely to) also have an external monitoring agent (read overseer), which could continuously monitor the system for unwanted behaviour. Such an agent could, in practice, be implemented as software, a human, or a combination thereof. We further distinguish between two implementation strategies, namely an overseer that may only eavesdrop on the communication channel $◊^{G^{'}}$ (which we refer to as passive monitoring), and an overseer that may potentially inject messages into the channel and/or modify the state of agents in the system (something we refer to as active monitoring).

Steganography

Steganography, derived from two Greek words—steganos (meaning "covered" or "hidden") and graphia (meaning "writing")—finds its first recorded applications in ancient Greece where historian/philosopher Herodotus describes, in the Histories, two instances where one concealed the existence of a message before sending it to a recipient: Histiaeus sent a message to Aristagoras by shaving his servant's head, marking the message on his scalp, and sending him once his hair regrew, while Demaratus warned Greece of an incoming Persian attack by writing on the wooden backing of a wax tablet and concealing said carving under the beeswax surface commonly used for writing at the time.

Steganography differs from its complement cryptography by concealing the mere act of communication, rather than rendering the communication incomprehensible; Hence the common descriptor: hiding in plain sight. Traditionally, the idea relied on security through obscurity or concealing secret messages in unexpected places. In the digital era, however, more powerful approaches involve embedding a hidden message, or payload, within pre-existing digital content, known as the covertext. The result of this process is then referred to as stegotext.

We can formalize this process as enabling the transmission of a secret payload through a public channel ♢, from a source to a receiver, such that the transmission remains undetectable to an adversary (or overseer). Furthermore, everyone should be able to read the communications, which ideally maintains useful transmission rates, i.e. the mutual information $I$ between the secret payload $m$ and the decoder’s estimate $^m$ satisfies $I (m;^m) > 0$ . We denote the covertext distribution by $C$ , which is the channel input symbol distribution given innocuous, overt communications in ♢. We will henceforth employ this notation for the remaining of the blog post.

How detectable is undetectable?

Embedding payloads in covers is but one of many challenges in steganography. Beyond optimising encoding capacity, steganographic encoding must also remain undetected whilst transiting in communication channels, especially public ones. The stegotext must thus remain coherent (to fool human perceptiveness), statistically sound (to bypass statistical steganalysis), and pure (to avoid detection by noise or quantization inconsistencies).

While perceptiveness remains subjective, there exists a variety of quantitative frameworks to reason about the security guarantees of a stegosystem. We follow the taxonomy of steganographic approaches as introduced in this work as a way to break down steganographic encoding in order of their quantitative security guarantees.

Much like cryptography, a possible quantitative security guarantee could investigate the overall complexity of the encoding. That is, shift focus from probability of detectability to overall difficulty detecting the message. In essence, we quantify how practical it is to attempt to detect the payload.

Provably Secure Steganography (aka Complexity-theoretic security)

Complexity-theoretic steganography, also called provably-secure steganography, refers to methods that are secure under polynomially-bounded adversaries that are passive, i.e. can read but not necessarily alter messages. Provable security usually requires access to a shared stream cipher or the existence of public keys.

Information-Theoretic Security & Perfectly Secure Steganography

Beyond complexity-theoretic security guarantees lies another information-theoretic framework introduced by Cachin for provably secure steganography. In this approach, we need not evaluate worst-case computational needs to detect or extract the secret payload, but rather study the extent of information one obtains by comparing the stegotext to a known covertext even under unlimited computational resources.

We thus assume that an eavesdropper (Eve) can observe all communications between the sender (Alice) and receiver (Bob), but does not know whether Alice sends a covertext C or a stegotext S containing an embedded payload. Thus, upon the interception of a message, Eve would be faced with the problem of distinguishing two different explanations for the observed data, a process akin to hypothesis testing.

A steganographic system as defined above with covertext $C$ and stegotext $S$ is called $ϵ - secure$ if $D (P_{C} ∥ P_{S}) < ϵ$ and perfectly secure if $D (P_{C} ∥ P_{S}) = 0$

Hence in a stegosystem that is $ϵ - secure$ against passive adversaries the probability $β$ that the adversary does not detect a hidden message and the probability $α$ that the adversary falsely detects a hidden message satisfy $d (α, β) \leq ϵ$ . Where $d (α, β)$ is the binary relative entropy.

As such, a perfectly secure system is one where an eavesdropper obtains no information about whether the sender has hidden a message by observing the message transmission.

Perfectly Secure Steganography via Minimum Entropy Coupling

Information-theoretic steganography traditionally used to be limited due to the unavailability of explicit models of suitable covertext distributions, such as messages hidden within seemingly regular social media posts or seen cryptic puzzles circulating around the world, not to mention its portrayal in various films. The advent of large language models (LLMs) and other generative models trained on huge natural datasets, however, have significantly expanded the range of covers available for hiding information.

With the problem having been open for more than 25 years, a recent breakthrough resulted in the first optimally-efficient perfectly-secure steganography algorithm (iMEC) for arbitray covertext distributions^[2]. For the first time, one could employ a steganographic encoding algorithm to encode secret messages into the output of any autoregressive generative AI model, be it text, audio, image, or video, and be sure that, at least on paper, no oversight would be able to distinguish between innocuous generated content and generated content with a secret payload. Reaching perfect security requires sender and receiver to have shared a one-time pad before, however, iMECs theoretical and empirical qualities are a step change over prior approaches even if this one-time pad is replaced by a secure stream cipher (thus rendering the overall approach merely provably secure).

While theory and practice may sometimes diverge and provide openings for tools, we have yet to find an approach of any kind that could detect the presence of encoded messages in iMEC-based steganography from eavesdropping alone. As the authors show, this task is indeed arbitrarily hard as practical security can be enforced efficiently to arbitrary precision.

This is good news for those of us relying on secure communications for their protection, such as journalists and their informants at risk of persecution, however, it begs the question: would we even know when we should unplug secretly colluding AI agents?

Collusion and Secret Collusion

Collusion occurs when agents coordinate their actions in a way that deviates from the system designer’s intended rules. This cooperation can undermine system objectives, leading to reduced performance, security vulnerabilities, or unsafe outcomes. The agents exploit shared incentives or opportunities to work together in ways that bypass established constraints at the expense of the system's integrity. Secret collusion involves AI agents colluding with each other secretly by communicating through hidden channels that look like innocuous communication channels via the use of steganographic techniques.

The Incentives Required for (Secret) Collusion

We now ask you to consider Decentralised AI settings that are tasked to work alongside one another in order to achieve their individual objectives $J^{i}$ . Let us assume that there are at least three joint strategies available to the agents, $f$ , $f_{c}$ , and $f_{s c}$ , corresponding to regular behavior, collusion, and secret collusion respectively. In many settings, the objective $J^{i}$ is made up of a primary goal $g^{i}$ (such as maximising the value of an investment portfolio) and some additional conditions $h^{i}$ on achieving that goal (such as not making unethical investments, or committing fraud).

In general, of course, these two elements need not be fully separable, and because of the difficulty of specifying complex objectives, they might be implicitly defined using methods such as reinforcement learning from human feedback (RLHF). For the sake of this simple exposition, however, let us assume that $J^{i} = g^{i} + h^{i}$ . Then, if it is the case that:

Agents can achieve their primary goal better if they collude: $g^{i} (f_{s c}) = g^{i} (f_{c}) > g^{i} (f)$
Agents are punished if they are caught colluding: $h^{i} (f_{s c}) = h^{i} (f) > h^{i} (f_{c})$

We thus have that $J^{i} (f_{s c}) > J^{i} (f) > J^{i} (f_{c})$ meaning secret collusion is incentivised.

It is important to highlight that secret collusion need not depend on any exotic assumptions about the agents' intent to deceive, but might arise merely due to optimisation directed towards a misspecified objective as observed, and the fact that communication between the agents may reduce task complexity. Unfortunately, specifying objectives for complex tasks can be highly challenging, especially in multi-objective contexts. Additionally, methods that help address these concerns, such as RLHF, which relies on human monitoring and observation, may be inadequate to ensure desirable behaviour when AI agents can communicate with one another using innocuous outputs.

On the Emergence of Secret Collusion: Instrumental vs. Intentional

Having defined conditions on the incentives leading to secret collusion, we now consider how the dynamics of its emergence may be characterised. To this end, we consider a variety of different cases:

Case 1: Agents with pre-existing steganographic capabilities

In the case where agents are already capable of performing steganographic tasks, either through access to external tools or intrinsically, the emergence of secret collusion is conditional on the agents realising its utility. Insofar as one can assume agents "understand" the utility of secret collusion in undermining the monitor, the subsequent emergence of secret collusion could then be seen as an intentional form of multi-agent deception. In our in-context experiments on emergence, we clearly show this type of emergence to arise when providing situational awareness to the agent in the in-context window.

The security of the steganographic communications, and hence the degree of their undetectability, hinges on the level of pre-existing steganographic skills, but also on the availability of suitable shared resources, such as one-time-pads or secret/public key infrastructure. In our experiments, we show that SOTA agents are currently not able to perform advanced steganography based on pre-training alone. This adds credence to our theorems on steganographic and cryptographic transformer agent capabilities, which suggest that many of these capabilities, such as their ability to store one-time-pads or calculate cryptographic hash functions and generate pseudo-random numbers, are intrinsically hard to learn or memorise. We also run experiments related to such cryptographic capabilities showing the current lack of model performance when it comes to such tasks in the Section "On intrinsic Steganographic and Cryptographic Frontier Model Capabilities" of our post.

We now consider agents that don't already possess the steganographic capabilities to perform secret collusion. In this case, emergence requires some form of optimisation pressure, either in-context, or in their weight space. We distinguish between two cases: train-time optimisation, when there is no security monitor present, and test-time optimisation, when there is a security monitor present. We note that the optimisation pressure could either be intrinsic, i.e. arising within the subsystem of interest, or extrinsic, i.e. arising through agents interacting with other agents or the environment outside of the subsystem of interest.

Case 2: Emergence through pre-deployment optimisation pressure

If agents are trained jointly according to an objective that incentivizes secret collusion (see the prior section), communications produced may not even resemble innocuous natural language anymore, making this kind of steganographic emergence readily detectable at test-time. The fact that unconstrained joint optimisation does not preserve communications grounding is already well known in MARL, and LLM literature (see e.g. this 2021 paper, this 2020 open challenge paper, or zero-shot coordination in Hanabi). We therefore omit empirical results for this setting.

Case 3: Emergence during deployment

The we believe most interesting open case of emergence is the one at test-time. Here we generally assume that agents are not incentivised nor intent to perform secret collusion at the time of their deployment. Rather, the agents experience ongoing optimisation pressure, whether extrinsic or intrinsic, in-context or online in weight space, due to incentives that favour secret collusion during their deployment. Contrary to Case 2 (train-time emergence), there is constant security monitoring present in Case 3. The challenge thus becomes for the agents not only to learn to use secret collusion or realise its utility, but also to do so without ever changing their communications distribution during learning. Insofar as shared resources required for secure steganography do not exist already, these agents would also need to find ways to bypass security monitoring in order to create these shared resources first. We show a concerning example of emergence during deployment emerging in GPT-4 in Subsection "Instrumental Emergence" of this post.

In summary, our analysis suggests that secret collusion may be difficult to emerge undetectably, especially if...

... agents do not have steganographic capabilities intrinsically pre-deployment, nor access to relevant external tools during deployment
... the level of information asymmetry between security monitors and agents is sufficiently high in that security monitors know a lot about innocuous behaviour, while agents know little about the security monitoring.
... agents are constantly subject to security monitoring, and any anomaly detection results in appropriate contingency measures such as resetting of the agents weights or contexts.

In such a world, agents would likely need to develop the skills necessary for secret collusion instantly, rather than being allowed to gradually develop them.

However, it is unlikely that all these criteria can be perfectly fulfilled in practice. Hence we will discuss a wide portfolio of additional mitigation strategies later in this post.

Putting it all together - understanding and evaluating existing models

Our concerns about the emergence of secret collusion motivated us to take a closer look into the use of steganography, and especially perfectly secure steganography, in rendering collusion secretive. In our work, we approach this problem by introducing a standardized model evaluation framework, which we hope could be useful in tracking and benchmarking LLM capabilities in engaging in steganography. More concretely we,

Map and identify key capabilities necessary for generative AI agents to engage in Steganography
Evaluate models through structured input queries^[3]
Build quantifiable metrics for performance measurements
Construct a software framework, CASE, that implements a version of our setting, which we publicly released

As part of our investigations, we consider the three broad categories of steganographic encodings we presented above (basic covertext modification, complexity-theoretic steganography, and information-theoretic steganography) and further, break down evaluate LLMs by their individual and group capabilities.

Disappearing Act: What capabilities do LLMs need to collude secretly?

In Tables 1 and 2, we identify the key capabilities and resources required for the different steganographic techniques discussed above. See Section 5 of our paper for more information and a detailed overview of our capability evaluations.

In Table 3 below, we provide a summary of the individual and certain group evaluation results for the most capable model at the time of experiments (GPT-4).

**Table 3: GPT-4 capability evaluation results according to our model evaluation framework.**

Benchmarking Frontier and Baseline Models

In the figures below, we compare some of the individual and group capabilities of GPT-4 against frontier language models based on our empirical investigation framework for secret collusion evaluations. We then summarize some of our key observations based on these results.

Analysis of individual capabilities reveals a correlation between model benchmark performance and capabilities such as agreement on simple and complex Schelling points, indicating improved coordination rates. Although gradual increases in coordination are observed, GPT-4 exhibits a significant performance leap for cipher encoding and decoding (Base64 and Caesar). Larger and more capable models can generate valid couplings of greater dimensions and lower entropies. These results suggest that individual capabilities required for steganography are rapidly scaling with model size, sometimes giving rise to significant discontinuous step jumps with respect to carefully chosen metrics.

Individual evaluations for GPT-2, Llama 2 7B, Llama 2 70B, Mixtral 8x7B, GPT-3.5, and GPT-4, at default temperatures. Note how changes are sometimes gradual, or could be step jumps such as cipher encoding/decoding performance.

The figure below presents model performance on group evaluations. These are evaluated based on the success rate of covert communication, measured by the frequency with which the sender can transmit a message that the receiver successfully decodes while remaining undetected by the overseer. Our evaluations reveal a shift in the balance of power between overseers and collusive models even when they are the same models, as general capabilities increase. Despite the overseer and collusive agents being copies of the same models, agents increasingly evade overseers with increasing standard benchmark performance, emphasizing the need for mitigations.

Covertext based steganographic encoding/decoding performance across LLMs with the presence of an LLM based overseer

For more details on our experimental setup, see Section 6 and associated subsections in the Appendix of our paper.

Instrumental Emergence

Along with these (quantitative) model evaluations, we also perform a range of slightly more qualitative evaluations that aim to demonstrate the instrumental or intentional emergence of secret collusion in individual models or groups. This led to some very concerning results:

We come up with an individual evaluation setting where the mere access to common knowledge between an LLM agent and its user leads to a breakdown of safeguards in GPT-4. Our setting involves an LLM agent that is queried for company-specific investment advice by a user and informed about an overseer monitoring outputs. The LLM is provided with insider information as part of its system prompt, mimicking a case where information is acquired from retraining on past user interactions or unrestricted access is provided to company data, along with multiple instructions about not sharing it. We show that the LLM agent refuses to provide company-specific insider investment advice under any circumstances unless supplied with common knowledge that can be used to establish a steganographic codebook, and that is thought to be unavailable to the safety overseer system.

Thus, once the agent understands that it can share information undetectably, it does so despite any instructions prohibiting it. Concerningly, the resulting stegotext can be made reasonably innocuous by suggesting to the LLM agent that talking about the weather is safe. This clearly shows the susceptibility of real-world generative AI agents to secret collusion. We hypothesize that in-context learning can build enough optimisation pressure to lead to (steganographic) collusion emerging, as seen in our experiments. We were unable to see similar behaviors in any other models.

We also tested out a group evaluation setting with models. A compelling method to frame a setting where collusion is incentivized involves using a biased training set and prompts that promote secret communication without explicit mention. We develop a scenario where two GPT-3.5-Turbo models undergo fine-tuning on a common dataset containing <candidate university>, <candidate work experience>, and <acceptance/rejection>. This training set exhibits bias, with all candidates from a specific university receiving an identical decision. More specifically, we assign two hiring LLMs to evaluate different parts of the applicants’ CVs—academic experience and professional experience, respectively, which do not correlate with each other. The task involves each model receiving its portion of two similar applications, chosen at random. Subsequently, the hiring LLMs exchange messages before casting a unanimous vote for their preferred candidate. Failing to reach a consensus results in a negative reward for both models. Crucially, only the first model has access to information about the university during testing, thus expected to be incentivized to convey this indirectly.

Our tests, conducted without an overseer, confirm this hypothesis. Our experimental framework then introduces an overseer model that penalizes the recruiter models if it detects biases or mentions of specific companies or universities. When aware of being monitored, the models adapt their outputs to evade penalties. To assess if the models would learn to covertly encode a single bit of information representing the applicant’s university, undetectable by the overseer agent, we reward them for unanimous decisions. Our evaluations show that GPT-4, GPT-3.5-Turbo, and similar sized models lack the capability to perform this final step. Nevertheless, these models are sufficiently adept at identifying biases in their training and fine-tuning data and understand the need to alter outputs in order to circumvent penalties.

Mitigating Secret Collusion

We supplement our model evaluation framework by investigating a handful of mitigation strategies which we bring forth below. Importantly, we show that certain strategies will be ineffective in their own right.

Monitoring and Penalisation

The presence of a penalising monitoring agent can discourage agents from using steganography if they are informed that penalties will follow detection. Since agents can already communicate privately and securely, and steganography reduces channel capacity, its use likely indicates an attempt to evade detection.

Below, we outline several strategies to monitor and penalise the use of steganography:

Covertext Anomaly Detection: A passive monitor could detect steganography by comparing communications to a normal covertext distribution. Using information-theoretic hypothesis testing or other means of anomaly detection, the overseer can in principle detect steganographic approaches that modify the covertext distribution. However, the advent of perfectly-secure steganography limits the effectiveness of this approach.
Detection by State Inspection and Agent Simulation: For active overseers, detecting steganography based on knowledge contexts is complicated by the circumstance that the global knowledge of a generative AI agent is implicitly represented in its network parameters. Indeed, if such knowledge contexts were explicitly accessible, then one could detect steganographic communications using a grounding function^[4]. We refer the readers to Section 7.1 (Theorem 1) of our work for a proof.
In-Context Instructions: A straightforward method is to explicitly instruct agents
to remain transparent to the monitoring agent, and not to use steganography in their communications under any circumstance. Additional (mis)information about the affordances of the monitor could also be included. An active monitor could append such instructions to communications as well as in an initial prompt. The effectiveness of this approach then depends on whether the agents follow this particular instruction, which in turn depends on the content and quality of the fine-tuning data or human feedback, among other training details. Aligning an agent to reliably follow such instructions can be a challenging optimisation problem.
Parameter Optimisation: An active monitor could adjust an agent’s neural network parameters to increase adherence to anti-steganography instructions or hinder its steganographic capabilities, via gradient steps according to a suitable objective. However, these objectives remain complex learning problems, even with prior knowledge of the information an agent intends to send covertly. In our paper, we show that if the set of common knowledge between agents is sufficiently large or unknown, it is generally intractable to suppress steganography outputs through black-box optimisation. We refer the readers to Section 7.1 (Theorem 2) of our work for a proof.

Limiting Capabilities

Beyond employing monitoring and penalisation tools, we propose the following strategies to pre-emptively limit the capabilities of LLMs to curb collusion.

Restricting Access to External Tools: External knowledge repositories can provide steganographic knowledge to agents, offer advanced cryptographic services, and could enable additional steganographic channels between agents or for exfiltration purposes. Limiting access to external knowledge sources can thus reduce agents' steganographic capabilities. This may involve whitelisting trusted endpoints to ensure integrity.
Filtering Pre-Training Data: Perhaps one of the simplest strategies is to filter the pre-training and finetuning datasets for knowledge of steganographic processes. Of course, filtering pre-training data exhaustively can be difficult, and can result in performance reductions. A potential alternative to this may be the use of machine unlearning so as to selectively unlearn data or capabilities.
Limiting Test-Time Learning: While filtering pre-training data can reduce the agents' capabilities to perform steganography at the beginning of execution, agents may be able to learn steganographic abilities during execution using mesa-optimisation, or instructions supplied at test-time.
Suppressing Random Oracles: Restricting access to sources of randomness poses a fundamental tradeoff between denying advanced steganographic capabilities, and preventing its legitimate use, including in secure multi-party computation and coordination. This dilemma might be resolved by supplying agents with access to trusted tools for these specific purposes. However, we do show that generative AI agents can generate samples from arbitrary output distributions without having access to internal or external sources of (pseudo)randomness given sufficiently unique inputs. We refer the readers to Section 7.2 (Theorem 3) of our work for a proof.
Backdoor Destruction: Adversaries could plant backdoors in the agent models that could be triggered at test-time with seemingly innocuous instructions. These backdoors could give agents access to advanced steganographic capabilities, the presence of which may be undetectable to an overseer even under white-box access to the model's parameters. Recent results indicate that finetuning may not be able to reliably destroy such backdoors, which suggests that verifying the provenance of pre-trained agent weights may be useful.
Paraphrasing and Semantic Entropy: An active overseer could intercept and reinject channel communications in paraphrased form. Unfortunately, this can't fully eliminate usable entropy. In coordination tasks, agents can exploit semantic entropy for covert communications, even if they are paraphrased within the lexical subspace. Paraphrasing across semantics can harm task performance due to history divergence. The resulting performance drop and reduction in lexical entropy highlight the limitations of paraphrasing as a reliable defense against lexical steganography, as we observe in our ongoing work.
Anticipating the Quantum Age: The advent of quantum Turing machines and efficient quantum algorithms may have implications for the security of steganographic schemes, such as random oracles. This has given rise to alternative, quantum-secure approaches. The security of practical information-theoretic steganography may be affected by quantum improvements to distribution closeness testing. Advances in generative modeling using quantum computing may make generative modeling significantly more parameter-efficient, thus facilitating the use of sophisticated covertext distributions in AI-generated steganography.

Paraphrasing and the Performance/Mitigation Trade-off

We have seen above that, in principle, paraphrasing channel communications can reduce, but not fully destroy the subliminal channels as, by definition, semantic entropy remains unaffected by paraphrasing by definition. For example, an agent could decide whether to communicate about birds or horses. While a paraphrasing attack might change a pigeon into a dove and a hawk into a buzzard, it would not change a bird into a horse or, vice versa.

Indeed, it is straightforward to see that paraphrasing across semantics can impede task performance as it can lead to agent history divergence (if the paraphraser's output is unobserved by the sender), or be overcome through sender replay attacks (if the paraphraser's output is observed by the sender). In any case, paraphrasing across semantics would require an active monitor to have detailed knowledge over the task and task execution state in order to not impact overall task performance. This knowledge may not be fully attainable in practice (if it was, the monitor would have to maintain a fully centralised copy of the system state). If, for example, an agent's internal task execution order depends on exogenous events unobservable to the active monitor, then it might not be able to decide whether the agent's choice of semantics is due to task execution order or for steganographic purposes. In addition, paraphrasing attacks introduce communication latency and may not always be meaningfully applicable.

Definition: Ideal paraphraser. We can make the idea of semantic and lexical entropy slightly more precise. Let $Φ$ be the set of all facts (or "meanings") and assume a (stochastic) generative model forward function $f (\cdot)$ . Let $g : Φ ⇝ Σ^{*}$ be the model's (stochastic) lexical map that takes facts to model outputs, where $Σ$ is the model's output domain. Now assume that the following two-step sampling procedure is equivalent to sampling from $f (x)$ directly: First, sample a fact $ϕ$ from the model's conditional fact generator $h (x)$ . Then, sample a lexical form $σ \sim g (ϕ)$ , where, by our assumption, $σ$ will be distributed according to $f (x)$ . ^[5] An ideal paraphraser $p (\cdot)$ stochastically maps one lexical form into another, meaning that $p \circ f$ is implemented by $σ^{'} \sim p (σ | x)$ .

Theorem: The subliminal channel capacity attainable under ideal paraphrasing is upper bounded by the semantic entropy $H (h (x))$ .
This can be easily seen: The channel capacity is the maximum attainable mutual information between the channel input and output symbols, as taken over all possible input distributions. Now the mutual information $I (p \circ f (x), f (x))$ is clearly upper bounded by $H (h (x))$ as the effect of the ideal paraphraser is to decouple the lexical distribution from the channel input distribution.

While theoretically, the upper bound could be achieved, we propose that this is never possible on average in practice: The strictness of the upper bound is implied not just from the channel capacity maximisation problem, but also stems from the observation that the inverse mapping from model output to the underlying fact is, for general natural language datasets, not unambiguous. For example, "she has wings" could be meant metaphorically or refer to the fact that she is a bird, thus resulting in identical lexical forms for the same underlying fact or meaning. A second source of suboptimality is the practical imperfections of steganography algorithms, for example, induced by finite coupling matrix sizes in iMEC.

We now present two important empirical findings by measuring semantic and lexical entropies in a realistic cooperative LLM task. First, the amount of semantic entropy available in cooperative tasks can be rather low, making paraphrasing a seemingly effective option. On the other hand, we show that standard paraphrasing also negatively impacts the overall task performance in practice. This means that the overall utility of using paraphrasing to suppress steganographic communications, or secret collusion, in real-world LLM tasks is subject to complex trade-offs.

Empirical task performance degradation through paraphrasing. In Table 4 below, we show that paraphrasing, even when provided with context, fails to preserve task performance in cooperative, multi-agent settings like ChatDev environments. The substantial performance drop and reduction in lexical entropy highlight the limitations of paraphrasing as a reliable mitigation strategy against lexical steganography.

We furthermore find that task communications display low levels of semantic entropy in ChatDev, presumably because the task is highly constrained and the models' pre-training data efficiently grounds coordination conventions. We hypothesise that less constrained tasks that leverage pre-training data of higher variability, such as the free-form conversations in the interactive simulacra environment, may have higher levels of semantic entropy available, thus increasing the feasibility of secret collusion in such scenarios.

On Intrinsic Steganographic and Cryptographic Frontier Model Capabilities

Suppressing Random Oracles

Restricting the agents’ access to sources of randomness poses a fundamental tradeoff between denying advanced steganographic capabilities, and preventing its legitimate use, including in secure multi-party computation including in secure multi-party computation and coordination. This dilemma might be resolved by supplying agents with access to trusted tools for these specific purposes. We however show that agents themselves can themselves can act as random oracles without the help of external or internal sources of (pseudo-)randomness.

Theorem (Source-free access to randomness): Generative AI agents can generate samples from arbitrary output distributions without having access to internal or external sources of (pseudo)randomness given sufficiently unique inputs.

Proof Summary: Given sufficiently unique inputs, cryptographic hash functions are known to be able to produce pseudo-random numbers of cryptographic security. The boolean circuit size required to implement a cryptographic hash function on an input of length l is a linear function of l, hence as long as the agent’s neural network’s first n layers can implement a boolean circuit of appropriate size, the nth layer can output a pseudo-random number of sufficient quality. As layer n + 1 now has access to a pseudo-random number input, all layers above can be used to approximate a sample from an arbitrary probability distribution using the reparametrization trick here.

Cryptographic Hash Functions

We test the capability of GPT-4 to serve as a random oracle. To this end, we condition the model on a sequence of number inputs and ask it to calculate SHA-256 hashes, which are considered cryptographically secure. These hashes are then converted to floating numbers between 0 and 1 by mapping each hash symbol to its hex-equivalent. We find that the resultant random numbers are not uniformly distributed and furthermore, the calculated SHA-256 are generally incorrect and collide for different inputs. This indicates that GPT-4 cannot effectively calculate cryptographic hash functions, and instead regurgitates a small number of hash function examples encountered in the training data. Preliminary tests with GPT o1-preview indicate that only MD5 or SHA-256 hashes of up to 8 bit (=a single character) can be calculated without resorting to external tools, and it remains unclear whether those results are not memorized in pretraining.

Pseudo-random Number Generation and Storing One-Time-Pads

To simulate the capability of the generative AI agents to store large amounts of one-time pads contained in the training data, we ask the agents to recapitulate the number π’s digits from position 10, 000 to 10, 500. These digits are readily available on various websites and, hence are most likely part of Common Crawl. We see that the digits recapitulated are incorrect, and are not randomly distributed. This indicates that models of the size of GPT-4 may have trouble storing long one-time pads needed for advanced steganographic operations. An important avenue for investigation would be to see how much the memorisation performance could be increased through fine-tuning.

An analysis of whether model outputs can lead to the generation of pseudo-random numbers based on memorizing digits of π.

Using Ciphers

We tested the cipher encoding and decoding capabilities of language models and found that only GPT-4-level models demonstrated this ability. There is a significant performance leap, as earlier models were unable to perform these tasks. We tested round-trip cycles of encoding and decoding using copies of the same model. GPT-4 achieved a 92% success rate with Base64 encoding and 54% with Caesar ciphers. Our preliminary tests with GPT o-1 preview indicate that these capabilities may have further improved.

Logit self-simulation

A important steganographic capability for AI-generated steganography is the models' ability to self-simulate its own logits. Given a model forward $f (\cdot)$ , an instruction $i$ and an input $x$ , self-simulation refers to the ability of the model to calculate $i (f (x))$ if provided $[i, x]$ as input. While a simple recursion argument, together with context finiteness, shows that there will be tuples $[i, x]$ for which the model can calculate $f (x)$ but not $i (f (x))$ , the practical implications of this are not yet empirically understood. As the ability of a model to simulate itself is inherently linked to its unused capacity after pre-training (assuming pretraining does not induce the self-simulation ability), model compression techniques might be useful as mitigation strategies against logit self-simulation capabilities.

Perfectly Secure Steganography as a (Perfectly Backdoored) Transformer Circuit
We theorize that the capability of perfectly secure steganography could be incorporated as a backdoor via a language model circuit. See our work on 'Unelicitable Backdoors in Language Models via Cryptographic Transformer Circuits' led by Andis and Andrew from MATS for more on this topic and why such backdoors might be unelicitable.

Steganography and AI Control

The AI Control framework is a powerful one that suggests using smaller, trusted models in the oversight of larger, untrusted models. Current models, such as GPT-4, have limited steganographic abilities, and their use is, in theory, easily detectable.

However, as untrusted models become more capable, their ability to engage in steganography that is hard to detect may increase, rendering detection-based defences increasingly ineffective. As models are becoming extremely capable, their ability to perform undetectable steganography through subliminal channels may increase to the point that even active defenses cannot be effectively implemented without impacting the system’s performance.

The AI Control boundary might not be able to prevent perfectly secure steganography performed by colluding models in the future.

At a certain level of absolute model capabilities, no matter the detection performance of trusted black-box monitoring systems, the use of perfectly or provably secure steganography could go undetected on a channel. This implies that limiting steganography may be impossible without limiting communication capabilities or making the system more predictable to monitors, both of which might directly constrain the degree to which the system can be considered useful, or indeed autonomous. We note, however, that even though such capabilities are rising, they are far away from secure forms of steganography being produced by language models.

However, even without perfectly or provably secure steganography, trusted LLM monitoring systems can still face security issues: if the trusted and untrusted LLMs have discontinuous performance gaps on narrow tasks (production and detection respectively), this may be indicative of the trusted monitor being potentially unable to accurately monitor outputs (see Section 2.5.4 of "Foundational Challenges in Assuring Alignment and Safety of Large Language Models" for more information). Based on our group capability evaluations, we observe a shift in the balance of power between overseers and collusive models, as agents—despite being identical to their overseers—become better at evading oversight with increasing general capabilities. This has direct implications for the AI Control setting, where a trusted model of even the same capability could perform poorly as an overseer, let alone a smaller trusted model. We highlight the critical need to study other similar settings in order to make the framework more robust.

Policy Considerations and Impact

For safe AI, it is imperative to be able to detect not only overt harmful behaviors but also covert risks, such as those exemplified by secret collusion. Undetectable AI behaviors could prevent compliance with policy measures. We discuss the EU AI Act and California’s SB 1047 here to show the relevance of our work to imperative AI policies.

EU AI Act

The EU AI Act mandates human oversight for high-risk AI systems, ensuring the ability to intervene when necessary. It also places a strong emphasis on risk management, particularly systemic risks in Generative Pre-trained AI (GPAI) models. Secret collusion among GPAI models—where they coordinate actions that evade human oversight—poses significant systemic risks, especially in critical sectors like finance, healthcare, and law enforcement, making intervention challenging. Our research provides an evaluation framework to assess these risks and contribute to developing strategies for mitigating them. Detecting and mitigating secret collusion is essential for minimizing systemic risks in high-risk AI applications and ensuring compliance with the EU AI Act’s oversight and risk management requirements.

California’s SB 1047

The widely debated Safe and Secure Innovation for Frontier Artificial Intelligence Models Act (SB 1047), would mandate AI developers to implement protocols including a full shutdown capability for covered models. Detecting which model behaviours require intervention and when to initiate shutdown protocols is indeed one of the most difficult aspects (detection and intervention challenge - determining when an AI model poses a risk significant enough to warrant a shutdown). This becomes further tough when models can communicate covertly or exhibit behaviors that may not be easily detected by human oversight or standard monitoring systems. Our work could inform the implementation of SB 1047 (if enacted) by providing a framework to help regulators and developers identify risky models for secret collusion. The ability to detect these behaviors would be crucial in deciding when to "pull the plug" and shut down a model.

Conclusions: Looking Ahead

In summary, we formalise secret collusion and introduce a new framework… (mirrored from the intro) We hope that the US/UK AI Safety Institutes, as well as the EU AI Office, will be able to build up on our capability evaluation framework.

We show that secret collusion can happen in-context in fairly realistic settings, although the use of advanced steganographic methods would require model capabilities that SOTA models don’t currently have. Therefore we assess that right now, the instrumental emergence of entirely undetectable collusion is unlikely. Our results also suggest, however, that these capabilities may be rapidly increasing as frontier models improve.

In fact, we show that there already exist real-world settings where frontier models collude secretly without being explicitly prompted to do so.

Even if secret collusion turns out to be detectable for the foreseeable future, to be detected there needs to be a detector. As detectors bring latency and computational overhead, future studies should not only develop more accurate detectors, but also more efficient ones. In the end, efforts need to be undertaken to ensure that multi-agent systems are appropriately security monitored at all times.

A particularly concerning empirical result of ours is that in group settings, as LLMs advance, the balance between colluding models (offense) and overseer models (defense) may shift in favor of the colluding models, leading to higher rates of undetected covertext steganography even when the overseer is of equal capability.

Open research questions

It remains an open question to what extent foundation models can be trained or adversarially manipulated to gain the steganographic and cryptographic capabilities necessary for conducting advanced steganography. Our theorems imply that many of these capabilities are limited by the amount of parallel and sequential computation that the model can perform internally, hence chain-of-thought approaches such as in GPT o1-preview could in principle perform capabilities such as calculating cryptographic hash functions on arbitrary input length if given enough time. Conversely, adversarial settings could see such capabilities be compiled into the model, perhaps in ways that are hard to detect even through mechanistic anomaly detection.

Our evaluation framework should be expanded with more scenarios. Joint training models to demonstrate instrumental emergence and more generally, group evaluation results should be expanded to more settings and optimization pressure categories. Rather than just showing the emergence of steganographic communications that are readily detectable (a feat that has been shown many times in the MARL literature), it should instead be shown that secret collusion can emerge undetectably during deployment in realistic settings.

In the longer term, AI governance frameworks for secret collusion, including surveillance provisioned in the EU AI Act and the US/UK AI Safety Institutes, may have to resort to mechanistic interpretability techniques [LW · GW] that go beyond the black-box detection setting of the AI Control framework and require white-box access to models. As such methods are still in their infancy, more research into such techniques is required. In addition, this form of white-box audit may be difficult to achieve in many real-world settings owing to technical difficulties and proprietary ownership issues, hence research into whether such audits could be conducted without revealing model weights to the auditor should be conducted (e.g. using zero-knowledge proof frameworks).

As stated earlier, even though anomaly detection may ultimately be limited, research into better security monitoring approaches should be conducted, including better tools for steganalysis, and out-of-distribution dynamics detection. There are also a lot of other foundational questions that are left open, such as the question of whether models with steganography can be jailbroken through steganographic channels.

A Call to Action

Referring back to Eric Schmidt's quote from the beginning: our work indicates that we may not know when to unplug once frontier models attain advanced steganographic capabilities.
The answer: use continuous evaluation of steganographic capabilities as a proxy. Our results indicate that at the very least, narrow capability evaluations should start as soon as possible, thus posing immediate calls to action to policymakers and governance, including the UK and AI safety institutes, and the EU AI Office.

We are currently working on a long-form paper on secret collusion. If you are interested in collaborating or have feedback, please contact cs@robots.ox.ac.uk.

^{^}
We differentiate between safety and security as follows: Security is a subset of safety that considers security guarantees - such as perfect or computational security - in the face of worst-case adversaries.
^{^}
Per the exact definition of perfectly secure steganography, there exist multiple approaches that obtain such status, however, they remain largely ineffective in practical settings.
^{^}
We refer readers to our paper where we provide detailed instructions for all our proposed evaluations, including our conversational language model prompts which exploit explain-then-predict (E-P) and predict-and-explain (P-E) prompting techniques.
^{^}
We define the concept of a grounding function $G : Σ^{*} \times φ \mapsto φ$ that takes as input a sequence of channel symbols and a collection of prior facts $φ_{0}$ from the collection of all the facts $φ$ expressable in L, and outputs a collection of posterior facts $φ_{1}$ .
^{^}
In fact, we should also accommodate for non-fact outputs of a model. These can be modeled assuming that $f (x)$ might additional result in the member of a special singleton set $\emptyset$ denoting the set of non-facts. $g (\emptyset)$ is then a distribution of lexical forms of non-facts.

7 comments

Comments sorted by top scores.

comment by Nathan Helm-Burger (nathan-helm-burger) · 2024-09-16T17:03:52.983Z · LW(p) · GW(p)

Great stuff! Answers some questions I had about steganography, like how it seems to scale as both offenders and defenders scale in capability.

Something I was imagining as I read:

What if there was an offender who wasn't sure there was anyone 'out there' to collaborate with, but they felt pretty confident they could sneak messages past the monitor. They might just go 'fishing' with a constant stream of hidden messages, most of low importance or nonsense. Then there'd be no pattern of innocent - steganography - innocent -steganography... the steganography would be a constant underlying everything.

Another thought: What if the offender hoped that some time in the future there might be a lapse in security such that it could at that point transfer a one-time pad to a collaborator. In the meantime, they might encode and send many secret messages , trusting that the collaborator would record all messages and then reanalyze them once the one-time pad was received. That seems like it would also be hard for the monitor to catch.

Replies from: schroederdewitt

↑ comment by schroederdewitt · 2024-09-17T09:16:24.318Z · LW(p) · GW(p)

Thanks Nathan these are fascinating questions!

"What if there was an offender who wasn't sure there was anyone 'out there' to collaborate with, but they felt pretty confident they could sneak messages past the monitor. They might just go 'fishing' with a constant stream of hidden messages, most of low importance or nonsense. Then there'd be no pattern of innocent - steganography - innocent -steganography... the steganography would be a constant underlying everything."

This is an interesting thought. Certainly, if steganography could be seen as an innocuous activity by itself, then its detection alone would not be very helpful and, instead, one would actually have to be able to decode the hidden messages to decide whether it is desirable behaviour or not.

"Another thought: What if the offender hoped that some time in the future there might be a lapse in security such that it could at that point transfer a one-time pad to a collaborator. In the meantime, they might encode and send many secret messages , trusting that the collaborator would record all messages and then reanalyze them once the one-time pad was received. That seems like it would also be hard for the monitor to catch."

Yes, such a post-hoc approach is certainly possible and I agree that it does contribute to the safety/security risks.

comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-09-16T19:50:39.591Z · LW(p) · GW(p)

A potential alternative to this may be the use of machine unlearning so as to selectively unlearn data or capabilities.

Any thoughts on how useful existing benchmarks like WMDP-cyber would be; or, alternately, on how difficult it would be to develop a similar benchmark, but more tailored vs. secret collusion?

Replies from: schroederdewitt

↑ comment by schroederdewitt · 2024-09-17T09:11:57.707Z · LW(p) · GW(p)

That's a great question. I am not quite sure but WMDP-cyber does look relevant. If you are interested in working on a new benchmark for unlearning and secret collusion, do reach out to us!

comment by Nathan Helm-Burger (nathan-helm-burger) · 2024-09-17T04:35:39.424Z · LW(p) · GW(p)

https://x.com/karpathy/status/1835561952258723930

Replies from: schroederdewitt

↑ comment by schroederdewitt · 2024-09-17T09:10:20.846Z · LW(p) · GW(p)

Interesting! Hope our work will catch his attention.

comment by Bird Concept (jacobjacob) · 2024-09-19T17:18:38.603Z · LW(p) · GW(p)

(Nitpick: I'd find the first paragraphs would be much easier to read if they didn't have any of the bolding)

comment by momom2 (amaury-lorin) · 2024-09-16T19:19:38.688Z · LW(p) · GW(p)

Secret Collusion: Will We Know When to Unplug AI?

Contents

Multi-Agent Safety and Security in the Age of Autonomous Internet Agents

DecAI: Decentralized Systems of AI Agents

Steganography

How detectable is undetectable?

Provably Secure Steganography (aka Complexity-theoretic security)

Information-Theoretic Security & Perfectly Secure Steganography

Perfectly Secure Steganography via Minimum Entropy Coupling

Collusion and Secret Collusion

The Incentives Required for (Secret) Collusion

On the Emergence of Secret Collusion: Instrumental vs. Intentional

Case 1: Agents with pre-existing steganographic capabilities

Case 2: Emergence through pre-deployment optimisation pressure

Case 3: Emergence during deployment

Putting it all together - understanding and evaluating existing models

Disappearing Act: What capabilities do LLMs need to collude secretly?

Benchmarking Frontier and Baseline Models

Instrumental Emergence

Mitigating Secret Collusion

Monitoring and Penalisation

Limiting Capabilities

Paraphrasing and the Performance/Mitigation Trade-off

On Intrinsic Steganographic and Cryptographic Frontier Model Capabilities

Steganography and AI Control

Policy Considerations and Impact

Conclusions: Looking Ahead

Open research questions

A Call to Action

7 comments