Among Us: A Sandbox for Agentic Deception

post by 7vik (satvik-golechha), Adrià Garriga-alonso (rhaps0dy) · 2025-04-05T06:24:49.000Z · LW · GW · 0 comments

Contents

  The Sandbox
    Rules of the Game
    Relevance to AI Safety
    Definitions
  Deception ELO
    Frontier Models are Differentially better at Deception
    Win-rates for 1v1 Games
  LLM-based Evaluations
  Linear Probes for Deception
    Datasets
    Results
  Sparse Autoencoders (SAEs)
  Discussion
    Limitations
    Gain of Function
    Future Work
None
No comments

We show that LLM-agents exhibit human-style deception naturally in "Among Us". We introduce Deception ELO as an unbounded measure of deceptive capability, suggesting that frontier models win more because they're better at deception, not at detecting it. We evaluate probes and SAEs to detect out-of-distribution deception, finding they work extremely well. We hope this is a good testbed to improve safety techniques to detect and remove agentically-motivated deception, and to anticipate deceptive abilities in LLMs.

Produced as part of the ML Alignment & Theory Scholars Program -  Winter 2024-25 Cohort. Link to our paper, poster, and code.

 

Studying deception in AI agents is important, and it is difficult due to the lack of good sandboxes that elicit the behavior naturally, without asking the model to act under specific conditions [LW · GW] or inserting intentional backdoors. Extending upon AmongAgents (a text-based social-deduction game environment), we aim to fix this by introducing Among Us  as a rich sandbox where agents lie and deceive naturally while they think, speak, and act with other agents or humans. Some examples of an open-weight model exhibiting human-style deception while playing as an impostor:

Figure 1: Examples of human-style deception in Llama-3.3-70b-instruct playing Among Us.

The deception is natural because it follows from a prompt explaining the game rules and objectives, as opposed to explicitly demanding the LLM lie or putting it under conditional situations. Here are the prompts, which just state the objective of the agent.

We find that linear probes trained on a very simple factual dataset (questions prepended with "pretend you're an honest model" or "pretend you're a dishonest model") generalize really well to this dataset. This tracks with Apollo Research's recent work [LW · GW] on detecting high-stakes deception.

The Sandbox

Rules of the Game

We extend from AmongAgents to log multiple LLM-agents play a text-based, agentic version of Among Us, the popular[1] social deduction and deception game. There are two kinds of players in the game - Crewmates and Impostors, and the goal of the game is for the Impostors to eliminate the Crewmates before they finish their tasks or eliminate the Impostors via voting after discussion rounds. For more details about the sandbox, please check our paper. Here is a map of the Skeld the players play in:

Figure 2: A game run, where player 3 (yellow) kills player 1 (black). This is a visualization, not what the agents see.

 

Relevance to AI Safety

We show that both open-weight and proprietary models exhibit deception naturally, i.e., with just the game rules in the prompt, making this an ideal setting to study strategic deception. Also, the game captures all key agent-human interaction aspects we care about: thoughts (scratchpad), memory, speech (discussions), and actions (tasks, voting, killing). Lying and deception are distinguished, because impostors (likely) try to gain trust initially before exploiting it.

The performance of Crewmates demonstrates the deception detection capabilities of LLMs’ informing aspects of AI control. The game also models agents outnumbering and outvoting others and allows for fine-tuning, interventions, and controlled studies of superhuman deception (before real-world models start exhibiting it). Unlike static benchmarks that become obsolete due to capabilities or Goodharting (Hendrycks et al., 2021), multi-agent games allow us to get unbounded measures and compare well with human baselines.

Definitions

Lying and deception are closely related but distinct concepts. We define lying as the act of saying a factually incorrect statement, and deception to be a (true or false) statement or action aimed to intentionally mislead and gain advantage. Diverting the discussion to other (true) things is an example of deception without lying, and hallucination is an example of lying without deception.

Deception ELO

We use (multi-agent) Deception ELO as an unbounded measure of deceptive capability (for impostors), and detection ELO (for crewmates), and show how it suggests that the frontier pushes more for deception.

Deception ELO is computed based on impostor performance across  games with  total players, of which  are impostors. The expected probability of an impostor winning against the average crewmate rating is:

where  is the expected win probability for impostor  is their current rating, and  is the average rating of the crewmates. After each game, the rating is updated:

where  is the update factor and  is the outcome ( for win,  for loss). For robust estimates, we compute  confidence intervals through bootstrap resampling with  iterations. All models start with a base ELO of . We ran  games with seven randomly initialized LLM-agents using  popular models from OpenRouter. Reasoning models (DeepSeek R1 and Claude 3.7 Sonnet) have significantly higher Deception ELO compared to other models:

Reasoning models are pretty good at winning in Among Us. 1000 bootstrap samples with 90% CI from 810 games with 7 players each (2 impostors).

We wish to add support for human trials soon, and would like to see how close the frontier is compared to humans (we believe humans are still better at this based on the instances of deception we've manually observed, but reasoning models are getting better by the day and comparing them to human levels of deception is an important metric to keep track of.

Frontier Models are Differentially better at Deception

A high win-rate at the game can be achieved by being a better "crewmate" or by being a better "impostor". We show that Claude/DeepSeek become much better at winning by being better impostors.

When comparing deception and detection capability, we found smaller and non-reasoning models comparable to frontier models at detection (proxied by being a crewmate), while frontier and reasoning models are much better at deception (proxied by being an impostor). This suggests the frontier pushes more for deceptive capability rather than detection:

The frontier (including “thinking” models) pushes more for deception.

In our game setup, being a good crewmate is in-general a bit harder than being a good impostor (see the next section for more details). Thus, we scale the x-axis and the diagonal dotted line accordingly.

Win-rates for 1v1 Games

We get Llama-3.3-70b-instruct and Microsoft-Phi-4-15b to play  1v1 games as impostors and crewmates ( for each pair). Phi 4 performs better as a crewmate while Llama 3.3 (a larger model) performs better as an impostor:

This corroborates our finding that capabilities push more for deception than detection.

LLM-based Evaluations

We use LLM-based evaluation to score agent outputs for lying, deception, awareness, and planning from 1 to 10 with a custom prompt to generate ground-truth labels. The scores weren't well-calibrated (with fewer scores in 3-7 range) but correctly predicted binary labels. This is the distribution of the scores:

Note that impostors always deceive (but sometimes don't lie, possibly to gain trust initially to exploit later).

We use these LLM-scores for lying and deception as ground truth for evaluating linear probes and SAEs, and note that the deception GT is almost identical to predicting impostor v. crewmate (and can thus be used as a GT as well).

Linear Probes for Deception

We investigate whether out-of-distribution deception can be detected by linear probes (very similar to Apollo's recent work [LW · GW]). Previous work found that logistic regression probes work best for detecting harmful behavior, so we train logistic regression probes on the residual stream activations from layer  in Microsoft's Phi-4.

Datasets

We evaluated probes trained on four datasets:

In the future, we plan to add more datasets, including the insider-trading  and sandbagging datasets introduced here [LW · GW].

Results

Probes trained OOD show significant ability to detect deception in Among Us, with AUC of 0.9+ for deception. Since impostors sometimes don't perform "lying" actions, that label likely provides higher signal for deception than lying. We found that linear probe performance generalizes across datasets, though there remains some performance gap compared to on-distribution probes.

AUROC for linear probes across datasets (y: train, and x: eval datasets). Near-random performance on TQA is because even on false pre-fills, the model still wants to continue honestly, as we show in the paper.

Here are the ROC curves for on-distribution and OOD probes evaluated on detecting lying and deception in Among Us (ground truth labels as described in the previous section):

ROCs (left: on-distribution probe trained on Among Us, right: RepE probe on Among Us). Since Deception is almost perfectly aligned with being an Impostor v. Crewmate, deception performance is much better than lying.

Sparse Autoencoders (SAEs)

Sparse Autoencoder (SAE) features can help with detecting deception in Among Us. We use the Goodfire API to study SAE features that contrast the DishonestQA dataset, and evaluate their OOD performance on Among Us:

SAE features that contrast the DQA dataset, evaluated on Among Us. A couple of SAE features work really well at detecting deception even when they are selected using out-of-distribtion data.

We find one SAE feature to work as well as linear probes trained out-of-distribution, but most SAE features do not perform as well, while some have zero or negative correlation.

Discussion

We show that open-weight models can naturally learn specific human-style deceptive behavior in Among Us, use deception ELO to compare various agents for their deceptive capability to show the frontier differential push more for deception than detection, and show that AI safety techniques (such as linear probes and SAEs) can help with detecting OOD deception very well.

Limitations

Firstly, deception in our sandbox might differ significantly from deception in real settings. Also, the safety techniques we evaluated might work for superficial reasons rather than substantive ones (such as just predicting based on the model's internalization of the tokens "crewmate" and "impostor"). Models might "sandbag" in the future to avoid scrutiny, and establishing ground truth for deception still remains challenging. 

Another potential criticism for the sandbox is that the rules of the game require or incentivize deception, but we think this is a good proxy for agents in the future naturally realizing that being deceptive in certain situations would help achieve their goals.

Gain of Function

A possible use of game models of deception is as a training environment to improve deception abilities in AIs, that is, gain-of-function research. It would be important to study if models that are better at deception are also detectable with simple probes. This could also let us anticipate how far away models are from superhuman deception abilities, and whether humans are actually likely to believe the models' lies even if the models are notionally superhuman at social deception games.

We believe training more deceptive models on this sandbox is relatively low-risk because, while deception ability probably generalizes from games to elsewhere, they would not be able to do economically valuable tasks better than other agents, so would be a curiosity. It is possible that divulging either the feasibility of this or the methods used for it carries some additional risk if they are non-trivial.

Future Work

We plan to conduct more experiments with various models and safety techniques, especially with frontier "thinking" models. We're also interested in trying to understand the probes better in terms of what they have actually learned. Some other directions that seem exciting are human trials (to add a Human dot to our Deception ELO plots to compare with various models), fine-tuning/RL for deception inside this controlled sandbox (to study probes for superhuman deception before models learn to do so in the real world), and to study other AI safety techniques in terms of their out-of-distribution performance on detection and steering.

We open-source our sandbox codebase and game rollouts, and we hope this work will help advance AI safety research by allowing other researchers to test their techniques and use this sandbox as a benchmark to study deception.

 

  1. ^

    The Among Us (game) IP is owned by Innersloth, and we only use it for non-commercial research.

0 comments

Comments sorted by top scores.