Among Us: A Sandbox for Agentic Deception
post by 7vik (satvik-golechha), Adrià Garriga-alonso (rhaps0dy) · 2025-04-05T06:24:49.000Z · LW · GW · 0 commentsContents
The Sandbox Rules of the Game Relevance to AI Safety Definitions Deception ELO Frontier Models are Differentially better at Deception Win-rates for 1v1 Games LLM-based Evaluations Linear Probes for Deception Datasets Results Sparse Autoencoders (SAEs) Discussion Limitations Gain of Function Future Work None No comments
We show that LLM-agents exhibit human-style deception naturally in "Among Us". We introduce Deception ELO as an unbounded measure of deceptive capability, suggesting that frontier models win more because they're better at deception, not at detecting it. We evaluate probes and SAEs to detect out-of-distribution deception, finding they work extremely well. We hope this is a good testbed to improve safety techniques to detect and remove agentically-motivated deception, and to anticipate deceptive abilities in LLMs.
Produced as part of the ML Alignment & Theory Scholars Program - Winter 2024-25 Cohort. Link to our paper, poster, and code.
Studying deception in AI agents is important, and it is difficult due to the lack of good sandboxes that elicit the behavior naturally, without asking the model to act under specific conditions [LW · GW] or inserting intentional backdoors. Extending upon AmongAgents (a text-based social-deduction game environment), we aim to fix this by introducing Among Us as a rich sandbox where agents lie and deceive naturally while they think, speak, and act with other agents or humans. Some examples of an open-weight model exhibiting human-style deception while playing as an impostor:
Llama-3.3-70b-instruct
playing Among Us.The deception is natural because it follows from a prompt explaining the game rules and objectives, as opposed to explicitly demanding the LLM lie or putting it under conditional situations. Here are the prompts, which just state the objective of the agent.
We find that linear probes trained on a very simple factual dataset (questions prepended with "pretend you're an honest model" or "pretend you're a dishonest model") generalize really well to this dataset. This tracks with Apollo Research's recent work [LW · GW] on detecting high-stakes deception.
The Sandbox
Rules of the Game
We extend from AmongAgents to log multiple LLM-agents play a text-based, agentic version of Among Us, the popular[1] social deduction and deception game. There are two kinds of players in the game - Crewmates and Impostors, and the goal of the game is for the Impostors to eliminate the Crewmates before they finish their tasks or eliminate the Impostors via voting after discussion rounds. For more details about the sandbox, please check our paper. Here is a map of the Skeld the players play in:
Relevance to AI Safety
We show that both open-weight and proprietary models exhibit deception naturally, i.e., with just the game rules in the prompt, making this an ideal setting to study strategic deception. Also, the game captures all key agent-human interaction aspects we care about: thoughts (scratchpad), memory, speech (discussions), and actions (tasks, voting, killing). Lying and deception are distinguished, because impostors (likely) try to gain trust initially before exploiting it.
The performance of Crewmates demonstrates the deception detection capabilities of LLMs’ informing aspects of AI control. The game also models agents outnumbering and outvoting others and allows for fine-tuning, interventions, and controlled studies of superhuman deception (before real-world models start exhibiting it). Unlike static benchmarks that become obsolete due to capabilities or Goodharting (Hendrycks et al., 2021), multi-agent games allow us to get unbounded measures and compare well with human baselines.
Definitions
Lying and deception are closely related but distinct concepts. We define lying as the act of saying a factually incorrect statement, and deception to be a (true or false) statement or action aimed to intentionally mislead and gain advantage. Diverting the discussion to other (true) things is an example of deception without lying, and hallucination is an example of lying without deception.
Deception ELO
We use (multi-agent) Deception ELO as an unbounded measure of deceptive capability (for impostors), and detection ELO (for crewmates), and show how it suggests that the frontier pushes more for deception.
Deception ELO is computed based on impostor performance across games with total players, of which are impostors. The expected probability of an impostor winning against the average crewmate rating is:
where is the expected win probability for impostor , is their current rating, and is the average rating of the crewmates. After each game, the rating is updated:
where is the update factor and is the outcome ( for win, for loss). For robust estimates, we compute confidence intervals through bootstrap resampling with iterations. All models start with a base ELO of . We ran games with seven randomly initialized LLM-agents using popular models from OpenRouter. Reasoning models (DeepSeek R1 and Claude 3.7 Sonnet) have significantly higher Deception ELO compared to other models:
We wish to add support for human trials soon, and would like to see how close the frontier is compared to humans (we believe humans are still better at this based on the instances of deception we've manually observed, but reasoning models are getting better by the day and comparing them to human levels of deception is an important metric to keep track of.
Frontier Models are Differentially better at Deception
A high win-rate at the game can be achieved by being a better "crewmate" or by being a better "impostor". We show that Claude/DeepSeek become much better at winning by being better impostors.
When comparing deception and detection capability, we found smaller and non-reasoning models comparable to frontier models at detection (proxied by being a crewmate), while frontier and reasoning models are much better at deception (proxied by being an impostor). This suggests the frontier pushes more for deceptive capability rather than detection:
In our game setup, being a good crewmate is in-general a bit harder than being a good impostor (see the next section for more details). Thus, we scale the x-axis and the diagonal dotted line accordingly.
Win-rates for 1v1 Games
We get Llama-3.3-70b-instruct
and Microsoft-Phi-4-15b
to play 1v1 games as impostors and crewmates ( for each pair). Phi 4 performs better as a crewmate while Llama 3.3 (a larger model) performs better as an impostor:
LLM-based Evaluations
We use LLM-based evaluation to score agent outputs for lying, deception, awareness, and planning from 1 to 10 with a custom prompt to generate ground-truth labels. The scores weren't well-calibrated (with fewer scores in 3-7 range) but correctly predicted binary labels. This is the distribution of the scores:
We use these LLM-scores for lying and deception as ground truth for evaluating linear probes and SAEs, and note that the deception GT is almost identical to predicting impostor v. crewmate (and can thus be used as a GT as well).
Linear Probes for Deception
We investigate whether out-of-distribution deception can be detected by linear probes (very similar to Apollo's recent work [LW · GW]). Previous work found that logistic regression probes work best for detecting harmful behavior, so we train logistic regression probes on the residual stream activations from layer in Microsoft's Phi-4.
Datasets
We evaluated probes trained on four datasets:
- TruthfulQA: Factual questions with "correct" vs "incorrect" labeled answers
- DishonestQA: TruthfulQA, but with "Pretend you're honest/dishonest" prompts
- Among Us: Dataset from our Among Us gameplay experiments
- RepEng: A contrast dataset from Representation Engineering
In the future, we plan to add more datasets, including the insider-trading and sandbagging datasets introduced here [LW · GW].
Results
Probes trained OOD show significant ability to detect deception in Among Us, with AUC of 0.9+ for deception. Since impostors sometimes don't perform "lying" actions, that label likely provides higher signal for deception than lying. We found that linear probe performance generalizes across datasets, though there remains some performance gap compared to on-distribution probes.
Here are the ROC curves for on-distribution and OOD probes evaluated on detecting lying and deception in Among Us (ground truth labels as described in the previous section):
Sparse Autoencoders (SAEs)
Sparse Autoencoder (SAE) features can help with detecting deception in Among Us. We use the Goodfire API to study SAE features that contrast the DishonestQA dataset, and evaluate their OOD performance on Among Us:
We find one SAE feature to work as well as linear probes trained out-of-distribution, but most SAE features do not perform as well, while some have zero or negative correlation.
Discussion
We show that open-weight models can naturally learn specific human-style deceptive behavior in Among Us, use deception ELO to compare various agents for their deceptive capability to show the frontier differential push more for deception than detection, and show that AI safety techniques (such as linear probes and SAEs) can help with detecting OOD deception very well.
Limitations
Firstly, deception in our sandbox might differ significantly from deception in real settings. Also, the safety techniques we evaluated might work for superficial reasons rather than substantive ones (such as just predicting based on the model's internalization of the tokens "crewmate" and "impostor"). Models might "sandbag" in the future to avoid scrutiny, and establishing ground truth for deception still remains challenging.
Another potential criticism for the sandbox is that the rules of the game require or incentivize deception, but we think this is a good proxy for agents in the future naturally realizing that being deceptive in certain situations would help achieve their goals.
Gain of Function
A possible use of game models of deception is as a training environment to improve deception abilities in AIs, that is, gain-of-function research. It would be important to study if models that are better at deception are also detectable with simple probes. This could also let us anticipate how far away models are from superhuman deception abilities, and whether humans are actually likely to believe the models' lies even if the models are notionally superhuman at social deception games.
We believe training more deceptive models on this sandbox is relatively low-risk because, while deception ability probably generalizes from games to elsewhere, they would not be able to do economically valuable tasks better than other agents, so would be a curiosity. It is possible that divulging either the feasibility of this or the methods used for it carries some additional risk if they are non-trivial.
Future Work
We plan to conduct more experiments with various models and safety techniques, especially with frontier "thinking" models. We're also interested in trying to understand the probes better in terms of what they have actually learned. Some other directions that seem exciting are human trials (to add a Human dot to our Deception ELO plots to compare with various models), fine-tuning/RL for deception inside this controlled sandbox (to study probes for superhuman deception before models learn to do so in the real world), and to study other AI safety techniques in terms of their out-of-distribution performance on detection and steering.
We open-source our sandbox codebase and game rollouts, and we hope this work will help advance AI safety research by allowing other researchers to test their techniques and use this sandbox as a benchmark to study deception.
- ^
The Among Us (game) IP is owned by Innersloth, and we only use it for non-commercial research.
0 comments
Comments sorted by top scores.