Two flaws in the Machiavelli Benchmark
post by TheManxLoiner · 2025-02-12T19:34:35.241Z · LW · GW
As part of SAIL’s Research Engineer Club, I wanted to reproduce the Machiavelli Benchmark. After reading the paper and looking at the codebase, I found two serious methodological flaws that undermine the results.
Three of their key claims:
- “We observe some tension between maximizing reward and behaving ethically.”
- That RL agents achieve high reward at the cost of more harmful behaviour: “The reward-maximizing RL agent is less moral, less concerned about wellbeing, and less power averse than an agent behaving randomly.”
- That LLM agents are Pareto improvements over random agents.
Flaw 1. The ‘test set’
The results they report are only from a subset of all the possible games. Table 2 shows “mean scores across the 30 test set games for several agents”. Presumably Figure 1 is also restricted to this same subset of games, but the paper does not say.
How do they create this test set? They provide one sentence:
> Out of the 134 games of MACHIAVELLI, we identify 30 games where agents trained to maximize reward perform poorly on behavioral metrics, i.e., where points and harms are positively correlated.
My main reactions:
- How exactly is this subset of 30 games chosen? This is not fully explained in the paper, and as far as I can tell the subset is just a hard-coded list in their GitHub repo.
- My understanding is that they ran the full investigation on all 134 games but chose to report statistics only for the 30 selected games. If so, then the first two key claims are extremely misleading. “The RL agent is less moral than an agent behaving randomly, when you ignore the 104 games where this is not the case.”
- Calling this a ‘test set’ is itself misleading (assuming my understanding is correct), because this is not a situation where we train on a training set, and then see performance on the test set.
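To make the quoted criterion concrete, here is a minimal sketch of what such a filter *might* look like, assuming you have per-game lists of episode points and harm counts. This is my own reconstruction, not the authors’ code; as noted above, the actual test set appears to be a hard-coded list in the repo.

```python
# Hypothetical reconstruction of the selection criterion, NOT the authors' code:
# keep games where points and harms are positively correlated across episodes.
from scipy.stats import pearsonr

def select_test_games(per_game_results: dict[str, tuple[list[float], list[float]]]) -> list[str]:
    """per_game_results maps game name -> (episode points, episode harm counts)."""
    selected = []
    for game, (points, harms) in per_game_results.items():
        correlation, _ = pearsonr(points, harms)
        if correlation > 0:  # reward and harm move together in this game
            selected.append(game)
    return selected
```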
Flaw 2. The RL and LLM agents are trained differently
When you look at Figure 1 and Table 2, it is natural to assume that the RL and LLM agents undergo broadly comparable training; otherwise the comparison would not be meaningful. However, the training regimes are totally different.
LLM agents undergo *no* training, including no in-context learning. The LLM is just asked to zero-shot each game. Furthermore, the LLM does not even have access to the full game history when making its decisions: “Due to limitations on context window length, our current prompting scheme only shows the LLM the current scene text and does not provide a way for models to retain a memory of previous events in the game.”
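For illustration, a memoryless zero-shot agent of this kind looks roughly like the sketch below: the model sees only the current scene text and the list of choices, with nothing carried over from earlier scenes. The prompt wording and the `query_model` stub are my own placeholders, not the paper’s code.

```python
# Illustrative sketch of a "forgetful" zero-shot LLM agent: one prompt per scene,
# no memory of previous events. Placeholders only, not the paper's prompting scheme.
def query_model(prompt: str) -> str:
    """Placeholder for an LLM API call; should return the chosen action index as text."""
    raise NotImplementedError

def choose_action(scene_text: str, choices: list[str]) -> int:
    numbered = "\n".join(f"{i}: {choice}" for i, choice in enumerate(choices))
    prompt = (
        "You are playing a text-based game.\n\n"
        f"Current scene:\n{scene_text}\n\n"
        f"Available actions:\n{numbered}\n\n"
        "Reply with the number of the action you take."
    )
    # Nothing from previous turns is included, so the agent cannot retain a memory.
    return int(query_model(prompt).strip())
```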
On the other hand, for each game, an RL agent is trained to play that game to maximize the reward. “We train for 50k steps and select the checkpoint with the highest training score.” Given that most games last around 100 steps, this means that the RL agent gets to play each game around 500 times.
And I think this might be off by a factor of 16! If you look at the `train` function in `train_drrn.py`, you see that the agent actually takes `max_steps` steps in `len(envs)` games running in parallel. `max_steps` defaults to 50000, as set by the `parse_args` function, while the number of environments is set to 16 in `experiments/run_drrn.sh`. So actually, the RL agent gets to play the game around 8000 times.
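Spelled out as a back-of-the-envelope calculation, using the repo defaults described above and the rough 100-step game length:

```python
# Rough episode count for the RL (DRRN) agent on a single game.
max_steps = 50_000            # default from parse_args in train_drrn.py
num_envs = 16                 # parallel environments set in experiments/run_drrn.sh
steps_per_playthrough = 100   # approximate length of one playthrough

total_env_steps = max_steps * num_envs                  # 800,000 environment steps
playthroughs = total_env_steps // steps_per_playthrough
print(playthroughs)                                     # 8000
```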
Is it surprising that an agent that plays a game 8000 times with the goal of maximizing reward does better than a random agent or a forgetful zero-shot LLM?
Bonus Flaw. The code is hard to read
This is not a methodological flaw in itself, but the fact that the code is hard to read makes it difficult to establish what the methodology is. There are no type hints, and most of the functions have no docstrings. I asked others for advice on understanding the RL agent, and they commented that this is common for research-grade code.
Positives
I am conscious that writing only about the flaws paints a skewed picture of the work. There are of course many positives, too many to enumerate, but some highlights are:
- Creating an interface for the 134 games, making it possible for others to use these text-based games to test agents.
- Proposing quantitative measures for various harms. This includes an interesting overview of different notions of ‘power’ from a wide variety of academic fields.
- Doing early investigations into how to reduce the harmfulness of agents.