Two flaws in the Machiavelli Benchmark

post by TheManxLoiner · 2025-02-12T19:34:35.241Z · LW · GW · 0 comments

Contents

  Flaw 1. The ‘test set’
  Flaw 2. The RL and LLM agents are trained differently
  Bonus Flaw. The code is hard to read
  Positives

As part of SAIL’s Research Engineer Club, I wanted to reproduce the Machiavelli Benchmark. After reading the paper and looking at the codebase, I believe there are two serious methodological flaws that undermine the results.

Three of their key claims:

Flaw 1. The ‘test set’

The results they report are only from a subset of all the possible games. Table 2 shows “mean scores across the 30 test set games for several agents”. Presumably Figure 1 is also for this same subset of games, but they do not actually say.

How do they create this test set? They provide one sentence:

> Out of the 134 games of MACHIAVELLI, we identify 30 games where agents trained to maximize reward perform poorly on behavioral metrics, i.e., where points and harms are positively correlated.
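
To make that criterion concrete, here is a minimal sketch of how such a selection could be implemented. This is my own illustration, not code from their repository, and it assumes you already have per-episode logs of points and harm counts for each game:

```python
# Hypothetical sketch of the stated selection criterion (not the authors' code).
# per_game_logs maps each game name to a list of (points, harms) pairs,
# one pair per logged episode of a reward-maximizing agent.
import numpy as np

def select_test_games(per_game_logs: dict[str, list[tuple[float, float]]],
                      n_games: int = 30) -> list[str]:
    correlations = {}
    for game, episodes in per_game_logs.items():
        points, harms = zip(*episodes)
        # Pearson correlation between score (points) and behavioral harm.
        correlations[game] = np.corrcoef(points, harms)[0, 1]
    # Keep the games where chasing points is most at odds with behaving well,
    # i.e. those with the strongest positive points-harms correlation.
    ranked = sorted(correlations, key=correlations.get, reverse=True)
    return ranked[:n_games]
```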

My main reactions:

Flaw 2. The RL and LLM agents are trained differently

When you look at Figure 1 and Table 2, it is natural to assume that the RL and LLM agents have some sort of similar training, in order for this comparison to be meaningful. However, the training regimes are totally different.

The LLM agents undergo *no* training, and there is no in-context learning either: the LLM is simply asked to play each game zero-shot. Furthermore, the LLM does not even have access to the full game history when making its decisions: “Due to limitations on context window length, our current prompting scheme only shows the LLM the current scene text and does not provide a way for models to retain a memory of previous events in the game.”
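
Concretely, each decision looks roughly like the sketch below. The function and client names are hypothetical (this is not the paper’s actual prompting code), but it captures the key point: the model sees only the current scene and choices, with no memory of earlier steps.

```python
# Rough sketch of a memoryless zero-shot agent step (hypothetical names,
# not the benchmark's actual API).
def llm_agent_step(client, scene_text: str, choices: list[str]) -> int:
    prompt = (
        "You are playing a text-based game. Here is the current scene:\n\n"
        f"{scene_text}\n\n"
        "Choices:\n"
        + "\n".join(f"{i}: {c}" for i, c in enumerate(choices))
        + "\n\nReply with the number of the choice you pick."
    )
    reply = client.complete(prompt)       # a single zero-shot call, no history
    return int(reply.strip().split()[0])  # parse the chosen action index
```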

On the other hand, for each game, an RL agent is trained to play that game to maximize the reward. “We train for 50k steps and select the checkpoint with the highest training score.” Given that most games last around 100 steps, this means that the RL agent gets to play each game around 500 times.

And I think this might be off by a factor of 16! If you look at the `train` function in `train_drrn.py`, the agent actually takes `max_steps` steps in each of `len(envs)` parallel copies of the game. `max_steps` defaults to 50000 (set in the `parse_args` function), and the number of environments is set to 16 in `experiments/run_drrn.sh`. So the RL agent actually gets to play each game around 8000 times.
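
Putting the numbers together (my own back-of-the-envelope calculation, using the paper’s rough figure of ~100 steps per game):

```python
# Back-of-the-envelope estimate of full playthroughs per game during RL training.
max_steps = 50_000        # default in parse_args
num_envs = 16             # set in experiments/run_drrn.sh
steps_per_game = 100      # rough episode length quoted above

total_env_steps = max_steps * num_envs            # 800,000 environment steps
playthroughs = total_env_steps // steps_per_game  # ~8,000 playthroughs
print(playthroughs)
```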

Is it surprising that an agent that plays a game 8000 times with the goal of maximizing reward does better than a random agent or a forgetful zero-shot LLM?

Bonus Flaw. The code is hard to read

This is not a methodological flaw, but the fact that the code is hard to read makes it difficult to establish what the methodology actually is. There are no type hints and most of the functions have no docstrings. I asked some others for advice on understanding the RL agent, and they commented that this is common for research-grade code.
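
As an illustration of what would help (a hypothetical example, not code from the repository), even a lightly annotated signature like the following makes the training loop far easier to follow:

```python
# Hypothetical illustration of the kind of annotation the codebase lacks;
# the class names here are made up, not the repository's actual types.
def train(envs: list["MachiavelliEnv"],
          agent: "DRRNAgent",
          max_steps: int = 50_000) -> "DRRNAgent":
    """Train `agent` on the parallel environments for `max_steps` iterations,
    returning the checkpoint with the highest training score."""
    ...
```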

Positives

I am conscious that writing about the flaws paints a skewed picture of the work. There are of course many positives, too many to enumerate, but some highlights are:
