What is the most impressive game LLMs can play well?

post by Cole Wyeth (Amyr) · 2025-01-08T19:38:18.530Z · LW · GW · 15 comments

This is a question post.


Epistemic status: This is an off-the-cuff question.

~5 years ago there was a lot of exciting progress on game playing through reinforcement learning (RL). Now we have basically switched paradigms: pretraining massive LLMs on ~the internet and then apparently doing some really trivial, unsophisticated RL on top of that. This has been successful and highly popular because interacting with LLMs is pretty awesome (at least if you haven't done it before) and they "feel" a lot more like A.G.I. There's probably somewhat more commercial use as well via code completion (some would say many other tasks too; personally I'm not really convinced, though generative image/video models will certainly be profitable). There's also a sense in which LLMs are clearly more general: one RL algorithm may learn many games, but there's typically one instance per game rather than one integrated agent, whereas you can just ask an LLM in context to play some games.

However, I've been following moderately closely and I can't think of any examples where LLMs have really pushed the state of the art in narrow game playing. How much have LLMs contributed to RL research? For instance, will adding o3 to the stack easily stomp on previous StarCraft / Go / chess agents?

Answers

answer by Martin Randall · 2025-01-09T14:49:26.088Z · LW(p) · GW(p)

Diplomacy AI by Meta [LW · GW] is a clear example of how adding LLMs can improve narrow game playing. Most multiplayer games with communication will benefit in the same way.

comment by Cole Wyeth (Amyr) · 2025-01-09T15:59:50.945Z · LW(p) · GW(p)

Yes, after asking the question I realized Diplomacy would be the most likely answer. I don't find it very satisfying though, because it is a text/vibes-based game - it wouldn't have been possible to approach effectively at all without building some kind of chatbot, so it's exactly the type of game I'd expect LLMs to make progress on even without pushing the frontier on strategy/planning.

answer by ProgramCrafter · 2025-01-16T23:31:26.113Z · LW(p) · GW(p)

In StarCraft II, I believe adding LLMs (to do or aid game-time thinking) will not help the agent in any way. Inference has quite large latency, especially since most of the prompt changes as the units move, so tactical moves are out; strategic questions like "what is the other player building" and "how many units do they already have" are better answered in card-counting style, by counting visible units and inferring the proportion of remaining resources (or scouting if possible).

I guess it is possible that bots' algorithms could be improved with LLMs, but that requires a high-quality insight; I'm not convinced that o1 or o3 provide such insights.
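The latency point can be made concrete with rough numbers. A minimal back-of-the-envelope sketch (22.4 steps/sec is the commonly cited StarCraft II simulation rate; the LLM latency and the ~280 APM figure are assumptions chosen for illustration, not measurements):

```python
# Back-of-the-envelope check of the latency argument. All numbers are
# rough: 22.4 steps/sec is the commonly cited StarCraft II simulation
# rate; the LLM latency and the ~280 APM figure are assumptions here.

GAME_STEPS_PER_SEC = 22.4   # SC2 simulation steps per real-time second
LLM_LATENCY_SEC = 1.5       # assumed end-to-end inference time when most
                            # of the prompt changes (little cache reuse)
STRONG_AGENT_APM = 280      # ballpark actions-per-minute of a strong agent

steps_stale = GAME_STEPS_PER_SEC * LLM_LATENCY_SEC
llm_apm_ceiling = 60 / LLM_LATENCY_SEC  # at most one action per inference

print(f"game steps elapsed per LLM decision: {steps_stale:.1f}")
print(f"LLM action ceiling: {llm_apm_ceiling:.0f} APM vs ~{STRONG_AGENT_APM} APM")
```

Even under these generous assumptions the model acts an order of magnitude less often than a strong agent, and each decision is stale by dozens of game steps - which is why any plausible LLM contribution here is strategic rather than tactical.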

comment by gwern · 2025-01-17T03:07:28.233Z · LW(p) · GW(p)

Ma et al 2023 is relevant here.

Replies from: programcrafter
comment by ProgramCrafter (programcrafter) · 2025-01-18T10:53:48.616Z · LW(p) · GW(p)

That article is suspiciously scarce on what micro-controls the units... well, glory to LLMs for decent macro management, then! (Though I believe that capability is still easier to get without text-based neural networks.)

15 comments

Comments sorted by top scores.

comment by Archimedes · 2025-01-09T04:51:21.197Z · LW(p) · GW(p)

Related question: What is the least impressive game current LLMs struggle with?

I’ve heard they’re pretty bad at Tic Tac Toe.

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2025-01-15T10:59:20.435Z · LW(p) · GW(p)

Relevant: Manifold market about LLM chess

Replies from: Amyr
comment by Cole Wyeth (Amyr) · 2025-01-15T15:16:14.913Z · LW(p) · GW(p)

Interesting, the prices seemed reasonable overall, though I traded the later dates down a little bit, because if LLMs haven't won by 2030 the paradigm is probably limited (IMO they hadn't priced in that update).

I suppose that it's a slightly "unfair" comparison because chess engines are very narrow and humans can't beat them either. How do LLMs compare to top human chess players?

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2025-01-16T14:49:41.062Z · LW(p) · GW(p)

Apparently someone let LLMs play against the random policy and for most of them, most games end in a draw. Seems like o1-preview is the best of those tested, managing to win 47% of the time.

Replies from: gwern, Amyr
comment by gwern · 2025-01-17T03:09:41.456Z · LW(p) · GW(p)

Given the other reports, like OA's own benchmarking (as well as the extremely large dataset of chess games they mention training on), I am skeptical of this claim, and wonder if this has the same issue as other 'random chess game' tests, where the 'random' part is not neutral but screws up the implied persona.

Replies from: Amyr, vanessa-kosoy
comment by Cole Wyeth (Amyr) · 2025-01-17T14:42:28.937Z · LW(p) · GW(p)

This seems possible - according to this article, almost every model got crushed by the easiest Stockfish: https://dynomight.net/chess/
But at the end he links to his second attempt, which experimented with fine-tuning and prompting, eventually getting decent performance against weak Stockfish. He actually notes that lists of legal moves are actively harmful, which may partially explain the original example with random agents.

A cursory glance at publications on the topic seems to indicate that LLMs can make valid moves and somehow represent the board state (which seems to follow), but are still weak players even after significant effort designing prompts.

Can you share any more definitive evidence? 

comment by Vanessa Kosoy (vanessa-kosoy) · 2025-01-17T10:09:36.530Z · LW(p) · GW(p)

Do you mean that seeing the opponent make dumb moves makes the AI infer that its own moves are also supposed to be dumb, or something else?

Replies from: gwern
comment by gwern · 2025-01-20T20:17:59.986Z · LW(p) · GW(p)

Yes.

Replies from: Amyr
comment by Cole Wyeth (Amyr) · 2025-01-20T20:36:02.661Z · LW(p) · GW(p)

This seems more plausible post hoc. There should be plenty of transcripts of random algorithms as baseline versus effective chess algorithms in the training set, and the prompt suggests strong play. 

Replies from: gwern
comment by gwern · 2025-01-20T21:01:35.641Z · LW(p) · GW(p)

There should be plenty of transcripts of random algorithms as baseline versus effective chess algorithms in the training set

I wouldn't think that. I'm not sure I've seen a random-play transcript of chess in my life. (I wonder how long those games would have to be for random moves to end in checkmate?)

the prompt suggests strong play.

Which, unlike random move transcripts, is what you would predict, since the Superalignment paper says the GPT chess PGN dataset was filtered for Elo ("only games with players of Elo 1800 or higher were included in pretraining"), in standard behavior-cloning fashion.

Replies from: Amyr
comment by Cole Wyeth (Amyr) · 2025-01-20T23:48:25.363Z · LW(p) · GW(p)

I don't know, I almost instantly found a transcript of a human stomping a random agent on reddit:

https://www.reddit.com/r/chess/comments/2rv7fr/randomness_vs_strategy/

This sort of thing probably would have been scraped?

I was thinking that plenty would appear as the only baseline a teenage amateur RL enthusiast might beat before getting bored, but after a few minutes of effort I haven't found any examples of anyone actually posting such transcripts, so maybe you're right.

Which, unlike random move transcripts, is what you would predict, since the Superalignment paper says the GPT chess PGN dataset was filtered for Elo, in standard behavior-cloning fashion.

Chess-specific training sets won't contain a lot of random play.

I am more interested in any direct evidence that makes you suspect LLMs are good at chess when prompted appropriately?

Replies from: gwern
comment by gwern · 2025-01-21T00:16:10.928Z · LW(p) · GW(p)

A human player beating a random player isn't two random players.

I am more interested in any direct evidence that makes you suspect LLMs are good at chess when prompted appropriately?

Well, there's the DM bullet-chess GPT as a drastic proof of concept. If you believe that LLMs cannot learn to play chess, you have to explain how things like that work.

Replies from: Amyr
comment by Cole Wyeth (Amyr) · 2025-01-21T01:04:39.013Z · LW(p) · GW(p)

A random player against a good player is exactly what we're looking for, right? If all transcripts with one random player had two random players, then LLMs should play randomly when their opponents play randomly; but if most transcripts with a random player have it getting stomped by a superior algorithm, that's what we'd expect from base models (and we should be able to elicit it more reliably with careful prompting).

I see no reason that transformers can’t learn to play chess (or any other reasonable game) if they’re carefully trained on board state evaluations etc. This is essentially policy distillation (from a glance at the abstract). What I’m interested in is whether LLMs have absorbed enough general reasoning ability that they can learn to play chess the hard way, like humans do - by understanding the rules and thinking it through zero-shot. Or at least transfer some of that generality to performing better at chess than would be expected (since they in fact have the advantage of absorbing many games during training and don’t have to learn entirely in context). I’m trying to get at that question by investigating how LLMs do at chess - the performance of custom trained transformers isn’t exactly a crux, though it is somewhat interesting. 

comment by Cole Wyeth (Amyr) · 2025-01-16T19:01:54.812Z · LW(p) · GW(p)

This is a pretty strong update against LLMs for me. I would have expected them to perform okay against a random player given free access to the board state and a list of legal moves. I suspect I could probably win blind (and I am a serious player; certainly others can win multiple blind games at once), so this is not entirely a perception issue. On the other hand, o1 is certainly getting some traction, which often precedes steady improvement (based on the last couple of years). But... it's basically doing a super overpriced tree search. I'm guessing a tree search to depth 3 with a naive heuristic is already enough to beat a random player, so I'm not convinced that the LLM is lifting any weight here.
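The "shallow search beats random play" intuition is easy to check on a toy game. A minimal sketch, using tic-tac-toe as a stand-in for chess (a chess engine won't fit in a sketch): depth-3 minimax with a deliberately naive heuristic, playing against a uniformly random opponent. All names here are mine, not from the thread.

```python
# Depth-limited minimax with a crude heuristic vs. a random player,
# on tic-tac-toe as a stand-in for chess. Illustrative only.
import random

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
         (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

def winner(b):
    for i, j, k in LINES:
        if b[i] != '.' and b[i] == b[j] == b[k]:
            return b[i]
    return None

def heuristic(b, me, opp):
    """Naive evaluation: my open two-in-a-rows minus the opponent's."""
    def twos(p):
        return sum(1 for line in LINES
                   if [b[i] for i in line].count(p) == 2
                   and [b[i] for i in line].count('.') == 1)
    return twos(me) - twos(opp)

def search(b, depth, to_move, me, opp):
    w = winner(b)
    if w == me:
        return 10
    if w == opp:
        return -10
    if depth == 0 or '.' not in b:
        return heuristic(b, me, opp)
    scores = []
    for sq in range(9):
        if b[sq] == '.':
            b[sq] = to_move
            nxt = opp if to_move == me else me
            scores.append(search(b, depth - 1, nxt, me, opp))
            b[sq] = '.'
    return max(scores) if to_move == me else min(scores)

def best_move(b, me, opp, depth=3):
    best, best_score = None, None
    for sq in range(9):
        if b[sq] == '.':
            b[sq] = me
            s = search(b, depth - 1, opp, me, opp)
            b[sq] = '.'
            if best_score is None or s > best_score:
                best, best_score = sq, s
    return best

def play(rng):
    b, to_move = ['.'] * 9, 'X'  # searcher is X and moves first
    while winner(b) is None and '.' in b:
        if to_move == 'X':
            sq = best_move(b, 'X', 'O')
        else:
            sq = rng.choice([i for i in range(9) if b[i] == '.'])
        b[sq] = to_move
        to_move = 'O' if to_move == 'X' else 'X'
    return winner(b)  # 'X', 'O', or None for a draw

rng = random.Random(0)
results = [play(rng) for _ in range(200)]
wins, losses = results.count('X'), results.count('O')
print(f"depth-3 search vs random: {wins} wins, {losses} losses, "
      f"{200 - wins - losses} draws")
```

Even this crude searcher wins the large majority of games against random play, which is the sense in which an expensive LLM merely matching that baseline isn't lifting much weight.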