GPT-4 is bad at strategic thinking

post by Christopher King (christopher-king) · 2023-03-27T15:11:47.448Z · LW · GW · 8 comments

Contents

9 comments

GPT-4 is known to pretty good at chess (see I played chess against ChatGPT-4 and lost! for one example). However, GPT-4 does not seem to be very good at strategic reasoning in general (it only really can do it if there is a greedy search algorithm).

I tried Hex and Connect4, it failed at both despite being able to explain the rules and even display the board with ASCII art. I was wondering if maybe it just has bad spatial reasoning, so I tried puzzles in natural language based on logical constraints. It failed these as well unless they were quite simple.

I even made a variant of chess up on the spot where the goal is to get any piece to the bank rank instead of capturing the King. It didn't stop me from "sacking" my queen by moving it to the bank rank as soon as their was a gap. So if it has an internal model of chess, it didn't figure out how to apply it to new objectives.

So I think GPT-4 must've learned a rudimentary chess engine; it is not applying general strategic reasoning to chess.

This doesn't necessarily mean GPT-4 can't be agentic, but it does suggest it is either a narrow one or a dumb one (or it's hiding its abilities).

8 comments

Comments sorted by top scores.

comment by roystgnr · 2023-03-27T19:30:42.402Z · LW(p) · GW(p)

It's hard to apply general strategic reasoning to anything in a single forward pass, isn't it?  If your LLM has to come up with an answer that begins with the next token, you'd better hope the next token is right.  IIRC this is the popular explanation for why LLM output seems to be so much better when you just add something like "Let's think step by step" to the prompt.

Is anyone trying to incorporate this effect into LLM training yet?  Add an "I'm thinking" and an "I'm done thinking" to the output token set, and only have the main "predict the next token in a way that matches the training data" loss function grade on tokens that aren't in between those brackets.  Then when you hit "What is 45235 + 259719? 304954" in the training set, optimization doesn't have to discourage multi-step reasoning to reproduce that, because "<thinking>5+9=14, so we carry the 1" ... "</thinking>304954" is still worth just as much as an ex nihilo "304954".  Chess algorithms could do a brief tree search before outputting their final decision.

Add whatever regularization is needed to keep the "train of thought" in English rather than in whatever-cypher-the-optimizer-hits-on, and this would be an increase in safety, not just in capabilities.  The more "internal reasoning" is human-readable text rather than maybe-a-posteriori-interpretable activation patterns, the better.  You could even expose it to the end user: ask ChatGPT a question and you get a succinct answer, click on the "expose thoughts" button and you get the chain of support for the answer. 

Replies from: caleb-biddulph
comment by CBiddulph (caleb-biddulph) · 2023-03-28T04:50:07.856Z · LW(p) · GW(p)

This is also basically an idea I had - I actually made a system design and started coding it, but haven't made much progress due to lack of motivation... Seems like it should work, though

comment by harfe · 2023-03-27T15:31:59.828Z · LW(p) · GW(p)

Certain kinds of "thinking ahead" is difficult to do within 1 forward pass. Not impossible, and GPT-4 likely does a lot of thinking ahead within 1 forward pass.

If you have lots of training data on a game, you often can do well without thinking ahead much. But for a novel game, you have to mentally simulate a lot of options how the game could continue. For example, in Connect4, if you consider all your moves and all possible responses, these are 49 possible game states you need to consider. But with experience in this game, you learn to only consider a few of these 49 options.

Maybe this is a reason why GPT-4 is not so good when playing mostly novel strategy games.

comment by nem · 2023-03-27T20:16:13.509Z · LW(p) · GW(p)

Small nitpick with the vocabulary here. There is a difference between 'strategic' and 'tactical', which is particularly poignant in chess. Tactics is basically your ability to calculate and figure out puzzles. Finding a mate in 5 would be tactical. Strategy relates to things too big to calculate. For instance, creating certain pawn structures that you suspect will give you an advantage in a wide variety of likely scenarios, or placing a bishop in such a way that an opponent must play more defensively.

I wasn't really sure which you were referring to here; it seems that you simply mean that GPT isn't very good at playing strategy games in general; ie it's bad at strategy AND tactics. My guess is that GPT is actually far better at strategy; it might have an okay understanding of what board state looks good and bad, but no consistent ability to run any sort of minimax to find a good move, even one turn ahead.

Replies from: christopher-king
comment by Christopher King (christopher-king) · 2023-03-27T20:31:06.710Z · LW(p) · GW(p)

It didn't even seem to understand what the goals of any of the games were, despite being able to explain it in natural language. So it wasn't even at a point I could test a strategy v.s. tactics distinction.

Replies from: nem
comment by nem · 2023-03-27T20:40:53.642Z · LW(p) · GW(p)

Ha, no kidding. Honestly, it can't even play chess. I just tried to play it, and asked it to draw the board state after each move. It started breaking on move 3, and deleted its own king. I guess I win? Here was its last output.

For my move, I'll play Kxf8:

8  r n b q . b . .
7  p p p p . p p p
6  . . . . . n . .
5  . . . . p . . .
4  . . . . . . . .
3  . P . . . . . .
2  P . P P P P P P
1  R N . Q K B N R    
     a b c d e f g h

Replies from: christopher-king
comment by Christopher King (christopher-king) · 2023-03-27T20:56:48.280Z · LW(p) · GW(p)

Apparently GPT-4 is only good at chess if it tell it not to explain anything (or show the board as it turns out). This also suggests that the chess part is separate from the rest.

comment by WilliamKiely · 2023-03-27T23:19:44.362Z · LW(p) · GW(p)

Feedback on the title: I don't like the title because it is binary.

Saying X is "good" or "bad" at something isn't very informative.

There are many degrees of goodness. Was it worse than you thought it would be before you played around with it a bit more? Was it worse than some popular article or tweet made you think? Was it worse than some relevant standard?