Comments

Comment by Yoel Cabo (yoel-cabo) on Recent AI model progress feels mostly like bullshit · 2025-04-07T12:17:47.263Z · LW · GW

I think the "number of actions" axis is key here.

This post explains it well: https://www.lesswrong.com/posts/HyD3khBjnBhvsp8Gb/so-how-well-is-claude-playing-pokemon. I’ve been watching Claude Plays Pokémon and chatting with the Twitch folks. The post matches my experience.

There’s plenty of room for improvement in prompt design and tooling, but Claude is still far behind the performance of my 7-year-old (unfair comparison, I know). So I agree with OP that this is an excellent benchmark to watch:

  • It’s not saturated yet.
  • It tests core capabilities for agentic behavior. If an AI can’t beat Pokémon, it can’t replace a programmer.
  • It gives a clear qualitative feel for competence—just watch five minutes.
  • It’s non-specialized and anyone can evaluate it and have a shared understanding (unlike Cursor, which requires coding experience).

And once LLMs beat Pokémon Red, I'll personally want to see them beat other games as well, to make sure the agentic capabilities are generalizing.