zchuang's Shortform
post by zchuang · 2025-03-18T10:29:25.634Z · LW · GW · 3 comments
comment by zchuang · 2025-03-18T10:29:25.633Z · LW(p) · GW(p)
The "Claude plays Pokémon" discourse makes me feel out of touch, because I don't understand why people didn't already make this update with chess-playing LLMs. If anything, I think data contamination is worse for Pokémon because of GameFAQs and other guides online.
I think people are confusing benchmarks with trailheads. Maths benchmarks are important because they theoretically speed up AI research and map onto useful intelligence. Claude playing Pokémon doesn't tell you much because it maps onto neither intelligence nor generalisation.
Replies from: Vladimir_Nesov, oliver-daniels-koch
↑ comment by Vladimir_Nesov · 2025-03-18T14:57:36.107Z · LW(p) · GW(p)
It's a good unhobbling eval, because it's a task that should be easy for current frontier LLMs at the System 1 level, and they only fail because some basic memory/adaptation faculties that humans have are outright missing in AIs right now. No longer failing will be a milestone in no longer obviously missing such features (assuming the improvements are to general management of very long context, and not something overly eval-specific).
↑ comment by Oliver Daniels (oliver-daniels-koch) · 2025-03-18T12:41:37.940Z · LW(p) · GW(p)
It maps onto "coherence over long time horizons" / agency. I suspect contamination is much worse for chess (so many direct examples of board configurations and moves).