Chess - "Elo" of random play?

post by Shankar Sivarajan (shankar-sivarajan) · 2025-05-07T02:18:18.498Z · LW · GW · 10 comments

This is a question post.

I'm interested in a measure of chess-playing ability that doesn't depend on human players, and while perfect play would be the ideal reference, as long as chess remains unsolved, the other end of the spectrum, the engine whose algorithm is "list all legal moves and pick one of them uniformly at random," seems the natural choice.

 I read that the formula for Elo rating is scaled so that, with some assumptions of transitivity of winning odds, the expected score of player A against player B is 1/(1 + 10^((R_B - R_A)/400)), so it's trivial to convert a win probability to an Elo rating, and my question is roughly equivalent to "What is the probability of victory of random play against, say, Stockfish 17?" If the Elo of random play is close to 0[1], that makes the probability around 1/(1 + 10^(3600/400)) ≈ 10^-9 (estimating Stockfish 17's Elo to be 3600). Eyeballing the y-intercept of this plot of lc0's Elo vs. number of games of self-play, it looks something like 150–300 (lots of uncertainty). Does that sound reasonable?
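
For reference, this is the conversion the estimate above relies on, as a minimal Python sketch (the 3600 figure for Stockfish 17 and the resulting ~10^-9 are the rough estimates from the paragraph above, not measured values):

```python
import math

def expected_score(elo_a: float, elo_b: float) -> float:
    """Expected score of A against B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400.0))

def elo_gap(p_win: float) -> float:
    """Elo difference (A minus B) implied by A's win probability, ignoring draws."""
    return -400.0 * math.log10(1.0 / p_win - 1.0)

print(expected_score(0, 3600))  # ~1e-9: Elo-0 player vs. a 3600-rated engine
print(elo_gap(1e-9))            # ~-3600: a 1e-9 win probability puts you ~3600 Elo below
```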

I understand the probability of victory against the best modern engines is probably too small to measure directly, so you would have to construct weaker/noisier engines and step down in increments of a few hundred Elo. Has anyone done this?

  1. ^

    I don't see an a priori reason to expect it would be.

Answers

answer by winstonBosan · 2025-05-07T14:27:44.158Z · LW(p) · GW(p)

TL;DR: random play is roughly 477 Elo. (See details and caveats below.)

See the link below, where someone matches unusual chess algorithms against various dilutions of Stockfish (Stockfish at level n with x% random moves mixed in). This Elo is, of course, relative to those Stockfish dilutions at the time of recording.
https://www.youtube.com/watch?v=DpXy041BIlA
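
A rough sketch of what such a "diluted" engine might look like, using the python-chess library and assuming a local Stockfish binary on the PATH (the video's exact setup may well differ):

```python
import random
import chess
import chess.engine

def diluted_move(board: chess.Board, engine: chess.engine.SimpleEngine,
                 random_fraction: float) -> chess.Move:
    """With probability random_fraction, play a uniformly random legal move;
    otherwise play the engine's choice."""
    if random.random() < random_fraction:
        return random.choice(list(board.legal_moves))
    return engine.play(board, chess.engine.Limit(time=0.05)).move

# Usage sketch: 30%-diluted Stockfish (White) against pure random play (Black).
engine = chess.engine.SimpleEngine.popen_uci("stockfish")
board = chess.Board()
while not board.is_game_over():
    if board.turn == chess.WHITE:
        board.push(diluted_move(board, engine, random_fraction=0.3))
    else:
        board.push(random.choice(list(board.legal_moves)))
print(board.result())
engine.quit()
```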

comment by Shankar Sivarajan (shankar-sivarajan) · 2025-05-07T14:43:05.358Z · LW(p) · GW(p)

various dilutions of Stockfish

This is precisely what I was looking for! Thanks. (I was actually imagining various amounts of noise being added to the weights of the evaluation neural net, but this is probably close enough.)

comment by Veedrac · 2025-05-08T18:55:14.788Z · LW(p) · GW(p)

As a near-limiting case, imagine an engine that flips a coin before the game begins. On heads, it plays the game as Stockfish; on tails, it plays the game as Worstfish. What is this engine's Elo?

I'm short of time to go into details, but this should help illustrate why one should be careful about treating Elo as a well-defined space rather than as a local approximation that's empirically useful for computationally limited players.

answer by polytope · 2025-05-08T12:41:04.623Z · LW(p) · GW(p)

One thing that's worth keeping in mind with exercises like this is that while you can do this in various ways and get some answers, the answers you get may depend nontrivially on how you construct the intermediate ladder of opponents. 

For example, attempts to calibrate human and computer Elo rating scales often do place top computers around the 3500ish area, and one of the other answers, via a particular ladder of intermediates, puts random play at 400-500 Elo given that. But there are also human players who are genuinely rated 400-500 Elo on servers whose Elo ratings are approximately transitively self-consistent within that server. These players can still play chess: they know how the pieces move, and they can spot captures and execute them vastly better than chance. I would not be surprised to see such a player consistently destroy a uniform random chess player. Random play is really, really bad. So there's a good chance we would see a significant nonlinearity/nontransitivity in Elo ratings here, such that there isn't any one consistent rating we can assign to random play relative to Stockfish.

A good way of framing this conceptually is to say that Elo is NOT a fundamental truth about reality; rather, it's an imperfect model that we humans invented, one that, depending on the situation, approximates the underlying reality anywhere from poorly to okay to amazingly well.

In particular, the Elo model makes a very strong "linearity-like" assumption: if A beats B with expected odds a:b, and B beats C with expected odds b:c, then A will beat C with expected odds of precisely a:c (where a draw counts as half a point for each player, i.e. mathematically equivalent in expectation to resolving every draw by a fair coin flip). Given the way ratings are defined from there, this linearity in odds implies that the expected score between two players follows precisely a sigmoid function f(x) = 1/(1+exp(-x)) of their rating difference, up to constant scaling.

Almost any real situation will violate these assumptions at least a little (and even mathematically ideal artificial games will violate them, e.g. a game where each player has a fixed mean and variance and they compete by sampling from their different Gaussians to see whose number is higher violates this assumption!). But in many cases of skill-based competition this works quite well, and there are various ways to justify and explain why this approximation does work pretty well when it does!
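
To make the Gaussian example concrete, here is a small simulation sketch (Python; the particular means, shared variance, and sample size are arbitrary illustrative choices), showing the transitivity assumption failing even in this idealized game:

```python
import math
import random

def win_prob(mu_a: float, mu_b: float, sigma: float = 1.0, n: int = 1_000_000) -> float:
    """Empirical P(A beats B) when each player samples a Gaussian 'performance'."""
    return sum(random.gauss(mu_a, sigma) > random.gauss(mu_b, sigma) for _ in range(n)) / n

def elo_gap(p: float) -> float:
    """Elo difference implied by a win probability under the Elo (logistic) model."""
    return 400.0 * math.log10(p / (1.0 - p))

p_ab = win_prob(1.0, 0.0)   # A one unit above B
p_bc = win_prob(0.0, -1.0)  # B one unit above C
p_ac = win_prob(1.0, -1.0)  # A two units above C
print(elo_gap(p_ab) + elo_gap(p_bc))  # A-C gap predicted by Elo transitivity (~400)
print(elo_gap(p_ac))                  # A-C gap measured directly (~430: Gaussian tails are thinner than logistic)
```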

But even in games/domains where Elo does approximate realistic player pools amazingly well, it quite commonly stops doing as well at the extremes. For example, two common cases where this happens are:

  • When the pool of players (particularly bots) being compared consists entirely of players relatively close to optimal
  • When the pool of players being compared covers an extremely wide range of skill levels

The first case can happen when the near-optimal players have persistent tendencies in the mistakes they still make, as well as sharp preferences for particular lines. Then you no longer have law-of-large-numbers effects (too few mistakes per game) and no Poisson-like smoothness in the arrival rate of mistakes (mistakes aren't well modeled as having an "arrival rate" if they're sufficiently consistent for a given line x bot combination), and the Elo model simply stops being a good model of reality. I've seen this empirically on the 9x9 computer Go server (9x9 "CGOS"), with a bot equilibrating at one point a couple hundred Elo lower than a no-longer-running bot that it should have been head-to-head equal to or stronger than, due to different transitive opponents.

The second case, the one relevant here, can happen because there's no particular reason to expect that a game's tails will precisely match those of the sigmoid function f(x) = 1/(1+exp(-x)) for expected score in the extreme. Depending on the actual tails between pairs of players at increasingly large rating differences, particularly whether they tend to be thinner or heavier than exp(-x) in given conditions, when you try to measure large rating differences via many transitive steps of intermediate opponents, you will get different answers depending on the composition of those intermediate players and on how many and how big the steps you take are.

It's not surprising when models that are just useful approximations of reality (i.e. "the map, not the territory") start breaking down at extremes. It can still be worthwhile doing things like this to build intuition, or even just for fun, to see what numbers you get! While doing so, my personal tendency would still be to emphasize that at the extremes of questions like "what is the Elo of perfect play" or "what is the Elo of random play", the numbers you get can have a lot to do with your models and methodologies rather than reflecting an underlying reality accurately.

answer by Lao Mein · 2025-05-07T12:32:36.852Z · LW(p) · GW(p)

Consider the relatively simple two-rook endgame.

White has 2 rooks and makes random moves. Black only has a king. We'll ignore the white king for now, but I'm pretty sure it makes things worse for White.

Each rook has 14 possible moves, for a total of 28 rook moves. One of those 28 progresses towards a winning state, one regresses it, and 3 blunder a rook.

The Markov chain for this converges to a rook blunder being ~435x more likely than a checkmate. If Black tries to hunt down the rooks, this gets even worse.

Thus, even an impossibly big advantage against Stockfish is still extremely unlikely to convert into a win.

I don't think the other endgames are massively different - the number of possible moves and blunder-moves are roughly the same, although progression is more likely to be completely lost to a random move.

My question is: does this imply that a double-rook endgame is more likely to checkmate after losing a rook than before? 
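
A direct way to sanity-check numbers like these without working out the Markov chain by hand is a quick Monte Carlo sketch; the one below (Python with the python-chess library) makes both sides play uniformly at random and compares the two-rook and single-rook setups. Note that it includes the white king and lets Black also move randomly, so it is a simplification, not exactly the model described above:

```python
import random
import chess

def simulate(start_fen: str, games: int = 2_000, max_plies: int = 400) -> dict:
    """Both sides play uniformly random legal moves; tally how games end."""
    outcomes = {"white_mates": 0, "rook_lost_first": 0, "other": 0}
    for _ in range(games):
        board = chess.Board(start_fen)
        start_rooks = len(board.pieces(chess.ROOK, chess.WHITE))
        result = "other"
        for _ in range(max_plies):
            if board.is_checkmate():
                result = "white_mates"  # only White has mating material here
                break
            if board.is_game_over():
                break  # stalemate, repetition, 75-move rule, etc.
            if len(board.pieces(chess.ROOK, chess.WHITE)) < start_rooks:
                result = "rook_lost_first"
                break
            board.push(random.choice(list(board.legal_moves)))
        outcomes[result] += 1
    return outcomes

# King and two rooks vs. lone king, then king and one rook, for comparison.
print(simulate("4k3/8/8/8/8/8/8/R3K2R w - - 0 1"))
print(simulate("4k3/8/8/8/8/8/8/R3K3 w - - 0 1"))
```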

comment by Lao Mein (derpherpize) · 2025-05-12T01:11:23.306Z · LW(p) · GW(p)

Turns out, the answer to my question is a clear "no". The white king has up to 7 moves, and the rook 14. In about half of the positions, the black king is in diagonal contact with the white rook. This means that 4 of the 14 rook moves and all but 2 king moves blunder the rook. In addition, White requires up to 16 moves to reach checkmate. The only positive factor is the slightly lower number of moves (~20 vs. ~32), but a much higher proportion of moves in the single-rook endgame is a blunder for White, and that more than cancels out any upside.

answer by Omne · 2025-05-07T05:02:39.409Z · LW(p) · GW(p)

 Has anyone done this?

Yes. Computer chess enthusiasts have created a rating list known as CCRL that covers chess engines ranging from the strongest ones we have to somewhat simple amateur-developed chess engines. Most of the engines on that list are open-source.

Two catches:

  • The weakest engines on the list still exceed random play by a substantial margin.
    • Implementing some chess engines to fill that gap won't be difficult, since libraries are available for move generation, and you don't need efficient search techniques if you're trying to make a weak chess engine.
  • CCRL games are typically played from openings that give a small advantage (usually around +1.0) to White, rather than directly from the starting position. Two games are played for each opening, with the engines switching colors between the two games. Unequal openings are used because, in high-level engine play, games from the starting position result in too many draws.

If you have questions about basic chess programming, please feel free to ask me. I recently started work on a chess engine (as a personal project) and thus am familiar with some of the concepts.

10 comments

comment by cosmobobak · 2025-05-07T10:17:10.103Z · LW(p) · GW(p)

Speaking as someone who works on a very strong chess program (much stronger than AlphaZero, a good chunk weaker than Stockfish): random play is incredibly weak. There are likely a double-digit number of +400 Elo / 95% win-rate jumps to be made between random play and anything resembling play that is "actually trying to win".

The more germane point to your question, however, is that Chess is a draw. From the starting position, top programs will draw each other. The answer to the question "What is the probability of victory of random play against Stockfish 17?" is bounded from above by the answer to the question "What is the probability of victory of Stockfish 17 against Stockfish 17?" - and the answer to the second question is that it is actually very low - I would say less than 1%.

This is why all modern Chess engine competitions use unbalanced opening books. Positions harvested from either random generation or human opening databases are filtered to early plies where there is engine-consensus that the ratio p(advantaged side wins) : p(disadvantaged side draws) is as close to even as possible (which cashes out to an evaluation as close to +1.00 as possible). Games between the two players are then "paired" - one game is played with Player 1 as the advantaged side and Player 2 as the disadvantaged side, then the game is played again with the colours swapped. Games are never played in isolation, only in pairs (for fairness).
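
For concreteness, the pairing scheme cashes out to something like the following sketch, where play_game(white, black, fen) is a hypothetical helper standing in for however the games are actually run, returning 1.0 for a white win, 0.5 for a draw, and 0.0 for a black win (the code assumes White is the advantaged side in the opening FEN):

```python
def pair_score(player_1, player_2, opening_fen: str, play_game) -> float:
    """Score one colour-swapped pair of games from a single unbalanced opening.
    Returns player_1's total score out of 2."""
    score = play_game(player_1, player_2, opening_fen)         # player_1 on the advantaged side
    score += 1.0 - play_game(player_2, player_1, opening_fen)  # colours swapped
    return score
```

Strength differences are then estimated from pair scores aggregated over many openings, rather than from single games.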

In this revised game - "Game-Pair Sampled Unbalanced Opening Chess", we can actually detect differences in strength between programs.

I'm not sure how helpful this is to your goal of constructing effective measures for strength, but I felt it would be useful to explain the state of the art.

Replies from: shankar-sivarajan
comment by Shankar Sivarajan (shankar-sivarajan) · 2025-05-07T13:55:13.072Z · LW(p) · GW(p)

Thanks. Yeah, I guess chess has been (weakly) solved, and that means you need a more powerful technique for probing differences. Follow-up: around what rating do engines gain the ability to force a draw from the starting position? (I understand this will only be a heuristic for the real question of "which engines possess an optimal strategy from the standard starting position?")

Replies from: cosmobobak, tslarm
comment by cosmobobak · 2025-05-08T10:36:20.482Z · LW(p) · GW(p)

My guess is somewhere in the 3200-3400 range, but this isn't something I've experimented with in detail.

comment by tslarm · 2025-05-07T14:07:51.762Z · LW(p) · GW(p)

Forgive the nitpick, but I think the standard definition of "weakly solved" requires known-optimal strategies from the starting position, which don't exist for chess. It's still not known for sure that chess is a draw -- it just looks very likely.

Replies from: shankar-sivarajan
comment by Shankar Sivarajan (shankar-sivarajan) · 2025-05-07T15:37:52.511Z · LW(p) · GW(p)

Agreed. This is a deliberate abuse of mathematical terminology, substituting any notion of "proof" with "looks true from experimental evidence." 

comment by Yair Halberstadt (yair-halberstadt) · 2025-05-07T04:32:11.854Z · LW(p) · GW(p)

Stockfish is incredibly strong at exploiting small mistakes. I'm going to assume that, on average, if you play anything other than one of the top 5 moves at any point in a game, Stockfish will win no matter what you do afterwards.

An average game is about 40 turns, and there are about 20 legal moves each turn.

So that puts an upper limit on the probability of success of 1 in 4^40.

Similarly, if you pick the best move at all times, you'll win, putting a lower limit at 1 in 20^40.

Making some assumptions about how many best moves you need to counteract a poor but not fatal move, you could try to estimate something more accurate in this range.
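
For a sense of scale, here is the arithmetic for those bounds as a quick sketch (the 40-turn, 20-moves-per-turn, top-5 numbers are the rough assumptions above, and draws are ignored):

```python
import math

upper = 0.25 ** 40  # probability of staying within the top 5 of ~20 moves every turn
lower = 0.05 ** 40  # probability of playing the single best move every turn

def elo_deficit(p: float) -> float:
    """Elo deficit implied by win probability p under the Elo model, ignoring draws."""
    return 400.0 * math.log10(1.0 / p - 1.0)

print(upper, elo_deficit(upper))  # ~8e-25 -> a deficit of roughly 9,600 Elo
print(lower, elo_deficit(lower))  # ~9e-53 -> a deficit of roughly 20,800 Elo
```

Both figures are far larger than the ~3600-Elo gap assumed in the question, which is itself a hint (as other answers note) that the Elo model's tails shouldn't be trusted this far out.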

Replies from: yair-halberstadt
comment by Yair Halberstadt (yair-halberstadt) · 2025-05-07T04:58:31.217Z · LW(p) · GW(p)

If you want a more accurate estimate of how often top chess engines pick the theoretical best move, you could compare Leela Chess Zero and Stockfish. These are very close to each other Elo-wise but have very different engines and styles of play. So you could look at how often they agree on the best move, assume that each picks its moves from some distribution over the true move ranking, and then use the agreement rate to estimate the parameters of that distribution.
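
Measuring the agreement rate itself is straightforward with the python-chess engine interface; a minimal sketch, assuming local "stockfish" and "lc0" UCI binaries and a PGN file of sample games (the file name and time limit are illustrative):

```python
import chess
import chess.engine
import chess.pgn

sf = chess.engine.SimpleEngine.popen_uci("stockfish")
lc0 = chess.engine.SimpleEngine.popen_uci("lc0")

agree = total = 0
with open("sample_games.pgn") as f:
    while (game := chess.pgn.read_game(f)) is not None:
        board = game.board()
        for move in game.mainline_moves():
            # Ask each engine for its preferred move in the current position.
            best_sf = sf.play(board, chess.engine.Limit(time=0.1)).move
            best_lc0 = lc0.play(board, chess.engine.Limit(time=0.1)).move
            agree += best_sf == best_lc0
            total += 1
            board.push(move)

sf.quit()
lc0.quit()
print(f"Agreement rate: {agree / total:.1%}")
```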

comment by brambleboy · 2025-05-07T05:32:10.764Z · LW(p) · GW(p)

A measure of chess ability that doesn't depend on human players is average centipawn loss: how much worse the player's moves are than the engine's moves when evaluated. (This measure depends on the engine used, of course.)
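
For what it's worth, here is a rough sketch of how such a number is typically computed, using the python-chess engine interface (the engine path, search depth, and white-only scoring are illustrative choices):

```python
import chess
import chess.engine

def average_centipawn_loss(uci_moves, engine_path: str = "stockfish", depth: int = 15) -> float:
    """Rough average centipawn loss for White over a game given as UCI move strings."""
    engine = chess.engine.SimpleEngine.popen_uci(engine_path)
    board = chess.Board()
    losses = []
    limit = chess.engine.Limit(depth=depth)
    for uci in uci_moves:
        if board.turn == chess.WHITE:
            # Engine's evaluation before and after White's actual move, from White's view.
            before = engine.analyse(board, limit)["score"].white().score(mate_score=10_000)
            board.push(chess.Move.from_uci(uci))
            after = engine.analyse(board, limit)["score"].white().score(mate_score=10_000)
            losses.append(max(0, before - after))
        else:
            board.push(chess.Move.from_uci(uci))
    engine.quit()
    return sum(losses) / len(losses)
```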

Replies from: shankar-sivarajan
comment by Shankar Sivarajan (shankar-sivarajan) · 2025-05-07T15:35:52.891Z · LW(p) · GW(p)

Thanks. This looks like a measure based on "perfect play" as a reference, with the strongest engine you can find serving as a proxy for it. I expect it works well in the typical range of human play, but I think it'll be a poor metric at the extreme ends of the range, near random and at close-to (and definitely at better-than) engine level.