Beyond ELO: Rethinking Chess Skill as a Multidimensional Random Variable

post by Oliver Oswald (oliver-oswald) · 2025-02-10T19:19:36.233Z · LW · GW · 6 comments

Contents

  Introduction
  The Limitations of a 1D Metric
6 comments

Introduction

The traditional ELO rating system reduces a player's ability to a single scalar value E, from which win probabilities are computed via a logistic function of the rating difference. While pragmatic, this one-dimensional approach may obscure the rich, multifaceted nature of chess skill. For instance, factors such as tactical creativity, psychological resilience, opening mastery, and endgame proficiency could interact in complex ways that a single number cannot capture.

I’m interested in exploring whether modeling a player’s ability as a vector

θ = (θ_1, θ_2, …, θ_n),

with each component representing a distinct skill dimension, can yield more accurate predictions of match outcomes. I tried asking ChatGPT for a detailed answer on this idea, but frankly its responses weren't very helpful.

The Limitations of a 1D Metric

The standard ELO system computes the win probability for two players A and B as a function of the scalar difference E_A − E_B, typically via:

P(A beats B) = σ(α(E_A − E_B)),

where σ(x) = 1/(1 + e^(−x)) is the logistic function and α is a scaling parameter. This model assumes that all relevant aspects of chess performance are captured by E. Yet, consider two players with equal ELO ratings: one might excel in tactical positions but falter in long, strategic endgames, while the other might exhibit a more balanced but less spectacular play style. Their match outcomes could differ significantly depending on the nuances of a particular game, nuances that a one-dimensional rating might not capture.
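For concreteness, here is a minimal sketch of this scalar model in Python (taking α = ln(10)/400, which reproduces the conventional Elo curve where a 400-point gap corresponds to roughly 10:1 odds; any other α just rescales the rating axis):

```python
import math

def elo_win_prob(e_a: float, e_b: float, alpha: float = math.log(10) / 400) -> float:
    """Probability that A beats B under the scalar logistic model.

    alpha = ln(10)/400 matches the conventional Elo convention, in which
    a 400-point rating gap gives ~10:1 expected odds.
    """
    return 1.0 / (1.0 + math.exp(-alpha * (e_a - e_b)))

# Equal ratings give a 50% prediction; a 400-point gap gives ~90.9%.
print(elo_win_prob(1500, 1500))  # 0.5
print(elo_win_prob(1900, 1500))  # ~0.909
```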

A natural extension is to represent each player P's skill by a vector θ_P = (θ_{P,1}, …, θ_{P,n}), where each θ_{P,i} corresponds to a distinct skill (e.g., tactics, endgame, openings). One might model the probability of player A beating player B as:

P(A beats B) = σ(α⟨θ_A − θ_B, w⟩),

where ⟨⋅,⋅⟩ denotes the dot product and w is a weight vector representing the relative importance of each skill dimension.
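A minimal sketch of this vector model, with entirely made-up skill vectors and weights (the dimension labels are illustrative, not a claim about the right decomposition):

```python
import math

def vector_win_prob(theta_a, theta_b, w, alpha=1.0):
    """P(A beats B) = sigmoid(alpha * <theta_A - theta_B, w>)."""
    diff = sum(wi * (a - b) for a, b, wi in zip(theta_a, theta_b, w))
    return 1.0 / (1.0 + math.exp(-alpha * diff))

# Hypothetical numbers: A is the stronger tactician, B the stronger endgame player.
theta_a = [1.8, 0.9, 1.2]  # tactics, endgame, openings
theta_b = [1.2, 1.5, 1.2]
w = [0.5, 0.3, 0.2]        # assumed relative importance of each dimension
print(vector_win_prob(theta_a, theta_b, w))  # ~0.53: A slightly favored
```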

I'm interested in opening the discussion: has anyone developed or encountered multidimensional models for competitive games that could be adapted for chess? How might techniques from psychometrics - e.g. Item Response Theory (IRT) - inform the construction of these models?
Considering the typical chess data (wins, draws, losses, and perhaps even in-game evaluations), is there a realistic pathway to disentangling multiple dimensions of ability? What metrics or validation strategies would best demonstrate that a multidimensional model provides superior predictive performance compared to the traditional ELO system?
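On the validation question: one simple strategy is to compare held-out predictive log-loss (or Brier score) between the scalar and vector models; the model that assigns higher probability to what actually happened scores lower. A toy sketch of the scoring side (the predictions and outcomes below are invented, purely to show the computation):

```python
import math

def log_loss(probs, outcomes):
    """Mean negative log-likelihood of binary outcomes (1 = player A won)."""
    eps = 1e-12  # guard against log(0)
    return -sum(
        y * math.log(max(p, eps)) + (1 - y) * math.log(max(1 - p, eps))
        for p, y in zip(probs, outcomes)
    ) / len(outcomes)

# Hypothetical held-out games: scalar-Elo predictions vs. vector-model predictions.
outcomes  = [1, 0, 1, 1, 0]
elo_preds = [0.55, 0.50, 0.60, 0.52, 0.48]
vec_preds = [0.70, 0.35, 0.66, 0.58, 0.40]
print(log_loss(elo_preds, outcomes), log_loss(vec_preds, outcomes))
# Whichever model has the lower log-loss predicted the held-out games better.
```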

Ultimately my aim here is to build chess betting models ... lol, but I think the statistics are really cool too. Any insights on probabilistic or computational techniques that might help in this endeavor would be highly appreciated.

Thank you for your time and input.

6 comments

Comments sorted by top scores.

comment by polytope · 2025-02-10T21:28:01.909Z · LW(p) · GW(p)

I might be misunderstanding, but it looks to me like your proposed extension is essentially just the Elo model with some degrees of freedom that don't yet appear to matter? 

The dot product has the property that ⟨θ_A − θ_B, w⟩ = ⟨θ_A, w⟩ − ⟨θ_B, w⟩, so the only thing that matters is ⟨θ_P, w⟩ for each player P, which is just a single scalar. So we are on a one-dimensional scale again, where predictions are based on taking a sigmoid of the difference between a single scalar associated with each player.

As far as I can tell, the way that such a model could still be a nontrivial extension of Elo would be if you posited w could vary between games, whether randomly from some distribution or whether via additional parameters associated to players that influence what w is in the games they are involved in, or other things like that. But it seems you would need something like that, or else some source of nonlinearity, because if w is constant then every dimension orthogonal to that fixed w can never have any effect on predictions by the model.
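For concreteness, this collapse is easy to check numerically: with a fixed w, any two players whose skill vectors have the same projection onto w are indistinguishable to the model, no matter how different their profiles are (toy numbers below):

```python
def project(theta, w):
    """The scalar <theta, w> that is all a fixed-w linear model can see."""
    return sum(t * wi for t, wi in zip(theta, w))

w = [0.5, 0.3, 0.2]
# Two very different skill profiles...
theta_1 = [2.0, 0.0, 0.0]   # pure tactician
theta_2 = [0.0, 2.0, 2.0]   # endgame/openings specialist
# ...with identical projections onto the fixed w:
print(project(theta_1, w), project(theta_2, w))  # 1.0 and 1.0
# A fixed-w model therefore makes identical predictions for these two
# players against any opponent: only the scalar <theta, w> ever matters.
```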

comment by p.b. · 2025-02-10T21:13:41.004Z · LW(p) · GW(p)

ELO is the Electric Light Orchestra. The Elo rating is named after Prof. Arpad Elo. 

I considered the idea of representing players via vectors in different contexts (chess, soccer, MMA) and also worked a bit on splitting the evaluation of moves into "quality" and "risk taking", with the idea of quantifying aggression in chess. 

My impression is that the single scalar rating works really well in chess, so I'm not sure how much there is beyond that. However, some simple experiments in that direction wouldn't be too difficult to set up. 

Also, I think there were competitions on creating better rating systems that outperform Elo's predictiveness (which apparently isn't too difficult). But I don't know whether any of those were multi-dimensional. 

comment by md665 · 2025-02-10T20:16:27.400Z · LW(p) · GW(p)

You might want to have a look at Microsoft's TrueSkill, an Elo-like rating system for online team games. It was a good early answer (there are probably newer and better ones) to the question of how to rank an individual when teamed randomly together with others. 

comment by notfnofn · 2025-02-10T19:41:20.741Z · LW(p) · GW(p)

If this actually hasn't been explored, this is a really cool idea! So you want to learn a function (Player 1, Player 2, position) -> (probability Player 1 wins, probability of a draw)? Sounds like there are a lot of naive architectures to try and you have a ton of data since professional chess players play a lot of games.

Some random ideas:

  • Before doing any sort of positional analysis: What does the (ELO_1,ELO_2,engine eval) -> Probability of win/draw function look like? What happens when choosing an engine near those ELO ratings vs. the strongest engines?
  • Observing how rapidly the eval changes when given to a weak engine might give a somewhat automatable metric of the "sharpness" of a chess position (so you don't have to label everything yourself)

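One way to operationalize the "sharpness" idea in the last bullet (a toy sketch; the eval sequences here are made up, and in practice they would come from an engine probing the position at a capped depth or skill level):

```python
def sharpness(evals):
    """Mean absolute change between successive evaluations (in pawns).

    Large swings when a position is probed by a weak or shallow engine
    suggest a tactically sharp position; a flat sequence suggests a
    quiet one.
    """
    return sum(abs(b - a) for a, b in zip(evals, evals[1:])) / (len(evals) - 1)

quiet_position = [0.2, 0.3, 0.25, 0.3, 0.2]   # hypothetical eval sequence
sharp_position = [0.2, -1.5, 1.1, -0.8, 2.0]
print(sharpness(quiet_position), sharpness(sharp_position))
# The sharp position shows much larger eval swings per probe.
```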
comment by lsusr · 2025-02-11T02:08:34.095Z · LW(p) · GW(p)

Welcome to the realm of the posters!

Elos are already multidimensional, in a sense, because players have different ratings on different platforms. Hikaru Nakamura, for example, has a higher Elo on chess.com than FIDE. But that's just nitpicky pedantry; I understand that what you're really asking is whether chess ability for a specific version of chess has subskills.

Among chess players and chess teachers, it is common (as you note) to break chess ability into three subskills:

  • Openings
  • Midgame
  • Endgame

Openings these days are mostly about memorizing solved lines. Openings are so well-solved and memorization-dependent that they can be boring to top players. This boredom is a force behind the popularity of Fischer Random among top players.

Magnus Carlsen (currently the world's best chess player) is famous for his endgame ability. He's less interested in openings (especially classical openings) these days. It's not uncommon for him to open with something stupid, like switching his king and queen, and then still beat a grandmaster.

Could you use these subskills to predict competition results? Absolutely. If you were placing extremely precise bets on the outcomes of games, then you shouldn't just consider Elo. In classical games, you should also consider how much time each player has prepared for the tournament by studying opening lines. You can extrapolate Elo trend lines too.

The reason nobody uses a breakdown this fine for competitions isn't because it wouldn't generate a small additional signal. It's because nobody has a strong enough motivation to. There aren't billions of dollars getting bet on chess competition results. Elo is perfectly adequate when you're choosing whom to invite to a chess tournament, and it's extremely legible.

If you are putting that much effort into predicting outcomes, it may be cheaper to just bribe players. Bribery is especially cheap for chess variants with smaller prize pools, like Xiangqi.

That said, there does exist an organization that analyzes chess skill at extremely fine resolution: chess.com. But not to predict winners. Instead, chess.com analyzes player skill at resolutions finer than Elo in order to detect cheating. Chess cheaters often exhibit telltale signs where their skill spikes really hard within a game; this signal is not detectable with mere Elo.