Seeking feedback on "MAD Chairs: A new tool to evaluate AI"

post by Chris Santos-Lang (chris-santos-lang) · 2025-04-02T03:04:43.182Z

This is a link post for https://arxiv.org/abs/2503.20986v2


I have written a paper about a previously undiscussed fundamental game (in the same family as the Coordination game and the Prisoner's Dilemma) that Chris Homan and I call "MAD Chairs". It is relevant to AI safety because MAD Chairs is the game an AI would play while displacing humanity at the top of a caste system. We evaluated current LLMs and found both that they play MAD Chairs differently from one another and that none would maintain grandmaster status. Anecdotal evidence suggests that humans normally play caste strategies in real-world manifestations of MAD Chairs, so aligning AI to our current norms for MAD Chairs (i.e., having AI treat us as poorly as we treat each other) would be unsafe for us. The main result of the paper is a proof that caste strategies are suboptimal, and thus that aligning AI to sustainable grandmaster strategies, at least in real-world manifestations of MAD Chairs, would be safer.
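For readers less familiar with the "fundamental games" named above, here is a minimal sketch of what a payoff matrix for such a game looks like. The values are the textbook payoffs for the Prisoner's Dilemma and a pure Coordination game, not anything from the MAD Chairs paper; the paper's own payoff formulation is at the arXiv link.

```python
# Illustrative only: textbook payoff matrices for two classic fundamental
# games. payoff[(row_action, col_action)] = (row_payoff, col_payoff).
PRISONERS_DILEMMA = {
    ("cooperate", "cooperate"): (3, 3),
    ("cooperate", "defect"):    (0, 5),
    ("defect",    "cooperate"): (5, 0),
    ("defect",    "defect"):    (1, 1),
}

COORDINATION = {
    ("left",  "left"):  (2, 2),
    ("left",  "right"): (0, 0),
    ("right", "left"):  (0, 0),
    ("right", "right"): (2, 2),
}

def best_response(game, opponent_action):
    """Row player's best reply to a fixed column action."""
    actions = {row for row, _ in game}
    return max(actions, key=lambda a: game[(a, opponent_action)][0])

# Defection dominates in the Prisoner's Dilemma, whatever the opponent does:
assert best_response(PRISONERS_DILEMMA, "cooperate") == "defect"
assert best_response(PRISONERS_DILEMMA, "defect") == "defect"
# In a Coordination game, the best reply is simply to match:
assert best_response(COORDINATION, "left") == "left"
```

Each such game distills one strategic tension (conflict between individual and collective interest, or the need to converge on a convention); the paper's claim is that MAD Chairs distills a distinct tension not captured by the games above.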

I will present the paper at an AAMAS workshop in Detroit on May 20, after which it will be revised to incorporate the feedback received there. If anyone in the LessWrong community has insight to contribute but will not attend the workshop, please share it with me before May 20, and I will do my best to bring it to the workshop. A preprint is available at the link above (https://arxiv.org/abs/2503.20986v2).
