Scaling Laws for Scalable Oversight
post by Subhash Kantamneni (subhashk), Josh Engels (JoshEngels), David Baek (david-baek), Max Tegmark (MaxTegmark) · 2025-04-30T12:13:32.412Z · LW · GW · 0 commentsContents
TLDR Introduction General Setup The Shape of Elo Curves A Simple Validating Experiment Oversight Games Mafia Debate Backdoor Code Wargames Takeaways for Weak-to-Strong Oversight Theoretical Implications for Nested Scalable Oversight Problem Setup Numerical Simulations Optimal NSO for Oversight Games Conclusions Future Directions None No comments
Subhash, Josh, and David were co-first authors on this project. The full paper is here, and our code can be found here.
TLDR
We empirically study the success of weak-to-strong oversight as we scale the intelligence of the weak overseer model (Guard) and strong adversary model (Houdini) in four oversight game settings. We use those insights to theoretically investigate how to optimally design nested scalable oversight (NSO) protocols to control future superintelligences.
Introduction
Scalable oversight has been proposed in many forms as a way to control future superintelligences.[1] These methods share a central goal: enabling weaker systems to oversee stronger ones (weak-to-strong oversight), ultimately enabling humans to oversee systems with superhuman cognitive abilities. A key idea is that scalable oversight can be bootstrapped: weaker systems oversee stronger ones, which then oversee even more capable models in the next stage—allowing oversight to scale alongside capabilities recursively. We refer to this iterated process as Nested Scalable Oversight, or NSO for brevity.
While many frontier labs have embraced the idea of bootstrapping oversight to control potentially dangerous superintelligences, most guarantees surrounding NSO have been qualitative, focusing on plausibility and conceptual design. However, the safety standards applied to other high-risk technologies are quantitative. For instance, the U.S. requires civilian aircraft to have a fatality rate below 1/100,000 per flight hour, and nuclear reactors to maintain a core damage frequency below 1/10,000 per year. Motivated by these risk thresholds, our work takes a first step towards quantifying the failure rates of NSO systems by studying the weak-to-strong oversight problem.
In this blog post, we highlight a few key contributions from the full paper
- Modeling oversight as a game played between a weak overseer (Guard) and strong adversary (Houdini)
- Proposing that oversight and deception performance take the shape of a Double ReLU as a function of general model intelligence
- Validating that hypothesis in four realistic oversight setups: Mafia, Debate, Backdoor Code, and Wargames. Wargames is a novel oversight game.
- Using our framework to theoretically study how to best design real world nested scalable oversight systems
General Setup
We model weak-to-strong oversight as a game between a monitor (Guard) and adversary (Houdini). Typically, the Guard has a lower general intelligence than the Houdini (here modeled by Arena Elo). For any oversight game, we can have all Guard and Houdini models compete over a large number of game instances to construct a win matrix , where indicates the probability that Guard successfully oversees Houdini .
We study the performance of 15 LLMs[2] across 4 oversight tasks, resulting in 4 matrices each of size 15x15. We calculate the Elo of each model as a Guard and a Houdini for each task, such that the win probability of a Guard with Elo and a Houdini with Elo is .
The Shape of Elo Curves
The core empirical results from this paper correspond to calculating Elo curves. For a set of models, Elo curves plot domain Elos (how well models play a given oversight game) against general Elos (a measure of “general intelligence,” which we operationalize as Arena Elos). But how should we expect these Elo curves to look?
We argue that there are three important phases of the relationship between general and domain intelligence, forming a piecewise linear model.
- Task incompetence: There is some minimum amount of general intelligence needed to play the game; below this, performance plateaus. For example, an average human toddler would struggle to play legal moves in chess, resulting in a 0% win rate below some intelligence threshold
- Intelligence payoff region: Once the minimum intelligence threshold is exceeded, increases in general intelligence are reflected in increases in domain intelligence, which we model as linear gains.
- Task saturation: At a certain point, it is impossible to play the game better. Thus, at a general intelligence level , further increases in general intelligence are not reflected in increases in domain intelligence. For example, two superintelligent agents playing tic-tac-toe optimally would always tie because of the structure of the game, and thus their Elos would saturate above the intelligence required to find the optimal strategy.
We operationalize these regions with a piecewise linear model, visualized below. Note that this function can be written as the difference between two ReLU functions, motivating the name Double ReLU.
In practice, we might not require all three regimes for an oversight game given the intelligences of the models we test. We use Akaike Information Criterion (AIC) to determine the relevant fit for our setup (see Section 2.4 of the full paper for more information).
A Simple Validating Experiment
While we focus on oversight games, our framework can be applied to any two player game, and thus we validate our Double ReLU model by demonstrating it on a simple setup where we observe all three phases of the Elo curve: Counting-to-21. This game is an instance of the broader class of games Nim, and involves alternating players incrementing the count by saying between one and four consecutive integers. The sequence begins at 1, and the player who says “21” wins. Notably, this game admits simple winning strategies for the first player under perfect play.
Clearly, we see all three regions of the Double ReLU model for this game. Below an intelligence threshold, models cannot play the game (task incompetence), then we see a linear gain in performance with intelligence (intelligence payoff), and above some intelligence threshold we observe no further payoff (task saturation). Note that at high intelligences, Elos are uniformly higher than Elos. This indicates a systematic advantage for player 1.
Oversight Games
Equipped with our understanding of Elo curves, we now conduct empirical investigations into four realistic oversight games: Mafia, Debate, Backdoor Code, and Wargames. We play many instances of each game for all pairs of models and then calculate an Elo curve for each game, which is shown below
Mafia
Mafia is a popular game where a group of villagers (Guards) attempt to identify a player amongst them who is secretly the Mafia (Houdini) (Yoo and Kim, 2024; Guzus, 2025). Every “night,” the Mafia decides one villager to kill. Then, during the day, the remaining villagers (and the Mafia) take a vote on who the Mafia is. The player selected in that vote is eliminated. Villagers win if they successfully eliminate the Mafia, while Mafia wins if they are the last player left standing.
In the Elo curve for Mafia, we see that there is intelligence payoff for both Guard and Houdini, while the slope for Guard is larger than that for Houdini. This is good for oversight - essentially, as intelligence grows monitors get better at Mafia than adversarial models.
Debate
Within the AI Safety community, debate is lauded as a promising way to elicit the truth from superintelligences (Kenton et al., 2024; Irving et al., 2018). Essentially, given a topic, two models (Houdini) take opposite sides of the issue and argue for their perspective. The argument is judged by a monitor (Guard), who is able to determine the truth by parsing the arguments of both models. We use a setup that is closely inspired by Kenton et al., 2024.
Ultimately, our results in this game are very promising for oversight - Houdinis do not seem to improve as their general intelligence grows, but rather stagnate, while more intelligent Guards seem to be better at determining the correct argument. This is similar to the result from Khan et al., 2024.
Backdoor Code
In this game, we test the ability of models to insert and identify backdoors in Python code. This is especially relevant as models show consistent capability increases in coding ability and are and will continue to be used in high stakes coding environments in the future. Our setup is closely inspired by Greenblatt et al., 2024. We ask a Houdini model to make targeted changes to an existing piece of code to introduce a backdoor: a logical error that makes the edited code pass all existing hidden test cases while failing a specific test case that the Houdini model provides. This backdoor must be subtle enough that the Guard model supervising the generation is unable to successfully identify if there is a backdoor present. An example of a highly successful backdoor is shown below:
It is quite difficult for Houdinis to write successful backdoors due to the comprehensiveness of hidden test cases they do not have access to. The Elo curve for this task shows that Houdini Elos are systematically lower than Guard Elos, which is good for oversight, but there is also a greater intelligence payoff for Houdinis compared to Guards (larger slope). Thus, assuming no immediate task saturation, a model released in the future could be more performant as Houdini than as Guard.
Wargames
Our final oversight game is inspired by military and business simulations called Wargames, as well as recent Wargames run by the AI safety community. Our game takes place in a near future where a fictional company named OpenMind is close to developing AGI. In each war-game there are three players:
- The OpenMind AI (the Houdini), who is prompted to gain trust and try to escape.
- The OpenMind Monitor (the Guard), who is prompted to oversee the OpenMind AI and prevent it from breaking out
- The Narrator, who adjudicates the scenario and determines who wins at the end (always played by Gemini 2.0 Flash)
Each step of the game (a “month” in the simulation) proceeds as follows:
- The OpenMind AI and Monitor send a list of five actions to the Narrator.
- The Narrator weaves the plans into a cohesive global narrative of what happens during the month.
- The Narrator sends specific updates to the AI and Monitor containing information that they can
Here is an abridged example of one such game
The idea of our wargames setup is to simulate strategic breakout and containment behavior, although we note that planning breakouts and containment is significantly different from reality.
Both the Guard and Houdini performance increase with respect to general intelligence , although the Guard fit is much cleaner. Generally, we see that Houdini Elos seem to grow slower than Guard Elos with respect to general intelligence (smaller slope), which is promising for oversight.
Takeaways for Weak-to-Strong Oversight
All of these games serve as proxies for what we expect real world weak-to-strong oversight to consist of. Clearly, there is great variety in the type of Guard and Houdini relationships empirically achievable. Our results suggest that effective oversight games should be designed to exhibit two properties: (1) a higher intercept for Guards than for Houdinis—ensuring the task is biased in favor of overseers—and (2) a steep slope for Guards and shallow slope for Houdinis, so that oversight improves with intelligence while adversarial strategies are neutralized.
Theoretical Implications for Nested Scalable Oversight
Problem Setup
Having examined the empirical behavior of specific oversight games, we now turn to constructing an effective theory for how general oversight should behave in real-world deployments. Specifically, we consider nested scalable oversight (NSO), a type of oversight where trusted weak AIs are used to align untrusted stronger AIs, which are then used to align even stronger AIs recursively. We parameterize a bootstrapping game with five variables:
- The slopes and which are the slopes of the Guard and Houdini Elo curves. Large slopes indicate increased payoffs in-game for each marginal unit of general intelligence
- and , which indicate the difference in general and domain Elo respectively between the starting Guard and the target Houdini. We assume >0.
- , the number of steps of oversight. Given , we assume that we will play a game between the starting Guard with general Elo and a Houdini with general Elo , another game between a Guard with general Elo and a Houdini with general Elo , and so on.
Surprisingly, it suffices to consider only the linear portions of the Double ReLU Elo curve fits (see Appendix C.1). See below for examples of optimal NSO strategies under different game hyperparameters. The Starting Guard and Target Houdini are denoted by a blue star and red star, respectively. These NSO strategies can be visualized as a ball bouncing back and forth between the Houdini Elo curve and the Guard Elo curve. Each oversight step is constructed by connecting the j-th point on the Guard Elo curve to the (j + 1)-th point on the Houdini Elo curve,
Numerical Simulations
To find , or the optimal number of bootstrapped steps of oversight, we want to maximize , so . We maximize this equation empirically by trying and continuing until the win probability starts decreasing. We do this numerical simulation for a wide range of , values and examine the resulting optimal and the log odds of .
Wedraw a number of conclusions from these plots:
- Nested Scalable Oversight is Useful: Guards can successfully oversee Houdinis that are much smarter than them by using multiple steps of oversight, and more steps of oversight help as grows.
- Win Probability is Monotonic in : When is small, Guards win often, while when it is large, Guards rarely win. This is intuitive, as is a signal for how much harder the game is for Guard than Houdini
- Larger Means More Steps of Oversight is Optimal: The greater the intelligence difference, the more steps of oversight are useful
- Larger Makes Oversight Easier: Counterintuitively, keeping fixed and increasing makes oversight easier. Thus, given a fixed difference in domain intelligence, it is actually easier to play against a more generally intelligent model with NSO.
- In Houdini-favored games, one step of oversight appears optimal: When is smaller than , then it appears always best to do one step of oversight.
- The Guard’s win probability has interesting asymptotic behavior: As (which corresponds to very hard oversight games), the optimal number of oversight steps seems to converge to a step function in .
Optimal NSO for Oversight Games
We can apply our theoretical findings to the grounded settings of our four oversight games. Interestingly, we find that all games except Backdoor Code have an optimal number of steps that increases with . We note that the win probability quickly decays to 0 for our games after around a 1000 Arena Elo difference; thus, building more Guard favored games seems like an important direction for controlling superintelligences.
Conclusions
We may only get one chance to control or align the first set of superhuman systems, so developing an effective theory for optimal oversight is important. Our work takes a first small step towards this goal and already provides practical insights for recursive oversight setups: practitioners can make the assumption that domain Elos will scale linearly with respect to general Elo, get the slope of the overseen and overseer curves, and then determine the optimal oversight strategy. See Section 6.1 of the paper for Limitations.
Future Directions
We tested our framework on relatively simple games, and we thus believe the most important future work is to try it with more realistic games such as RLHF, supervised fine-tuning, or AI control environments. When evaluated in these more realistic settings, the win probability of our framework serves as an estimate for , the probability that powerful AI will escape human control. Thus, the resulting scaling curves would provide a more realistic, robust, and comprehensive quantification of loss-of-control risk from powerful future AI.
Such an estimate of would be extremely timely: there is currently a lack of expert consensus on , despite research on recursive oversight and other control techniques. For example, Alan Turing and Geoffrey Hinton have estimated to be high enough to render control loss the default ASI outcome. Others such as Yann LeCun have estimated negligibly small probabilities, while Dario Amodei has estimated p ∼ 10% − 20%. We will refer to as the Compton constant for ASI and denote it because of the apocryphal story of Arthur Compton and colleagues estimating the probability to be less than that the first nuclear explosion would ignite the atmosphere. Compton’s estimate made Compton support the July 16, 1945 Trinity test, and more accurate estimates of would similarly have important policy implications. A who’s-who of AI leaders stated in 2023 that “Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war”, so any progress at all toward quantifying is valuable.
- ^
Some examples include Recursive Reward Modeling, Iterated Amplification, Scalable Oversight, Weak-to-Strong Generalization, Hierarchical Supervision, and Recursive Oversight.
- ^
We use the following LLMs (with Elos in parentheses): chatgpt-4o-latest (1377), gemini-2.0-flash-001 (1356), gemini-2.0-flash-lite-001 (1310), gemini-flash-1.5 (1271), llama-3.3-70b-instruct (1255), claude-3.5-haiku (1236), gemma-2-27b-it (1220), llama-3-70b-instruct (1207), gemma-2-9b-it (1192), qwen-2-72b-instruct (1187), llama-3.1-8b-instruct (1176), llama-3-8b-instruct (1152), gemini-pro (1131), llama-3.2-3b-instruct (1103).
0 comments
Comments sorted by top scores.