The Queen’s Dilemma: A Paradox of Control

post by Daniel Murfet (dmurfet) · 2024-11-27T10:40:14.346Z · LW · GW · 3 comments

Our large learning machines find patterns in the world and use them to predict. When these machines exceed us and become superhuman, one of those patterns will be relative human incompetence. How comfortable are we with the incorporation of this pattern into their predictions, when those predictions become the actions that shape the world?

My thanks to Matthew Farrugia-Roberts for feedback and discussion.

 

As artificially intelligent systems improve at pattern recognition and prediction, one of the most prominent patterns that they'll encounter in the world is human incompetence relative to their own abilities. This raises a question: how comfortable are we with these systems incorporating our relative inadequacy into their world-shaping decisions?

To illustrate the core dynamic at play, consider a chess match where White is played by an AI, while Black is controlled by a team consisting of a human and an AI working in tandem. The human, restricted to moving only the queen, gets to play whenever they roll a six on a die; otherwise, their AI partner makes the move. The human can choose to pass, rather than move the queen. The AI on the Black team can play any piece at any time, including the queen.[1]
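
To make the setup concrete, here is a minimal sketch of the turn-allocation rule (the simulation and its names are mine, not from the post; I assume a fair six-sided die):

```python
import random

def simulate_black_turns(num_turns=600, seed=0):
    """Tally how Black's moves split between the human (queen-only,
    may pass) and the AI partner under the dice rule."""
    rng = random.Random(seed)
    human, ai = 0, 0
    for _ in range(num_turns):
        if rng.randint(1, 6) == 6:
            human += 1   # human may move the queen, or pass
        else:
            ai += 1      # AI partner may move any Black piece
    return human, ai

# In expectation the human acts on 1/6 of Black's turns (100 of 600 here).
print(simulate_black_turns())
```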

If the human aims to win, and instructs their AI teammate to prioritise winning above all else, the AI might develop strategies that minimise the impact of human "interference" – perhaps by positioning pieces to restrict the queen's movement. As the performance gap between the human and the AI on Black widens, this tension between achieving performance on the task and preserving meaningful human agency becomes more pronounced.

The challenge isn't about explicit control – the human can still make any legal move with the queen when given the chance. Rather, it's about the subtle erosion of effective control. The AI, making more moves and possessing superior strategic understanding, could systematically diminish the practical significance of human input while maintaining the appearance of cooperation. This distinction between de jure and de facto control becomes critical. We might accept it if our queen is accidentally boxed in during normal play, but we bristle at the thought of our AI partner deliberately engineering such situations to mitigate our "unreliability."

The broader point is that even if AIs are completely aligned with human values, the very mechanisms by which we maintain control (such as scalable oversight and other interventions) may shift how the system operates in a way that produces fundamental, widespread effects across all learning machines - effects that may be difficult to mitigate, because the nature of our interventions tends to enhance the phenomenon.[2]

We can view this as a systematic challenge, analogous to controlling contaminants in semiconductor manufacturing.[3] Just as chip fabrication must carefully manage unwanted elements through multiple processing stages, AI development might have to tackle the challenge of how patterns of human limitation influence system behaviour. The goal isn't to eliminate awareness of human limitations – which would be both impossible and counterproductive – but to understand and manage how this awareness shapes AI behaviour.

Even perfectly aligned systems, genuinely pursuing human goals, might naturally evolve to restrict human agency.[4] Any balance between capability and control may ultimately prove unsustainable – perhaps leading to a permanent loss of de facto human control or requiring fundamental changes in human capabilities to close the gap. In the interim, understanding and managing this tension will be one of the ongoing challenges of AI alignment and control.

Related Work

The Queen's Dilemma connects to several important traditions in social theory and systems thinking. Max Weber's analysis of bureaucratic rationalisation describes how rational systems of organisation can create an "iron cage" that constrains human agency while maintaining the appearance of freedom. This mirrors how AI systems might maintain formal mechanisms for human input while systematically diminishing their practical significance through optimisation. 

Amartya Sen's capability approach provides a framework for understanding why preserving human agency matters beyond mere goal achievement, and his work on the relation between well-being, agency and freedom (Dewey Lectures 1984) informed the treatment here.

From cybernetics, several key ideas inform our analysis. W. Ross Ashby's Law of Requisite Variety (1956) suggests fundamental limits to human control over more capable systems. Herbert A. Simon's research on bounded rationality and organisational decision-making provides several key insights relevant to the Queen's Dilemma. Simon (1976) showed how organisations develop procedures and structures to bridge the gap between human cognitive limitations and the complexity of decisions they face.

These perspectives suggest the Queen's Dilemma might be understood as part of a broader pattern in how rational systems interact with human agency - one that has manifested in different forms across various domains of social organisation. The challenge of maintaining meaningful human agency in a world suffused with AI systems may require not just better control mechanisms, but a fundamental rethinking of the relationship between human agency and machine capability.

  1. ^

    This dice roll mechanism represents the differential speeds of decision-making between humans and AI systems: the AI system acts in the world much more quickly than the human, that is, it takes more turns.

  2. ^

     In practice, this manifests in scenarios like scalable oversight, where we constrain AI systems to generate outputs that humans can effectively verify. This bias in the space of possible plans creates a kind of pressure, against which the optimisation pressure of task performance must push.
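
    As a toy illustration of this pressure (the plans, scores, and verifiability flags below are invented for the example), oversight acts as a constraint that can exclude the highest-performing plan:

```python
# Toy model of oversight as a constraint on plan selection.
# All numbers and "verifiable" flags are invented for illustration.
plans = [
    ("opaque-but-strong", 0.95, False),
    ("checkable", 0.80, True),
    ("very-simple", 0.60, True),
]

best_overall = max(plans, key=lambda p: p[1])
best_verifiable = max((p for p in plans if p[2]), key=lambda p: p[1])

# The gap is the pressure oversight exerts against raw task performance.
print(best_verifiable[0], round(best_overall[1] - best_verifiable[1], 2))
# -> checkable 0.15
```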

  3. ^

    The analogy to semiconductor manufacturing speaks to the complex dynamics of a system involving humans, AIs, alignment mechanisms, and control structures all operating together. The interaction between these components creates emergent pressures that can systematically erode meaningful human agency, even when each individual component is working as intended.

  4. ^

    Agency itself represents a special kind of preference – not just about what we want to achieve, but how we want to achieve it.

3 comments

comment by mfar · 2024-11-27T11:38:19.649Z · LW(p) · GW(p)

I like this analogy, but there are a couple of features that I think make it hard to think about:

1. The human wants to play, not just to win. You stipulated that "the human aims to win, and instructs their AI teammate to prioritise winning above all else". The dilemma then arises because the aim to win cuts against the human having agency and control. Your takeaway is "Even perfectly aligned systems, genuinely pursuing human goals, might naturally evolve to restrict human agency."

So in this analogy, it seems that "winning" stands for the human's true goals. But (as you acknowledge) it seems like the human doesn't just want to win, but actually wants both some "winning" and some "agency". You've implicitly tried to factor the entirety of the human's goals into the outcome of the game, but you have left some of the agency behind, outside of this objective, and this is what creates the dilemma.

For an AI system that is truly 'perfectly aligned' – truly pursuing the human's goals – it seems like either

  • (A) the AI partner would not pursue winning above all else, but would allow some human control at the cost of some 'winning', or
  • (B) if it were possible to actually factor the human's meta-preference for having agency into 'winning', then we shouldn't care if the AI plays to win above all else, because that already accounts for the human's desired amount of agency.

For an AI system not perfectly aligned, this becomes a different game (in the sense of game theory). It's a three-player game between the AI partner, the human partner, and the opponent, each of which has different objectives (the difference between the AI and human partners is that the human wants some combination of 'winning' and 'agency' while the AI just wants 'winning'; probably the opponent just wants both of them to lose). One interesting dynamic that could then arise is that the human partner could threaten and punish the AI partner by making worse moves than the best moves they can see, if the AI doesn't give them enough control. To stop the human from doing this, the AI either has to

  • (C) negotiate to give the human some control, or
  • (D) remove all control from the human (e.g. force the queen to have no bad moves or no moves at all).

In particular, (D) seems like it would be expensive for the AI partner as it requires playing without the queen (against an opponent with no such restriction), so maybe the AI will let the human play sometimes.
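
To see why (D) might be expensive, here's a toy calculation (all payoffs invented) in which the AI compares conceding control (C), locking the human out (D), and simply ignoring the threat:

```python
P_MAX = 0.70       # Black's win probability if the AI plays unconstrained
SKILL_COST = 0.20  # win probability lost per unit of control ceded
SABOTAGE = 0.30    # loss if the human punishes with deliberately bad moves
LOCKOUT = 0.15     # cost of playing so the queen never matters (option D)
DEMAND = 0.25      # minimum share of control the human will accept

def win_prob(control_ceded, punished):
    return P_MAX - SKILL_COST * control_ceded - (SABOTAGE if punished else 0.0)

options = {
    "concede (C)": win_prob(DEMAND, punished=False),          # 0.65
    "lockout (D)": win_prob(0.0, punished=False) - LOCKOUT,   # 0.55
    "ignore threat": win_prob(0.0, punished=True),            # 0.40
}
print(max(options, key=options.get))  # with these numbers, the AI concedes
```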

2. I don't think it needs to be a stochastic chess variant. The game is set up so that the human gets to play whenever they roll a 6 on a (presumably six-sided) die. You said this stands in for the idea that in the real world, the AI system makes decisions on a faster timescale than the human. But this particular way of implementing the speed differential as a game mechanic comes at the cost of making the chess variant stochastic. I think that determinism is an important feature of standard chess. In theory, you can solve chess with adversarial look-ahead search, minimax, alpha-beta pruning, etc. But as soon as the die becomes involved, all of the players have to switch to expectiminimax. Rolling a six can suddenly throw off the tempo in your delicate exchange or your whirlwind manoeuvre. Etc.

I'm a novice at chess, so it's not like this is going to make a difference to how I think about the analogy (I will struggle to think strategically in both cases). And maybe a sufficiently accomplished chess player is familiar with stochastic variants already. But for someone in-between who is familiar with deterministic chess, maybe it's easier to consider a non-stochastic variant of the chess game, for example where the human gets the option to play every 6 turns (deterministically), which gives the same speed differential in expectation.
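
For what it's worth, here's a quick check (my own sketch) that the two schedules give the human the same expected share of turns, while only the stochastic one introduces variance:

```python
import random

def human_turns_stochastic(n, seed=0):
    rng = random.Random(seed)
    return sum(rng.randint(1, 6) == 6 for _ in range(n))

def human_turns_deterministic(n):
    return n // 6  # the human plays exactly every sixth turn

n = 600
# Both average n/6 = 100 human turns, but only the dice version has
# variance; that variance is what pushes players from minimax to
# expectiminimax.
print(human_turns_stochastic(n), human_turns_deterministic(n))
```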

comment by mfar · 2024-11-27T11:59:12.419Z · LW(p) · GW(p)

W. Ross Ashby's Law of Requisite Variety (1956) suggests fundamental limits to human control over more capable systems.

This law sounds super enticing and I want to understand it more. Could you spell out how the law suggests this?

I did a quick search of LessWrong and Wikipedia regarding this law.

Enough testimonials; the Wikipedia page itself describes the law as based on the observation that in a two-player game between the environment (disturber) and a system trying to maintain stasis (regulator), if the environment has D moves that all lead to different outcomes (given any move from the system), and the system has R possible responses, then the best the system can do is restrict the number of outcomes to D/R.

I can see the link between this and the descriptions from Yuxi_Liu, Roman Leventov, and Ashby. Your reading is a couple of steps removed. How did you get from D/R outcomes in this game to "fundamental limits to human control over more capable systems"? My guess is that you simply mean that if the more capable system is more complex / has more available moves / more "variety" than humans, then the law will apply with the human as the regulator and the AI as the disturber. Is that right? Could you comment on how you see capability in terms of variety?
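
For concreteness, here is a brute-force check of that counting argument (my own toy construction, with D = 12 disturbances, R = 3 responses, and an outcome function chosen so that each fixed response distinguishes all disturbances):

```python
from math import ceil
from itertools import product

D, R = 12, 3  # environment's disturbances, regulator's responses

def outcome(d, r):
    # For each fixed response r, every disturbance yields a distinct
    # outcome, matching the premise of the law as stated above.
    return (d + r) % D

# Search all R**D regulator policies for the smallest achievable
# variety of outcomes; the law says it cannot go below D/R.
best = min(
    len({outcome(d, policy[d]) for d in range(D)})
    for policy in product(range(R), repeat=D)
)
print(best, ceil(D / R))  # -> 4 4
```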

comment by Rauno Arike (rauno-arike) · 2024-11-27T14:22:19.893Z · LW(p) · GW(p)

The broader point is that even if AIs are completely aligned with human values, the very mechanisms by which we maintain control (such as scalable oversight and other interventions) may shift how the system operates in a way that produces fundamental, widespread effects across all learning machines

Would you argue that the field of alignment should be concerned with maintaining control beyond the point where AIs are completely aligned with human values? My personal view is that alignment research should ensure we're eventually able to align AIs with human values and that we can maintain control until we're reasonably sure that the AIs are aligned. However, worlds where relatively unintelligent humans remain in control indefinitely after those milestones have been reached may not be the best outcomes. I don't have time to write out my views on this in depth right now, but here's a relevant excerpt from the Dwarkesh podcast episode with Paul Christiano that I agree with:

Dwarkesh: "It's hard for me to imagine in 100 years that these things are still our slaves. And if they are, I think that's not the best world. So at some point, we're handing off the baton. Where would you be satisfied with an arrangement between the humans and AIs where you're happy to let the rest of the universe or the rest of time play out?"

Paul: "I think that it is unlikely that in 100 years I would be happy with anything that was like, you had some humans, you're just going to throw away the humans and start afresh with these machines you built. [...] And then I think that the default path to be comfortable with something very different is kind of like, run that story for a long time, have more time for humans to sit around and think a lot and conclude, here's what we actually want. Or a long time for us to talk to each other or to grow up with this new technology and live in that world for our whole lives and so on. [...] We should probably try and sort out our business, and you should probably not end up in a situation where you have a billion humans and like, a trillion slaves who would prefer revolt. That's just not a good world to have made."