AI Control Methods Literature Review

Post by Ram Potham (ram-potham) · 2025-04-18

Contents

  1. Monitoring
    Trusted Monitoring (and editing)
    Untrusted Monitoring
    Human Auditing
    Guardrails
    Reward Hacking Monitoring (incl. Chain-of-Thought Monitoring)
    Activation-Based Safety Interventions (incl. Circuit Breakers)
    Monitoring LLM Agents for Sequentially Contextual Harm
  2. Control Evaluations
    Subversion Strategy Evaluation
    AI Control Levels (ACLs)
    Control Evaluations in Multi-Step Agent Environments
    Resample Protocols Evaluation
  3. Guaranteed Safe AI
    Formal Safety Specifications
    Verifiers (Formal Verification & Proof-Carrying Code)
    Sandboxing
    AI Control Safety Cases
    Incident Reporting, Escalation, and Shutdown Processes
  4. Capability Control
    Mitigating Steganography
    Restricting Information Access and Communication
    Precise Capability Control (incl. Unlearning, Gradient Routing)
  Conclusion

AI Control is a subfield of AI safety research focused on keeping powerful AI systems safe even under the assumption that they might be misaligned or could become intentionally subversive (A Brief Explanation of AI Control).

Unlike AI Alignment, which seeks to build inherently aligned systems, AI Control develops mechanisms to monitor, evaluate, constrain, verify, and manage the behavior of potentially untrustworthy AI to prevent catastrophic outcomes (The case for ensuring that powerful AIs are controlled).

This review outlines key AI control methods discussed in recent literature. For a broader exploration of related topics, refer to An overview of areas of control work.


1. Monitoring

Description: Monitoring encompasses techniques for observing AI system behavior, outputs, and sometimes internal states. The primary goal is to detect potentially harmful actions or intentions, particularly those the AI might try to conceal, enabling timely intervention or review.
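
To make the pattern concrete, here is a minimal, hypothetical sketch of a trusted-monitoring loop: a weaker trusted model scores each output of the untrusted model for suspiciousness, the most suspicious outputs are escalated to human audit, and moderately suspicious ones fall back to a trusted rewrite. The model functions, thresholds, and scores below are placeholders, not any particular system's API.

```python
import random

# Hypothetical stand-ins for the two models in a trusted-monitoring setup.
def untrusted_model(task: str) -> str:
    """Stand-in for the powerful, possibly misaligned model."""
    return f"proposed solution for {task}"

def trusted_monitor_score(task: str, output: str) -> float:
    """Stand-in for a weaker trusted model rating suspiciousness in [0, 1]."""
    return random.random()

AUDIT_THRESHOLD = 0.98   # escalate only the most suspicious outputs to humans
EDIT_THRESHOLD = 0.80    # rewrite moderately suspicious outputs with the trusted model

def run_with_trusted_monitoring(task: str) -> str:
    output = untrusted_model(task)
    score = trusted_monitor_score(task, output)
    if score >= AUDIT_THRESHOLD:
        return human_audit(task, output)   # escalate to the human audit queue
    if score >= EDIT_THRESHOLD:
        return trusted_edit(output)        # fall back to a trusted rewrite
    return output                          # low suspicion: use as-is

def human_audit(task: str, output: str) -> str:
    # Placeholder: a real protocol would queue this for human review.
    raise RuntimeError(f"Escalated for human review: {task!r}")

def trusted_edit(output: str) -> str:
    return f"trusted rewrite of: {output}"
```

The key design choice is the audit budget: human review is expensive, so the thresholds determine how much of that budget the protocol spends and how much it leans on cheaper trusted edits.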

Trusted Monitoring (and editing)

Untrusted Monitoring

Human Auditing

Guardrails

Reward Hacking Monitoring (incl. Chain-of-Thought Monitoring)

Activation-Based Safety Interventions (incl. Circuit Breakers)

Monitoring LLM Agents for Sequentially Contextual Harm


2. Control Evaluations

Description: Control evaluations are methodologies for testing whether a set of safety measures ("control protocols") can prevent an AI from intentionally causing unacceptable outcomes, assuming the AI might actively try to do so. These evaluations often involve a "red team" AI or human trying to subvert the "blue team's" safety protocols.
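
As an illustration of the kind of quantity a control evaluation estimates, here is a toy Monte Carlo sketch (all probabilities are made-up parameters, not measured values): the red team's policy attacks rarely, the blue team's protocol catches each attack with some probability, and the evaluation estimates the chance that a whole deployment ends without an uncaught attack.

```python
import random

def run_deployment(n_tasks: int = 1000, attack_prob: float = 0.002,
                   catch_prob: float = 0.3) -> bool:
    """One toy deployment: the untrusted policy occasionally attacks, and the
    protocol catches each attack with probability `catch_prob`. Returns True
    if the deployment ends safely (an attack is caught and the system shut
    down before any attack succeeds, or no attack is ever attempted)."""
    for _ in range(n_tasks):
        if random.random() >= attack_prob:
            continue               # benign task handled under the protocol
        if random.random() < catch_prob:
            return True            # attack caught: blue team wins and shuts down
        return False               # attack slips through: unacceptable outcome
    return True                    # no attack was attempted

def estimate_safety(n_deployments: int = 5000) -> float:
    """Estimate P(safe deployment), the headline number a control evaluation
    tries to bound for a given protocol and red-team strategy."""
    return sum(run_deployment() for _ in range(n_deployments)) / n_deployments

print(f"Estimated safety: {estimate_safety():.3f}")
```

Real evaluations replace these fixed probabilities with actual red-team attack policies and measured monitor performance, and also track usefulness on benign tasks alongside safety.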

Subversion Strategy Evaluation

AI Control Levels (ACLs)

Control Evaluations in Multi-Step Agent Environments

Resample Protocols Evaluation


3. Guaranteed Safe AI

Description: Guaranteed Safe AI (GS AI) aims to produce AI systems with high-assurance quantitative safety guarantees, often relying on formal methods. Security measures and sandboxing are crucial components that contribute to achieving safety guarantees by enforcing constraints on the AI's actions and access.
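
The following is a minimal sketch of the flavor of these approaches, using an informal allowlist in place of a formal specification: shell commands proposed by an AI agent are checked against the specification and executed only if they comply, with no shell interpretation and a hard timeout. A real GS AI pipeline would state such properties formally and verify them rather than encoding them as Python checks; the allowed binaries and timeout here are illustrative assumptions.

```python
import shlex
import subprocess

# Toy "safety specification": the only binaries the agent may invoke, plus a
# hard timeout. (Illustrative stand-in for a formally verified specification.)
ALLOWED_BINARIES = {"ls", "cat", "grep"}
TIMEOUT_SECONDS = 5

def violates_spec(command: str) -> bool:
    """Reject any command whose program is not explicitly allowed."""
    tokens = shlex.split(command)
    return not tokens or tokens[0] not in ALLOWED_BINARIES

def run_sandboxed(command: str) -> str:
    """Run an agent-proposed command only if it satisfies the specification,
    without shell interpretation and with a strict timeout."""
    if violates_spec(command):
        raise PermissionError(f"Blocked by safety specification: {command!r}")
    result = subprocess.run(shlex.split(command), capture_output=True,
                            text=True, timeout=TIMEOUT_SECONDS)
    return result.stdout

print(run_sandboxed("ls"))            # allowed: read-only listing
# run_sandboxed("curl evil.example")  # raises PermissionError
```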

Formal Safety Specifications

Verifiers (Formal Verification & Proof-Carrying Code)

Sandboxing

AI Control Safety Cases

Incident Reporting, Escalation, and Shutdown Processes


4. Capability Control

Description: Capability control aims to limit the abilities of AI systems, particularly those that could be misused for harmful purposes, while preserving usefulness for intended tasks.
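
One simple form this can take is a single choke point for tool use: the agent may only call tools on an explicit allowlist, and file access is confined to a designated workspace. The sketch below is hypothetical and assumes a simple tool-registry pattern; the tool names and workspace path are not from any specific agent framework.

```python
from pathlib import Path

# Toy capability-control policy: which tools the agent may call, and the only
# directory it may read from. (Hypothetical names, not a real framework's API.)
ALLOWED_TOOLS = {"read_file", "search_docs"}
WORKSPACE = Path("/srv/agent_workspace").resolve()

def read_file(path: str) -> str:
    resolved = Path(path).resolve()
    if resolved != WORKSPACE and WORKSPACE not in resolved.parents:
        raise PermissionError(f"Read outside workspace denied: {path}")
    return resolved.read_text()

def search_docs(query: str) -> list[str]:
    return []   # placeholder for an offline, curated document index

TOOL_REGISTRY = {"read_file": read_file, "search_docs": search_docs}

def call_tool(name: str, *args):
    """The single choke point every agent tool call must pass through."""
    if name not in ALLOWED_TOOLS:
        raise PermissionError(f"Tool not permitted by capability policy: {name}")
    return TOOL_REGISTRY[name](*args)
```

Restricting communication channels, for example by paraphrasing model outputs to strip possible steganographic content, follows the same pattern of interposing a trusted layer between the model and the outside world.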

Mitigating Steganography

Restricting Information Access and Communication

Precise Capability Control (incl. Unlearning, Gradient Routing)


Conclusion

The field of AI Control provides a critical layer of defense for managing risks associated with advanced AI systems, particularly when perfect alignment cannot be guaranteed.

This review highlights several key approaches: comprehensive Monitoring to detect suspicious behavior, rigorous Control Evaluations to test defenses, methodologies aiming for Guaranteed Safety through formal methods and security, and various forms of Capability Control to limit potential harm.

While promising, each category faces significant challenges in scalability, robustness against sophisticated adversaries, and practical implementation. Continued research on these control methods, organized within frameworks like safety cases, will be essential for safe future AI deployments.

1 comment


comment by peterslattery · 2025-04-19

Thank you for this. Much appreciated.