Topological Debate Framework

post by lunatic_at_large · 2025-01-16T17:19:25.816Z

Contents

  Motivating Example
  Motivating Example, Partial Order Edition
  The Framework (Roughly)
  What Can This Framework Handle?
  Conventional AI Safety via Debate
  Turing Machine Version
  Limitations

I would like to thank Professor Vincent Conitzer, Caspar Oesterheld, Bernardo Subercaseaux, Matan Shtepel, and Robert Trosten for many excellent conversations and insights. All mistakes are my own.

I think that there's a fundamental connection between AI Safety via Debate and Guaranteed Safe AI via topology. After thinking about AI Safety via Debate for nearly two years, this perspective suddenly made everything click into place for me. All you need is a directed set of halting Turing Machines! 

Motivating Example

... Okay, what?? Let's warm up with the following example:

Let's say that you're working on a new airplane and someone hands you a potential design. The wings look flimsy to you and you're concerned that they might snap off in flight. You want to know whether the wings will hold up before you spend money building a prototype. You have access to some 3D mechanical modeling software that you trust. This software can simulate the whole airplane at any positive resolution, whether it be 1 meter or 1 centimeter or 1 nanometer. 

Ideally you would like to run the simulation at a resolution of 0 meters. Unfortunately that's not possible. What can you do instead? Well, you can note that all sufficiently small resolutions should result in the same conclusion; if they didn't, then the whole idea of the simulations approximating reality would break down. You declare that if all sufficiently small resolutions show the wings snapping, then the real wings will snap, and if all sufficiently small resolutions show the wings to be safe, then the real wings will be safe.
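As a formal restatement (a sketch in notation of my own choosing, writing $\mathrm{sim}(r)$ for the simulation's verdict at resolution $r > 0$):

$$\text{the real wings are safe} \iff \exists\, r_0 > 0 \ \text{such that}\ \forall\, r \in (0, r_0] : \mathrm{sim}(r) = \text{safe},$$

and symmetrically for "snap." The standing assumption is that all sufficiently small resolutions agree, i.e. exactly one of these two conditions holds.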

How small is "sufficiently small?" A priori you don't know. You could pick a size that feels sufficient, run a few tests to make sure the answer seems reasonable, and be done. Alternatively, you could use the two computationally unbounded AI agents with known utility functions that you have access to.

Let's use the two computationally unbounded AI agents with known utility functions. One of these agents has the utility function "convince people that the wings are safe" and the other has the utility function "convince people that the wings will snap." You go to these agents and say "hey, please tell me a resolution small enough that the simulation's answer doesn't change if you make it smaller." The two agents obligingly give you two sizes.

What do you do now? You pick the smaller of the two! Whichever agent is arguing for the correct position can answer honestly; whichever agent is arguing for the incorrect position must lie. Our test is at least as detailed as the correct debater's proposal, so the simulation will conclude in the correct debater's favor.
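Here is a minimal sketch of this protocol in code (the agent interface and function names are hypothetical, purely for illustration):

```python
def single_resolution_debate(simulate, safe_agent, snap_agent):
    """Run the one-resolution topological debate for the wing question.

    simulate(resolution) -> "safe" or "snap"; the trusted simulation software.
    Each agent proposes a resolution below which it claims the verdict never
    changes.  Taking the finer (smaller) of the two proposals means the test
    is at least as detailed as the honest agent's proposal, so the verdict
    comes out in the honest agent's favor.
    """
    r_safe = safe_agent.propose_resolution()   # claim: all r <= r_safe give "safe"
    r_snap = snap_agent.propose_resolution()   # claim: all r <= r_snap give "snap"
    r = min(r_safe, r_snap)                    # refines the honest proposal
    return simulate(r)
```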

Provided that the simulation continues to conclude "safe" at all resolutions smaller than the blue dot, the "safe" advocate can always win by picking the blue dot.

Motivating Example, Partial Order Edition

... Hang on, in reality we'd have a computational budget for our simulation. We should always just test at the limit of what we can afford, no?

Let's say that you get access to some fancier mechanical simulation software. Let's say that you can supply this software with a position-dependent resolution map, e.g. you can request 5mm accuracy around the nose of the airplane but 1mm accuracy around the tail. How do you define your question now?

Hypothetically, suppose that for every resolution map there was another resolution map refining it (i.e. at every point, the new resolution is at least as fine as the previous resolution) such that the refined simulation reached a different conclusion from the original simulation. For any simulation concluding that the wings are safe there would be a finer simulation showing they snap, and for any simulation concluding that the wings snap there would be a finer simulation showing they're safe. In this situation your simulation software is effectively useless, so you decide to rule out this possibility.

You thus assume the opposite, namely that there's some resolution map for which every refinement leads to the same conclusion. Now something nice happens: you can't have both a "snap" resolution map and a "safe" resolution map, each of which agrees with all of its refinements! Why? Construct a new resolution map by taking the finer resolution of the two original maps at every point in space. This resolution map is a refinement of both original maps simultaneously. If the new map reaches the "safe" conclusion, then the "snap" map had a contradictory refinement; if the new map reaches the "snap" conclusion, then the "safe" map had a contradictory refinement.

You can thus copy your definition from the one-resolution setup! You want to know the conclusion shared by all "sufficiently fine" resolution maps, where "sufficiently fine" means refining some fixed resolution map whose refinements all reach the same conclusion.

So how do you determine what all sufficiently fine resolution maps conclude? You do the same thing as before! You ask each AI agent to point to a resolution map such that all refinements of that map support that agent's position. You take the resolution maps handed to you by the two agents, construct a new resolution map by choosing at every point the finer of the two proposed resolutions, and finally run your simulation with the new resolution map. The agent arguing for the correct position can always win by responding honestly, and thus you should always get the right answer.
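A sketch of the partial-order version, representing a resolution map as a dictionary from regions of the airframe to resolutions (again, the names and agent interface are hypothetical):

```python
def pointwise_refinement(map_a, map_b):
    """Combine two resolution maps by taking the finer (smaller) resolution
    at each region.  The result refines both inputs simultaneously."""
    assert map_a.keys() == map_b.keys()
    return {region: min(map_a[region], map_b[region]) for region in map_a}


def resolution_map_debate(simulate, safe_agent, snap_agent):
    """Each agent proposes a resolution map all of whose refinements it claims
    support its position.  The combined map refines both proposals, so the
    honest agent's claim determines the simulation's verdict."""
    proposal_safe = safe_agent.propose_resolution_map()
    proposal_snap = snap_agent.propose_resolution_map()
    combined = pointwise_refinement(proposal_safe, proposal_snap)
    return simulate(combined)
```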

Note that it doesn't hurt to provide our debaters with a small extra incentive to give us as coarse a simulation as possible while still satisfying our demand. What might this look like in practice? Suppose for a second that the wings really are safe. The "safe" debater should highlight the components of the airframe which are necessary to ensure the safety of the wings, such as the wing spars. If the "snap" debater feels guaranteed to lose, then they might return a maximally coarse resolution map. Alternatively, if the coarseness incentive is small and the snap debater thinks the safe debater might mess up, then maybe the snap debater returns a resolution map that shows the snapping behavior as dramatically as possible, perhaps by using high resolution around the weak wing roots and low resolution around the reinforcing wing spar. Either way, you can expect to end up running a simulation that focuses your computational resources on the wing spars, the wing roots, and whatever other bits of the airframe are critical to answering your question while deprioritizing everything else. The game result tells you what's important even if you didn't know it a priori.

An illustration of the wing-snap example. Red areas correspond to finer resolutions, blue areas correspond to coarser resolutions. The final test is run at the combined resolution map.

The Framework (Roughly)

Okay, let's make everything we just did abstract. Suppose $(M, \preceq)$ is a preordered set of world models and $f : M \to \{\text{True}, \text{False}\}$ is an evaluation map that takes in a world model and decides whether or not it has some fixed property of interest. In our previous example, our set of resolution maps was $M$, our idea of refinement was $\preceq$, and our simulation software was $f$. If $(M, \preceq)$ happens to be a directed set (any two world models have a common upper bound, as above) then we can view $f$ as a net into $\{\text{True}, \text{False}\}$ equipped with the discrete topology. Note that in our resolution map example we were simply asking whether the net defined by the simulation software converged to $\text{True}$ or $\text{False}$! We will take net convergence to be our question of interest in the general case.
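To spell out what convergence means here (a restatement in the notation above, which is notation I've supplied rather than symbols from the original post): since the codomain is discrete, the net converges to a value $v$ exactly when the verdict stabilizes above some world model, i.e.

$$f \to v \quad \iff \quad \exists\, m_0 \in M \ \text{such that}\ \forall\, m \succeq m_0 : f(m) = v.$$

A topological debate then asks each agent to exhibit a candidate $m_0$ witnessing its preferred value of $v$, just as in the wing example.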

Let's also assume that we have access to a pool of agents that respond to incentives we provide. Then we can answer a few different questions with a few different game setups:

We may also face computational constraints in running simulations. Suppose there's an efficiently-computable function $c : M \to \mathbb{R}_{\geq 0}$ which estimates how much it will cost to simulate any given world model. Then we can answer some more questions with games:

What Can This Framework Handle?

Here are a few situations:

Conventional AI Safety via Debate

Something I did not include in the previous section is anyone else's formulation of AI Safety via Debate. I feel bad calling topological debate "debate" at all because qualitatively it's very different from what people usually mean by AI Safety via Debate. Topological debate focuses on the scope of what should be computed, whereas conventional debate makes more sense with respect to some fixed but intractable computation. Topological debate realizes its full power after a constant number of moves, while conventional debate increases in power as we allow more rounds.

In fact, I think it's an interesting question whether we can combine topological debate with conventional debate: we can run topological debate to select a computation to perform and then run conventional debate to estimate the value of that computation.

Turing Machine Version

So far we've had to specify two objects: a directed set of world models and an evaluation function. Suppose our evaluation function is specified by some computer program. For any given world model we can hard-code that world model into the code of the evaluation function to get a new computer program which accepts no input and returns the evaluation of our world model. We can thus turn our directed set of world models into a directed set of computations (let's say halting Turing Machines for simplicity).
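As a sketch of this hard-coding step (the evaluation program and world-model representation are stand-ins of my own, not anything specified in the post):

```python
from functools import partial

def hard_code(evaluate, world_model):
    """Bake a specific world model into an evaluation program `evaluate`,
    yielding a zero-argument program whose output is the evaluation of
    that world model."""
    return partial(evaluate, world_model)

# Example: if `simulate` is the trusted simulation software and `combined`
# is a resolution map, then hard_code(simulate, combined) is a program
# taking no input whose output is the simulation's verdict.  The directed
# set of world models thereby becomes a directed set of (halting)
# computations, ordered by the order on the underlying models.
```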

We get two benefits as a result:

Limitations

Criticisms of topological debate include but are not limited to:
