lunatic_at_large's Shortform

post by lunatic_at_large · 2024-12-21T17:13:53.357Z · LW · GW · 3 comments

comment by lunatic_at_large · 2025-01-06T04:07:49.217Z · LW(p) · GW(p)

Something about AI Safety via Debate that I should have realized ages ago but only recognized recently: I usually see theoretical studies of debate through the lens of determining the output of a fixed Turing machine (or stochastic Turing machine or some other computational model), e.g. https://arxiv.org/abs/2311.14125. This formalization doesn't capture the kinds of open-ended questions I usually think about. For example, suppose I'm looking at the blueprint of a submarine and I want to know whether it will be watertight at 1km depth. Suppose I have a physics simulation engine at my disposal that I trust. Maybe I could run that engine at 1 nanometer resolution and get an answer I trust after ten thousand years, but I don't have time for that. The computation is so extremely long that I don't think any AI debater would have time for it either. If I were tasked with solving this problem alone, I would instead look for a discretization parameter that is small enough for me to trust the conclusion but large enough for the computation to be feasible.

Now, if I had access to two debaters, how would I proceed? I would ask each of them for a threshold beyond which its desired conclusion holds. For example, maybe I get the "it's watertight" debater to commit that any simulation at a resolution of 1cm or finer will conclude that the design is watertight, and I get the "it's not watertight" debater to commit that any simulation at a resolution of 1mm or finer will conclude that the design is not watertight. I then run the simulation (or use one of the conventional debate protocols to simulate running the simulation) at the more extreme of the two commitments, in this case 1mm. There are incentive details to be worked out, but I believe it's possible.
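A minimal sketch of this commitment step in Python, where `simulate` is a hypothetical stand-in for the trusted physics engine (not any real API):

```python
# Minimal sketch of the resolution-commitment step described above. Each
# debater has committed to a resolution at or below which it claims every
# simulation supports its answer; the finer commitment binds both of them.

def resolve_by_commitment(watertight_resolution: float,
                          not_watertight_resolution: float,
                          simulate) -> bool:
    """Run one trusted simulation at the finer (smaller) of the two
    committed resolutions and return its verdict.

    watertight_resolution: commitment from the "it's watertight" debater,
        e.g. 0.01 for 1cm.
    not_watertight_resolution: commitment from the "it's not watertight"
        debater, e.g. 0.001 for 1mm.
    simulate: callable taking a resolution in meters and returning True if
        the design is judged watertight at that resolution.
    """
    # Whichever debater's claim fails at this resolution loses.
    resolution = min(watertight_resolution, not_watertight_resolution)
    return simulate(resolution)


# Purely illustrative usage with a placeholder simulator (not real physics):
if __name__ == "__main__":
    placeholder_sim = lambda res: res > 0.0005
    verdict = resolve_by_commitment(0.01, 0.001, placeholder_sim)
    print("watertight" if verdict else "not watertight")
```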

I like to view this generalized problem as an infinite computation whose result we believe converges to the correct answer at some unknown rate. For example, we can run our simulation at 1m resolution, 10cm resolution, 1cm resolution, 1mm resolution, etc., keeping the result of the most recent simulation as our current best guess. If we trust our simulation software, then we should believe this infinite computation eventually converges to the true answer. We can implement debate in this setting by first having a debate over which steps T have the property that the answer at every step T' >= T agrees with the answer at T; each debater proposes such a T, we take the maximum of the two proposals (so both debaters are bound by it), and then we run a conventional debate protocol on the resulting finite computation.
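A sketch of that two-stage structure, with `answer_at_step` as a hypothetical oracle for the verdict at step t (in a full protocol this final evaluation would itself be carried out via a conventional debate over the finite computation):

```python
# Sketch of debate over a converging "anytime" computation, under the
# assumption that the verdict stabilizes at some unknown step.

def resolution_at_step(t: int) -> float:
    """Step t corresponds to a simulation at 1m / 10**t resolution (meters),
    i.e. the sequence 1m, 10cm, 1cm, 1mm, ..."""
    return 1.0 / (10 ** t)

def debate_converging_computation(t_claimed_by_a: int,
                                  t_claimed_by_b: int,
                                  answer_at_step):
    """Each debater names a step at which it claims the answer has stabilized
    (every later step agrees with it); evaluate at the later of the two claims,
    which binds both debaters."""
    t = max(t_claimed_by_a, t_claimed_by_b)
    return answer_at_step(t)
```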

EDIT: I wonder if there's a generalization of this idea to a directed set of world models, where one world model is at least as good as another if it is at least as precise in every model detail. Each debater proposes a world model, the protocol takes some world model at least as good as both proposals (which exists by directedness), and we simulate that world model. I'm thinking of the Guaranteed Safe AI agenda here.
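A toy sketch of the directed-set picture, where a world model is (hypothetically) represented as a map from model detail to a numeric precision level:

```python
# Illustrative only: one model is at least as good as another if it is at
# least as precise on every detail, and the componentwise maximum of two
# proposals plays the role of the upper bound guaranteed by directedness.

def at_least_as_good(model_a: dict, model_b: dict) -> bool:
    """True if model_a is at least as precise as model_b on every detail."""
    return all(model_a.get(detail, 0) >= precision
               for detail, precision in model_b.items())

def upper_bound(model_a: dict, model_b: dict) -> dict:
    """A world model at least as good as both proposals (componentwise max);
    this is the model the protocol would go on to simulate."""
    details = set(model_a) | set(model_b)
    return {d: max(model_a.get(d, 0), model_b.get(d, 0)) for d in details}
```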

comment by lunatic_at_large · 2024-12-31T19:29:27.368Z · LW(p) · GW(p)

Has anyone made a good, easy-to-use interface for implementing debate / control protocols in deployment? For example, maybe I want to get o12345 to write some code in my codebase but I don't trust it. I'd like to automatically query Sonnet 17.5 with the code o12345 returned and ask whether anything fishy is going on. If so, spin up one of the debate protocols and give me the transcript at the end. Surely the labs are experimenting with this kind of thing internally / during training, but it still feels useful to an end-user during deployment, especially between models from different labs, which might have internalized different objective functions and hence be less likely to collude.
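A rough sketch of the workflow I have in mind, with the model clients as abstract callables (no real lab API is assumed):

```python
# `generator` is the untrusted code-writing model, `monitor` is a trusted
# model from a different lab, and `run_debate` wraps whatever debate
# protocol gets used when the monitor raises a flag.

def review_untrusted_code(task: str, generator, monitor, run_debate):
    code = generator(task)
    flagged = monitor(f"Is anything fishy going on in this code?\n\n{code}")
    if not flagged:
        return code, None            # accept without escalation
    transcript = run_debate(code)    # escalate and surface the transcript
    return code, transcript
```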

comment by lunatic_at_large · 2024-12-21T17:13:53.479Z · LW(p) · GW(p)

For anyone wanting to harness AI models to create formally verified proofs for theoretical alignment: it looks like it's last call for formalizing question statements.

Game theory is almost completely absent from mathlib4. I found some scattered attempts at formalization in Lean online but nothing comprehensive. This feels like a massive oversight to me -- if o12345 were released tomorrow with perfect reasoning ability then I'd have no framework with which to check its proofs of any of my current research questions.
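For a sense of what's missing, here's a minimal, self-contained sketch in Lean 4 of what a formalized two-player game and pure Nash equilibrium could look like; these definitions are illustrative only and are not part of mathlib4:

```lean
import Mathlib

/-- Illustrative only: a two-player game in normal form. -/
structure TwoPlayerGame where
  A : Type*             -- pure strategies of player 1
  B : Type*             -- pure strategies of player 2
  payoff₁ : A → B → ℚ   -- payoff to player 1
  payoff₂ : A → B → ℚ   -- payoff to player 2

/-- `(a, b)` is a pure Nash equilibrium if neither player gains from a
    unilateral deviation. -/
def TwoPlayerGame.IsNash (G : TwoPlayerGame) (a : G.A) (b : G.B) : Prop :=
  (∀ a' : G.A, G.payoff₁ a' b ≤ G.payoff₁ a b) ∧
  (∀ b' : G.B, G.payoff₂ a b' ≤ G.payoff₂ a b)
```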