A Brief Explanation of AI Control

post by Aaron_Scher · 2024-10-22T07:00:56.954Z · LW · GW · 1 comment

There are a bunch of great blog posts about the AI Control research direction, but there isn't a short explanation or summary that I'm aware of. Here's an attempt at such an explanation, based on my understanding, and aimed at a reader with AISF-level knowledge of AI safety.

AI Control has only begun receiving attention in the last year or so. Curious readers could read through the various posts on the topic from Buck Shlegeris, Ryan Greenblatt, and colleagues. Thanks to Buck for feedback on this explanation.

If you think I got anything wrong in this explanation, please leave a comment. 

  1. ^

    Here's a shot at explaining exploration hacking concisely: exploration hacking is a method of sandbagging in which a model never tries some useful strategy during training (thus never generating successful trajectories that include that strategy, and never giving the optimization process anything to reinforce), and then deploys the useful strategy later (e.g., during an escape attempt). Said differently, there is some useful strategy that would have improved the model's measured capabilities, but because the policy/model never explored that strategy in training, it is never reinforced, and so the higher capabilities are never elicited.

    A key difficulty for a model trying to exploration hack is that it must distinguish trajectories for which supervised data exists (generated by humans or by other models, so that underperformance would produce weight updates against that underperformance) from trajectories where it actually has an opportunity to exploration hack by withholding the high-performance behavior (because there is no supervised/gold answer to compare the model's output against).
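    To make this concrete, here is a minimal toy sketch (my own illustration, not from the post) of exploration hacking in a two-armed bandit trained with REINFORCE. The action names, rewards, and hyperparameters are all invented; the point is only that a strategy the policy never samples during training never shows up in a rewarded trajectory, so the optimizer has nothing to reinforce, even though the policy can still use the strategy at deployment.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Toy setup: a two-armed bandit standing in for one training episode.
    #   action 0 = mundane behavior that fails the task (reward 0.0)
    #   action 1 = the "useful strategy" that solves the task (reward 1.0)
    REWARDS = np.array([0.0, 1.0])
    LEARNING_RATE = 0.3
    TRAINING_STEPS = 300


    def softmax(logits: np.ndarray) -> np.ndarray:
        z = np.exp(logits - logits.max())
        return z / z.sum()


    def train(explores_useful_strategy: bool) -> np.ndarray:
        """REINFORCE on a softmax policy over the two actions.

        If explores_useful_strategy is False, the policy never samples action 1
        during training (the exploration-hacking case), so no rewarded trajectory
        containing that strategy is ever generated for the optimizer to reinforce.
        """
        logits = np.zeros(2)
        for _ in range(TRAINING_STEPS):
            probs = softmax(logits)
            if explores_useful_strategy:
                action = rng.choice(2, p=probs)
            else:
                action = 0  # the sandbagger withholds the useful strategy
            reward = REWARDS[action]
            # Policy-gradient step: reward * (one_hot(action) - probs)
            grad = -probs
            grad[action] += 1.0
            logits += LEARNING_RATE * reward * grad
        return logits


    honest_logits = train(explores_useful_strategy=True)
    sandbag_logits = train(explores_useful_strategy=False)

    print("honest policy,      P(useful strategy):", round(float(softmax(honest_logits)[1]), 3))
    print("sandbagging policy, P(useful strategy):", round(float(softmax(sandbag_logits)[1]), 3))
    # The honest policy's probability of using the strategy climbs toward 1. The
    # sandbagger's never moves: every trajectory it produced during training had
    # reward 0, so there was nothing for the optimizer to reinforce. At deployment
    # it can still simply take action 1 whenever it chooses to.
    ```

    The sketch also gestures at the difficulty above: hard-coding `action = 0` is only safe for the sandbagger on episodes where the reward signal comes from its own sampled trajectories. On examples with a supervised/gold answer, the low-performance output would be compared directly against that answer and trained away, so the model has to tell the two cases apart.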

1 comment

comment by yams (william-brewer) · 2024-10-22T08:09:46.861Z · LW(p) · GW(p)

I like this post, but I think Redwood has varied somewhat on whether control is for getting alignment work out of AIs vs. getting generally good-for-humanity work out of them and pushing for a pause once they reach some usefulness/danger threshold (e.g., well before superintelligence).

[based on my recollection of Buck's seminar in MATS 6]