Posts

Comments

Comment by Fogus McBogus on AI Control: Improving Safety Despite Intentional Subversion · 2023-12-14T17:23:35.840Z · LW · GW

God. How are we still at this level of primitive "safety"? Losh. Thank you for cracking on with experiments that humanity to date has somehow failed to do.

This paper explores techniques that prevent AI catastrophes even if AI instances are colluding to subvert the safety techniques.

Double thank for actually considering collusion. But where is this actually handling collusion? Is it this?

use redaction to make collusion harder

I was expecting something more than "we played with some GPTs again and sometimes caught them lying". This is weapons grade overselling!

...our empirical results aren’t the main reason that we think control is a promising approach to reducing risk from models intentionally subverting safety mechanisms–we mostly believe that because of conceptual arguments, some of which we’ll publish over the next few weeks.

  • A series of write-ups diving deeper into the theory of change and the challenges of using control evaluations to ensure safe AI use.

Aye, looking forward to it.