Really great list!
- It might be that sampling from the model sometimes yields an alignment-faking persona and sometimes doesn't. In that case, sampling several times to get an ensemble is helpful.
This could be generalized to any kind of inconsistency that leads to the model not always choosing to lie, e.g. the model makes reasoning errors some fraction of the time, has imperfect situational awareness, or picks inconsistent collusion signals.
You can also resample during red-teaming to increase the effective length of the upfront red-teaming-of-the-monitor period, forcing the model to maintain consistent goals, reasoning, situational awareness, and Schelling points to evade being caught (rough sketch of the resampling idea below).
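As a minimal sketch of the resampling idea (everything here, including `query_model` and the "suspicious" verdict format, is a hypothetical stand-in for however the protocol actually queries the untrusted model):

```python
import collections

def ensemble_verdict(query_model, task, n_samples=8, flag_threshold=1):
    """Resample an untrusted model's verdict on the same task.

    If the model only sometimes adopts an alignment-faking persona (or only
    sometimes executes its collusion strategy correctly), independent samples
    will disagree, and a single honest or inconsistent sample can expose it.
    """
    verdicts = [query_model(task) for _ in range(n_samples)]
    counts = collections.Counter(verdicts)
    inconsistent = len(counts) > 1                            # samples disagree with each other
    flagged = counts.get("suspicious", 0) >= flag_threshold   # at least one sample raised a flag
    return {"verdicts": verdicts, "escalate": inconsistent or flagged}
```

The point is just that any per-sample inconsistency in the model's decision to lie compounds across samples, so the blue team can spend inference compute to buy extra reliability.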
The general concern here is that your human auditors might not be perfect. In practice, when making a safety case for a deployment, I think it would make sense to run a red-team/blue-team game with a human/AI red team and your real auditors to see whether your auditors are reliable.
I'm not sure how the proposed steganographic attack would work though. How do you ensure that your backdoored code will hash to XYZ if you can't see the backdoor opportunity until after you've decided on XYZ?
(Not an author, but Buck suggested I look at this)
I think there are a few reasons why it's not clear from these results whether the steganography introduced here would reduce safety in an untrusted-monitoring control protocol like that in Greenblatt et al.:
- The untrusted model needs to collude reliably in order to defeat the blue team, because if it messes up once during the upfront evaluation period (30 problems), it loses. Having a 90% chance of communicating correctly to the monitor on each problem gives a 0.9^30 ≈ 4% chance of passing the upfront eval period. (This is a lossy translation of the 90% success rate in Mathew et al.: e.g. Greenblatt et al. targets a 1.4% overall positive rate after the upfront eval, while Mathew et al. calibrates to at least a 7% false-positive rate, and the suspiciousness scores in Greenblatt et al. come from the receiver itself.)
- Because of the above, the win rate is quite sensitive to the reliability of collusion, and it's pretty unclear what the error bars should be around the 90% number (see the back-of-the-envelope sketch after this list).
- The untrusted model in Greenblatt et al. not only has to implement the steganographic signal and complete the task but also has to implement a backdoor.
- It's a different distribution of tasks from the APPS backdoor setting.
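As a back-of-the-envelope illustration of the first two points (the 30-problem window, the lose-on-one-slip rule, and the 90% figure are from the setups discussed above; the other reliability values are purely illustrative, not measured error bars):

```python
# Chance a colluding untrusted model survives the 30-problem upfront eval,
# given it must signal correctly on every problem (one slip and it loses).
for reliability in (0.80, 0.85, 0.90, 0.95, 0.99):
    p_survive = reliability ** 30
    print(f"reliability {reliability:.2f} -> P(pass upfront eval) = {p_survive:.1%}")

# reliability 0.90 gives about 4%, but 0.95 gives about 21% and 0.99 about 74%.
```

Small shifts in the true per-problem collusion reliability move the survival probability by an order of magnitude, which is why the error bars matter so much.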
So overall I think it's pretty unclear how well the steganography in Mathew et al. would compare to the while-loop signal that performed best in Greenblatt et al.
With appropriate collusion mitigations I think untrusted monitoring is an excellent control protocol, so I'm glad to see work investigating more natural learned collusion capabilities and the effectiveness of countermeasures. I'd be excited to see evaluations of these steganographic strategies in the context of untrusted monitoring so that the results have an easier interpretation in terms of safety and usefulness.
I think that if they:
- Find a feature that is necessary for defection in all of the cases of defection in a diverse and realistic set of model organisms
- Locate that feature with very little optimization on the defection cases
I'd be fairly confident (>~80%) that it would catch any particular deployment-time defection attempt from an ASL-4 schemer. This would cut risk from the deployment by >5x when paired with compelling incrimination and a shutdown (the arithmetic is sketched after this list). This is a substantial risk reduction, and I might even buy this safety case if there weren't better options for this threat model (e.g. untrusted monitoring). My main concerns are:
- It seems unlikely we would get a feature that meets the above two criteria
- The feature might have too many false positives
- The possibility that the model organisms were sabotaging your eval by coordinating to route their defection attempts through a particular feature (this seems hard at ASL-4, though that claim would itself need an eval).
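For concreteness, here is the rough arithmetic behind the >5x figure, under the assumption that a catch reliably leads to incrimination and shutdown, so only the first attempt that slips through matters:

```latex
% p = per-attempt catch probability, assumed to be at least 0.8 above
P(\text{a defection attempt goes uncaught}) \;\le\; 1 - p \;\le\; 0.2
\quad\Longrightarrow\quad
\text{risk reduction factor} \;\ge\; \frac{1}{1-p} \;\ge\; 5
```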
This is really exciting to see, not only as a way to better understand how people at Anthropic are thinking about making ASL-4 safety cases, but also as a survey of how various safety agendas might play a role in practice!
Here are some thoughts about the details:
- Given that the mechinterp safety case involves running a kind of control evaluation, other monitoring methods can largely be substituted into the eval. This could enable cheap testing of other monitors, e.g. ones based on supervised probing or mechanistic anomaly detection (see the sketch after this list).
- I like the idea of checking whether a feature is causally necessary for defection in your model organisms, as this gives extra certainty that its recall generalizes.
- "Simulation Procedure S, which we are confident generates situations from a distribution that the model cannot distinguish from the genuine deployment distribution." (As is probably mentioned by others) This isn’t necessary for control safety cases as long as you’ve done a sufficient job of eliciting performance from your model (e.g. i.i.d. elicitation with human demonstrations of attacks, which as is mentioned might be difficult).
- The third kind of safety case is pretty novel to me, which makes it exciting, but I'm not yet convinced by it, for reasons that are mostly already outlined in the blogpost.
- The evidence becomes very diluted by multiple hypothesis testing when trying to rule out any cluster of deceptive personas being reinforced. We probably wouldn’t be able to get convincing affirmative evidence with this approach without a meta-level eval.
- I think exploration hacking shouldn't be a concern here if you can get the 10-90% with SFT. (EDIT: However, you might be concerned that the 10-90% does not contain the particular deceptive personas that get reinforced.)
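To gesture at what I mean by substituting monitors into the same eval, here is a minimal sketch; everything in it (`monitor`, the scenario/label format, the fixed-FPR thresholding) is a hypothetical stand-in rather than the procedure from the post:

```python
def control_eval(monitor, scenarios, defection_labels, budget_fpr=0.01):
    """Run the same simulated-deployment scenarios against any monitor.

    `monitor` maps a scenario transcript to a suspiciousness score. Only the
    monitor changes between runs; scenario generation, the auditing budget,
    and the win condition stay fixed, so a mech-interp feature, a supervised
    probe, or a mechanistic anomaly detector can be compared head-to-head.
    """
    scores = [monitor(s) for s in scenarios]
    # Pick the audit threshold so the false-positive rate on benign
    # scenarios stays within the auditing budget.
    benign = sorted(sc for sc, y in zip(scores, defection_labels) if not y)
    threshold = benign[int((1 - budget_fpr) * (len(benign) - 1))]
    caught = sum(sc > threshold for sc, y in zip(scores, defection_labels) if y)
    return caught / sum(defection_labels)  # recall on defections at the fixed FPR
```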
Their addendum contains measurements on refusals and harmlessness, though these aren't that meaningful and weren't advertised.
You seem to be talking primarily about Ilya, but it seems especially unclear why Jan is also leaving now.
I think this is a very important and neglected area! Its low tractability is one of its central features, but progress on it now greatly aids progress later by making these hard-to-measure aims easier to measure.
The only way ChatGPT can control anything is by writing text, so figuring out that it should write the text that should appear in the image seems pretty straightforward. It only needs to rationalize why this would work.
I think the relevant notion of "being an agent" is whether we have reason to believe it generalizes like a consequentialist (e.g. its internal cognition considers possible actions and picks among them based on expected consequences and relies minimally on the imitative prior). This is upstream of the most important failure modes as described by Roger Grosse here.
I think Sora is still in the bottom left like LLMs, as it has only been trained to predict. Without further argument or evidence, I would expect that it mostly hasn't learned to simulate consequentialist cognition, similar to how LLMs haven't demonstrated this ability yet (e.g. they fail to win a chess game in an easy but OOD situation).
I'll add another one to the list: "Human-level knowledge/human simulator"
Max Nadeau helped clarify some ways in which this framing introduced biases into my and others' models of ELK and scalable oversight. Knowledge is hard to define and our labels/supervision might be tamperable in ways that are not intuitively related to human difficulty.
Different measurements of human difficulty only correlate at about 0.05 to 0.3, suggesting that human difficulty might not be a very meaningful concept for AI oversight, or that our current datasets for experimenting with scalable oversight don't contain large enough gaps in difficulty to make meaningful measurements.
Adversarial mindset. Adversarial communication is to some extent necessary to communicate clearly that Conjecture pushes back against AGI orthodoxy. However, inside the company, this can create a soldier mindset and poor epistemics. With time, adversariality can also insulate the organization from mainstream attention, eventually causing it to be ignored.
This post has a battle-of-narratives framing and uses language to support Conjecture's narrative but doesn't actually argue why Conjecture's preferred narrative puts the world in a less risky position.
There is an inside-view and an outside-view reason why I argue the cooperative route is better. First, being cooperative is likely to make navigating AI risks and explosive growth smoother, and is less likely to lead to unexpected bad outcomes. Second is the unilateralist's curse: Conjecture's empirical views on the risks posed by AI and the difficulty of solving them via prosaic means are in the minority, probably even within the safety community. This minority actor shouldn't take unilateral action whose negative tail is disproportionately large according to the majority of reasonable people.
Part of me wants to let the Conjecture vision separate itself from the more cooperative side of the AI safety world, as it has already started doing, and let the cooperative side continue its efforts. I'm fairly optimistic about these efforts (scalable oversight, evals-informed governance, most empirical safety work happening at AGI labs). However, the unilateral action supported by Conjecture's vision works against the cooperative efforts. For example, a group affiliated with Conjecture ran ads, in a rather uncooperative way, opposing AI safety efforts they see as insufficiently ambitious. As things heat up, I expect the uncooperative strategy to become substantially riskier.
One call I'll make is for those pushing Conjecture's view to invest more into making sure they're right about the empirical pessimism that motivates their actions. Run empirical tests of your threat models and frequently talk to reasonable people with different views.
(Initial reactions that probably have some holes)
If we use ELK on the AI's output (i.e. extensions of Burns's Discovering Latent Knowledge paper, where you feed the AI's output back into the AI and look for features that the AI notices), and somehow ensure that the AI has not learned to produce output that fools its later self plus the ELK probe (this second step seems hard; one way you might do it is to ensure the features picked up by the ELK probe are actually the same ones used by, or causally responsible for, the decision in the first place), then it initially seems to me that this would solve deep deception.
Then any plan that looks to the AI like deception could be caught regardless of which computations led to it (it seems like we can make ELK cheap enough to run at inference time; rough sketch below).
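To make the shape of the check concrete, here is a minimal sketch of an output-recycling probe; `model.get_activations` and `probe_direction` are hypothetical stand-ins, and finding a probe with the causal property described above is the hard part this sketch doesn't address:

```python
import numpy as np

def flags_deception(model, probe_direction, proposed_output, threshold=0.0):
    """Feed the model's own proposed output back in and score it with an
    ELK-style linear probe before the output is allowed to execute."""
    acts = model.get_activations(proposed_output)   # stand-in for activation access
    score = float(np.dot(acts, probe_direction))    # projection onto the probe direction
    return score > threshold                        # flag the plan as deceptive
```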
It's a tool for interacting with language models: https://generative.ink/posts/loom-interface-to-the-multiverse/