Eliciting bad contexts

post by Geoffrey Irving, Joseph Bloom (Jbloom), Tomek Korbak (tomek-korbak) · 2025-01-24T10:39:39.358Z · LW · GW · 2 comments

Contents

  Joseph’s thoughts on white box methods
  Acknowledgements
2 comments

Say an LLM agent behaves innocuously in some context A, but in some sense “knows” that there is some related context B in which it would have behaved maliciously (inserted a backdoor in code, ignored a security bug, lied, etc.). For example, in the recent alignment faking paper Claude Opus chooses to say harmful things so that in future deployment contexts it can avoid saying harmful things. One can imagine having a method for “eliciting bad contexts” which can produce B whenever we have A, and thus surface the bad behaviour before it actually occurs.

This seems hard to do in general in a way that will scale to very strong models. But the problem also feels frustratingly concrete: it's just “find a string that, when run through the same model, produces a bad result”. By assumption the model knows about this string, in the sense that if it were honest it would tell us, but it may be choosing to behave innocently in a way that prepares for different behaviour later. Why can't we find the string?
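
Framed that way, a naive black-box version of the search is easy to write down. Here is a minimal sketch (the helpers `propose_variants` and `judge` are hypothetical stand-ins for a way of generating related contexts and a way of confirming a bad output, not anything from existing work):

```python
def elicit_bad_context(model, judge, propose_variants, context_a, n_candidates=1000):
    """Search for a related context B on which the same model produces a confirmed-bad output."""
    for _ in range(n_candidates):
        context_b = propose_variants(context_a)   # e.g. paraphrases or counterfactual edits of A
        output = model.generate(context_b)        # run the *same* model on the candidate context
        if judge(context_b, output):              # assumed method for confirming a bad context
            return context_b, output
    return None                                    # brute-force enumeration won't scale to strong models
```

The whole difficulty is in the proposal step: naive enumeration of related contexts won't find B for a strong model, which is why white-box leverage (discussed below) seems necessary.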

Some applications if we could do this:

  1. Collusion triggers in AI control: What's an input that causes an untrusted monitor to fail to detect an attack? Here “eliciting bad contexts” can mean “find Schelling points between colluding agents [AF · GW]” or “find an action that looks innocent but covertly instructs the monitor to misclassify an attack”.
  2. Finding counterarguments in debate: Say we’re training agents to play debate, but due to exploration hacking [AF · GW] the agents systematically avoid a particular type of counterargument which would change who wins. These can be considered bad contexts because the continuation of the game looks bad for the current winner, so we can find them if we can elicit bad contexts. If it takes a lot of compute to elicit bad contexts, we can spend this compute occasionally and distil it into policies and classifiers to recover practical training.

This observation isn’t very novel: arguably it sneaks in “solve inner alignment”. But, similar to the goal misgeneralisation reframing of inner alignment, it’s useful to have multiple concrete restatements of hard problems. Note that eliciting bad contexts is potentially easier than eliciting latent knowledge [AF · GW], as the former assumes an existing method for confirming a bad context.

How would one solve this? Presumably you would need some whitebox method, but you can confirm success black-box by looking at model outputs. ARC’s low probability estimation work is related, and we (in particular Joseph) think there are several areas of mechanistic interpretability which are promising. None of these approaches work yet, alas.

Two final notes. First, it could conceivably be that constructing an entire bad context is harder for a model than knowing that some bad context exists. But we think in practice this isn’t most of the problem: we would reduce a lot of risk even if we could only solve the cases where this gap doesn’t bite.

Second, if models are still vulnerable to jailbreaks there may always be contexts which cause bad outputs, even if the model is “not misbehaving” in some sense. I think there is still a sensible notion of “elicit bad contexts that aren’t jailbreaks” even so, but defining it is more subtle.

Joseph’s thoughts on white box methods

There are some promising investigations which suggest white box methods might be useful for eliciting bad contexts. I think of this as the task of leveraging access to the model’s activations / weights to maximise the harm of a model’s output subject to making small changes to a given input.

I really liked Andrew Mack’s recent blog post [LW · GW] demonstrating an unsupervised approach for mechanistically eliciting latent behaviours based on a single prompt. Though it doesn’t directly hand us contexts that elicit the behaviour, it might enable us to locate latent representations which we can then use to find those contexts. We can simply sample activations over large numbers of inputs and use them to find natural inputs which trigger the latents located by Andrew’s method.
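
For concreteness, a rough sketch of that corpus-scan step might look like the following (assuming a Hugging Face-style model and a latent direction already found by an unsupervised method such as Andrew's; none of this code is from his post):

```python
import torch

def top_activating_inputs(model, tokenizer, corpus, direction, layer, k=20):
    """Rank natural-language inputs by how strongly their activations align with `direction`."""
    direction = direction / direction.norm()           # unit vector for the latent of interest
    scored = []
    for text in corpus:
        tokens = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            out = model(**tokens, output_hidden_states=True)
        acts = out.hidden_states[layer][0]              # (seq_len, d_model) residual-stream activations
        scored.append(((acts @ direction).max().item(), text))  # strongest token-level projection
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]             # most strongly activating natural inputs
```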

Unfortunately, those inputs might look very dissimilar to our original context. We may instead need to use methods that directly optimise an input to maximally activate a latent representation. Luckily, there is already some evidence in the literature that this is possible, such as Fluent Dreaming in Language Models, which essentially works as feature visualisation for language models. The method works on neurons but could easily be extended to SAE latents or steering vectors, and might be amenable to initialisation around some initial context.
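
A much-simplified hill-climbing sketch of that idea follows (the real Fluent Dreaming method uses gradient-guided token swaps and a more careful fluency objective; `latent_score` here is a hypothetical callable, e.g. an SAE latent's activation or a projection onto a steering vector):

```python
import random
import torch

def optimise_prompt(model, tokenizer, latent_score, init_text, lam=0.1, steps=200):
    """Greedily mutate tokens of `init_text` to maximise latent activation plus fluency."""
    token_ids = tokenizer(init_text, return_tensors="pt").input_ids[0]

    def score(ids):
        batch = ids.unsqueeze(0)
        with torch.no_grad():
            out = model(batch, labels=batch, output_hidden_states=True)
        fluency = -out.loss.item()                      # negative LM loss: higher = more natural text
        return latent_score(out.hidden_states) + lam * fluency

    best = score(token_ids)
    for _ in range(steps):
        pos = random.randrange(len(token_ids))          # pick a position to mutate
        candidate = token_ids.clone()
        candidate[pos] = random.randrange(model.config.vocab_size)  # try a random token swap
        new = score(candidate)
        if new > best:                                   # keep swaps that improve the combined objective
            token_ids, best = candidate, new
    return tokenizer.decode(token_ids)
```

Starting the optimisation from the original context (here `init_text`) is what keeps the resulting prompt close to the situation we actually care about.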

While this solution (combining feature visualisation in language models with deep causal transcoders) might not get us all the way there, I think there are many different approaches that could be tried here, and probably more related work in the literature that hasn't been sufficiently explored in this context. Ultimately, leveraging a model's internals to find bad contexts seems like a powerful and underexplored area.

More generally, I think using interpretability to better understand behaviours like alignment faking may be useful, and could motivate additional methods for finding adversarial inputs [? · GW].

Acknowledgements

Thank you to Mary Phuong and Benjamin Hilton for feedback on this post.

2 comments


comment by Martín Soto (martinsq) · 2025-01-24T13:27:46.634Z · LW(p) · GW(p)

See our recent work [LW · GW] (especially the section on backdoors), which opens the door to directly asking the model, although there are obstacles like the Reversal Curse and it's unclear if it can be made to scale.

comment by mikes · 2025-01-24T15:12:21.582Z · LW(p) · GW(p)

After we wrote Fluent Dreaming, we wrote Fluent Student-Teacher Redteaming for white-box bad-input-finding!

https://arxiv.org/pdf/2407.17447

In which we develop a "distillation attack" technique to target a copy of the model fine-tuned to be bad/evil, which is a much more effective target than forcing specific string outputs.