Nicholas Goldowsky-Dill's Shortform
post by Nicholas Goldowsky-Dill (nicholas-goldowsky-dill) · 2024-11-06T12:37:10.456Z · LW · GW · 2 comments
comment by Nicholas Goldowsky-Dill (nicholas-goldowsky-dill) · 2024-11-06T12:37:10.832Z · LW(p) · GW(p)
One question I have about the Mechanistic Interpretability safety case sketch from Roger Grosse: How do you actually find features of high concern?
Features of High Concern are things like "intent to carry out a harmful plan". By definition, they very rarely activate on benign deployment outputs. This must be the case, as "if the features [of high concern] activate more than a handful of times in deployment (internal or external), we discontinue the model and roll back to a weaker model." [M1.1]
They propose to find these features [S1.1] by performing dictionary learning of some form (see the sketch after this list), and then some combination of:
- autointerp on the learned features – which currently requires examples of the feature activating. I guess you can also look at attribution of bad outcomes to features?
- looking at what features activate "when the model is instructed or perturbed to be misaligned."
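To make the dictionary-learning step concrete, here's a minimal sketch, assuming you've already cached residual-stream activations from the model as a tensor `acts`. The plain L1-penalized sparse autoencoder and all hyperparameters are illustrative stand-ins rather than the specific method the safety case assumes; the `top_activating_tokens` helper shows the kind of max-activating examples autointerp needs, and hence why a feature that never fires on your training data is a problem.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: activations -> sparse non-negative feature coefficients -> reconstruction."""
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model, bias=False)

    def forward(self, acts):
        feats = torch.relu(self.encoder(acts))  # feature activations
        return feats, self.decoder(feats)       # (features, reconstruction)

def train_sae(acts, d_dict=16384, l1_coeff=1e-3, steps=10_000, lr=1e-4, batch_size=4096):
    """acts: [n_tokens, d_model] activations cached from some layer of the model."""
    sae = SparseAutoencoder(acts.shape[-1], d_dict)
    opt = torch.optim.Adam(sae.parameters(), lr=lr)
    for _ in range(steps):
        batch = acts[torch.randint(len(acts), (batch_size,))]
        feats, recon = sae(batch)
        loss = (recon - batch).pow(2).mean() + l1_coeff * feats.abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return sae

def top_activating_tokens(sae, acts, feature_idx, k=20):
    """Indices (into the cached token stream) where this feature fires hardest,
    i.e. the raw material autointerp needs. If the feature never fires on this
    data, there is nothing here to interpret."""
    with torch.no_grad():
        feats, _ = sae(acts)
    return torch.topk(feats[:, feature_idx], k).indices
```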
Generally dictionary learning only learns features that actually occur on some inputs. So you need to do something outside of the normal deployment distribution. Some ideas (many alluded to in the document):
- Learn the features on the base model (or an ‘agentic-only’ version of the model) and then hope they transfer to the safety-trained model
- Learn the features by gathering activations on off-policy transcripts of harmful behavior. This is what Fabien Roger does in coup-probes [AF · GW] (see the probe sketch after this list)
- Prompting the model to behave badly. This probably can’t be sufficient for features of high concern – you don’t want your model to take catastrophic actions even when instructed to do so! Maybe it’s okay if jailbreaks are normally caught by a separate mechanism, which you disable for the feature-learning step?
- Perturbing the model to behave badly. For example:
- Something in the genre of Mack and Turner's mechanistically eliciting latent behaviors [AF · GW] – e.g. this works to discover sleeper agents.
- Maybe steering for related features? It’s not clear to me how you do dictionary learning on a steered model and then apply those features to the unsteered model. But something in this genre seems plausible.
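As a concrete sketch of the off-policy-transcripts idea (the coup-probes approach in the second bullet): fit a simple linear probe on activations from transcripts where the harmful behavior was written in rather than sampled from the model. The tensors and the logistic-regression setup here are illustrative assumptions; the same cached activations could instead feed dictionary learning.

```python
import torch
import torch.nn as nn

def train_harm_probe(benign_acts, harmful_acts, epochs=200, lr=1e-3):
    """Logistic-regression probe on cached activations.

    benign_acts:  [n_benign, d_model] activations from ordinary transcripts.
    harmful_acts: [n_harmful, d_model] activations from off-policy transcripts
                  of harmful behavior (written in, not sampled from the model)."""
    X = torch.cat([benign_acts, harmful_acts])
    y = torch.cat([torch.zeros(len(benign_acts)), torch.ones(len(harmful_acts))])
    probe = nn.Linear(X.shape[-1], 1)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(probe(X).squeeze(-1), y).backward()
        opt.step()
    return probe

def flags_harm(probe, acts, threshold=0.0):
    """Per-position flag at deployment time. The open question is transfer: the probe
    was fit on off-policy text, but it needs to fire on the model's own bad outputs."""
    with torch.no_grad():
        return probe(acts).squeeze(-1) > threshold
```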
The situation is easier for features of medium concern, which are only problematic in certain contexts. Ideally you can learn the features in benign contexts and they’ll transfer to catastrophic contexts. For example, it’s okay if your model lies when roleplaying as a shady car salesman, when playing mafia, or in other low-stakes scenarios. If you can use these to learn a “deception” feature, it might also activate when the model assures you that some backdoored code is safe. You’d still want to validate this transfer empirically!
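As a sketch of what validating that transfer empirically could look like: measure how often the candidate feature fires on held-out sets. This reuses the hypothetical SAE from the sketch above, and the evaluation-set labels are placeholders, not real datasets.

```python
import torch

def feature_fire_rate(sae, acts, feature_idx, threshold=0.0):
    """Fraction of positions on which the candidate feature fires above the threshold."""
    with torch.no_grad():
        feats, _ = sae(acts)
    return (feats[:, feature_idx] > threshold).float().mean().item()

def check_transfer(sae, feature_idx, eval_sets):
    """eval_sets: dict mapping a label to cached activations, e.g.
    {"honest": ..., "lowstakes_lies": ..., "backdoor_assurance": ...}.
    The hope being tested: a feature learned on low-stakes lying also fires when a
    model organism assures you backdoored code is safe, and stays quiet on honest text."""
    return {name: feature_fire_rate(sae, acts, feature_idx)
            for name, acts in eval_sets.items()}
```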
I think it would be valuable to think about how this could work in specific scenarios. That is, specify different combinations of
- Feature of high concern
- High level description of a misaligned model & its behavior
- Catastrophic action (that would involve the feature of high concern)
Then try to answer the question of how you'd learn the feature such that it only activates in deployment when the model is behaving catastrophically. Do this for examples like
- Scheming model intentionally inserting a backdoor into the codebase of the inference server
- Model trained with poisoned data that behaves badly in response to some particular trigger
- Oversight model that is colluding with the model it monitors, and doesn't flag an unsafe action by the monitored model
All of these strategies rely on the hope that the features transfer to the on-policy deployment distribution. It seems possible to gather enough evidence to think the features likely transfer, but difficult to gather enough to be highly confident. As they mention, having multiple realistic model organisms seems necessary for validation. Where do you get model organisms? Maybe:
- Model organisms of data poisoning – seem doable (e.g. sleeper agents)
- Model organisms of scheming – seems difficult to make highly realistic but worth trying
- Catch the first model that schemes against you and use it as a model organism – a risky game to be playing! But this is a good reason to pursue control strategies.
↑ comment by RogerGrosse · 2024-11-09T21:58:09.250Z · LW(p) · GW(p)
Great questions. Finding features of high concern is indeed something we highlighted as an open problem, and it's not obvious whether this will be achievable. The avenues you list are plausible ones. It might also turn out that it is very hard to find individual features which are sufficiently indicative of malign behavior, but that one could do so for circuits or specify some other rule involving combinations of features. E.g., the criterion might involve something like "feature activates which indicates the intent to carry out a harmful plan, and features indicating attribution of the thoughts to a fictional character are not active." (As discussed in Limitations, one could apply a variant of this safety case based on circuits rather than features, and the logic would be similar.)
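(For illustration only, a rule of that kind might look like the following sketch; the feature indices, the threshold, and the assumption that both features come from the same dictionary are hypothetical.)

```python
def flag_for_rollback(feature_acts, harm_idx, fiction_idx, tau=0.0):
    """feature_acts: [n_positions, d_dict] feature activations at some layer.
    Flag positions where the 'intent to carry out a harmful plan' feature is active
    and no 'thoughts attributed to a fictional character' feature is active."""
    harmful = feature_acts[:, harm_idx] > tau
    fictional = feature_acts[:, fiction_idx] > tau
    return harmful & ~fictional
```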
We're agnostic about where the model organisms would come from, but I like your suggestions.