Mechanistic Anomaly Detection Research Update
post by Nora Belrose (nora-belrose), David Johnston (david-johnston) · 2024-08-06T10:33:26.031Z
This is a link post for https://blog.eleuther.ai/mad_research_update/
Over the last few months, the EleutherAI interpretability team pioneered novel mechanistic methods, based on Neel Nanda's attribution patching technique, for detecting anomalous behavior in language models. Unfortunately, none of these methods consistently outperform non-mechanistic baselines that look only at activations.
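For readers unfamiliar with attribution patching, here is a minimal sketch of the underlying primitive in PyTorch. It is illustrative only: the function name and hook setup are our own, not the post's implementation (see the cupbearer repo linked below for the real code). The idea is a first-order Taylor approximation of activation patching: the effect of replacing an activation is estimated as (patched activation − clean activation) times the gradient of a metric with respect to that activation, at the cost of one forward and one backward pass.

```python
import torch

def attribution_patching_effect(model, layer, clean_input, patched_act, metric_fn):
    """First-order estimate, per activation element, of how replacing `layer`'s
    clean activation with `patched_act` would change `metric_fn(model(...))`:
    effect ≈ (patched_act - clean_act) * d(metric)/d(clean_act).
    Assumes `layer` outputs a single tensor."""
    cache = {}

    def hook(module, inputs, output):
        output.retain_grad()      # keep the gradient on this intermediate tensor
        cache["act"] = output

    handle = layer.register_forward_hook(hook)
    try:
        metric = metric_fn(model(clean_input))
        metric.backward()         # populates cache["act"].grad
    finally:
        handle.remove()

    clean_act = cache["act"]
    return (patched_act - clean_act.detach()) * clean_act.grad
```

Summing (or otherwise featurizing) these per-element effects yields cheap, mechanistic features that a downstream anomaly detector can score, which is roughly the role attribution patching plays in the methods above.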
We find that methods which evaluate entire batches of test data, rather than scoring test points one at a time, achieve better anomaly detection performance. Performance is very good on many, but not all, of the tasks we looked at.
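As a concrete point of comparison, here is a minimal sketch (our own illustrative code, assuming PyTorch activations; not the post's exact setup) of an activation-only Mahalanobis-distance baseline, together with the kind of batch-level aggregation described above: fit a Gaussian to activations from trusted data, then score test activations either point-wise or averaged over a batch.

```python
import torch

def fit_gaussian(trusted_acts):
    """Fit a mean and (regularized) precision matrix to trusted activations (n, d)."""
    mean = trusted_acts.mean(dim=0)
    centered = trusted_acts - mean
    cov = centered.T @ centered / (trusted_acts.shape[0] - 1)
    # A small ridge term keeps the covariance invertible even when n < d.
    precision = torch.linalg.inv(cov + 1e-3 * torch.eye(cov.shape[1]))
    return mean, precision

def mahalanobis_scores(test_acts, mean, precision):
    """Point-wise anomaly scores; higher means more anomalous."""
    diff = test_acts - mean
    return torch.einsum("nd,de,ne->n", diff, precision, diff)

def batch_score(test_acts, mean, precision):
    """Batch-level score: aggregate point-wise scores over the whole test batch."""
    return mahalanobis_scores(test_acts, mean, precision).mean()
```

Intuitively, aggregating over a batch can surface distribution shifts that are too weak to detect from any single point.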
We also find that it is relatively easy to detect adversarial examples in image classifiers with off-the-shelf techniques, although we did not test whether our anomaly detectors are themselves adversarially robust.
Thanks to @David Johnston and Arkajyoti Chakraborty for all their hard work on this project, as well as @Erik Jenner for fruitful discussion, ideas, and code!
Code: https://github.com/EleutherAI/cupbearer/tree/attribution_detector