Posts
Ablations for “Frontier Models are Capable of In-context Scheming”
2024-12-17T23:58:19.222Z
Frontier Models are Capable of In-context Scheming
2024-12-05T22:11:17.320Z
Apollo Research 1-year update
2024-05-29T17:44:32.484Z
A starter guide for evals
2024-01-08T18:24:23.913Z
Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation
2023-11-07T17:59:36.857Z
Understanding mesa-optimization using toy models
2023-05-07T17:00:52.620Z
Comments
Comment by rusheb on Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] · 2023-01-18T13:29:46.174Z
If your hypothesis predicts that model performance will be preserved if you swap the input to any other input which has a particular property, but no other inputs in the dataset have that property, causal scrubbing can’t test your hypothesis.
Would it be possible to make interventions which we expect not to preserve the model's behaviour, and assert that the behaviour does in fact change?
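A rough sketch of what such a check might look like, as a toy input-level analogue rather than causal scrubbing itself (which resamples internal activations). Everything here is hypothetical and chosen for illustration: the toy `model`, the property `prop`, and the helpers `same_property` and `different_property`. The idea is to pair the usual "swap to an input with the same property, expect behaviour preserved" test with a negative control: swap to an input that lacks the property and assert that the behaviour does change.

```python
import random

def model(x):
    # Toy stand-in for a model: output depends only on the sign of x[0].
    return 1.0 if x[0] > 0 else 0.0

def prop(x):
    # Hypothesised property: the only thing that matters is whether x[0] > 0.
    return x[0] > 0

def mean_abs_change(dataset, pick_swap):
    # Average change in model output when each input is replaced by its swap.
    return sum(abs(model(x) - model(pick_swap(x))) for x in dataset) / len(dataset)

rng = random.Random(0)
dataset = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(200)]

def same_property(x):
    # Swap that the hypothesis says should preserve behaviour.
    return rng.choice([y for y in dataset if prop(y) == prop(x)])

def different_property(x):
    # Negative control: swap that the hypothesis says should change behaviour.
    return rng.choice([y for y in dataset if prop(y) != prop(x)])

preserved = mean_abs_change(dataset, same_property)
broken = mean_abs_change(dataset, different_property)

# The hypothesis passes only if behaviour is kept under the first swap
# and clearly changes under the second.
assert preserved < broken
```

Note that the negative control has the same limitation as the quoted case: it needs the dataset to contain at least one input without the property to swap in.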