rusheb

Posts

Ablations for “Frontier Models are Capable of In-context Scheming” 2024-12-17T23:58:19.222Z

Frontier Models are Capable of In-context Scheming 2024-12-05T22:11:17.320Z

Apollo Research 1-year update 2024-05-29T17:44:32.484Z

A starter guide for evals 2024-01-08T18:24:23.913Z

Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation 2023-11-07T17:59:36.857Z

Understanding mesa-optimization using toy models 2023-05-07T17:00:52.620Z

Comments

Comment by rusheb on Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] · 2023-01-18T13:29:46.174Z · LW · GW

“If your hypothesis predicts that model performance will be preserved if you swap the input to any other input which has a particular property, but no other inputs in the dataset have that property, causal scrubbing can’t test your hypothesis.”

Would it be possible to make interventions which we expect not to preserve the model's behaviour, and assert that the behaviour does in fact change?
