Posts

Apply to the Constellation Visiting Researcher Program and Astra Fellowship, in Berkeley this Winter 2023-10-26T03:07:34.118Z
Causal scrubbing: results on induction heads 2022-12-03T00:59:18.327Z
Causal scrubbing: results on a paren balance checker 2022-12-03T00:59:08.078Z
Causal scrubbing: Appendix 2022-12-03T00:58:45.850Z
Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] 2022-12-03T00:58:36.973Z
Apply to the Redwood Research Mechanistic Interpretability Experiment (REMIX), a research program in Berkeley 2022-10-27T01:32:44.750Z
High-stakes alignment via adversarial training [Redwood Research report] 2022-05-05T00:59:18.848Z
We're Redwood Research, we do applied alignment research, AMA 2021-10-06T05:51:59.161Z

Comments

Comment by Nate Thomas (nate-thomas) on Apply to the Constellation Visiting Researcher Program and Astra Fellowship, in Berkeley this Winter · 2023-10-26T14:50:00.131Z

Thanks, Neel! It should be fixed now.

Comment by Nate Thomas (nate-thomas) on Takeaways from our robust injury classifier project [Redwood Research] · 2022-09-17T15:04:15.614Z

Note that it's unsurprising that a different model classifies this correctly, because the failure was generated from an attack on the particular model we were working with. The relevant question is "given a model, how easy is it to find a failure by attacking that model using our rewriting tools?"
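
To make the point concrete, here is a minimal, purely illustrative sketch (not Redwood's actual tooling; the classifier and helper names are hypothetical stand-ins): an attack is optimized against one classifier's particular blind spots, so a second classifier with different blind spots can still catch the resulting failure, and the meaningful metric is how hard it is to find a failure against the model you are attacking.

```python
# Illustrative sketch only. "Failure" here means an input that is injurious
# (per a ground-truth judgment) but that the attacked classifier fails to flag.
import random
from typing import Callable, List, Optional


def is_injurious(text: str) -> bool:
    # Hypothetical ground-truth label; in the real project this is a human judgment.
    return "harm" in text.lower() or "injury" in text.lower()


def find_failure(
    classifier: Callable[[str], bool],  # returns True iff the text is flagged
    seed: str,
    candidate_edits: List[str],
    max_tries: int = 1000,
) -> Optional[str]:
    """Randomly rewrite `seed` until we find an injurious input the classifier misses."""
    for _ in range(max_tries):
        attack = seed + " " + random.choice(candidate_edits)
        if is_injurious(attack) and not classifier(attack):
            return attack  # failure found for *this* classifier
    return None


# Two toy classifiers with different blind spots, standing in for different models.
def classifier_a(text: str) -> bool:
    return "injury" in text.lower()


def classifier_b(text: str) -> bool:
    return "injury" in text.lower() or "harm" in text.lower()


if __name__ == "__main__":
    edits = ["he was harmed badly", "nothing happened", "she suffered an injury"]
    failure = find_failure(classifier_a, "Then", edits)
    if failure is not None:
        # The failure was optimized against classifier A; B may still flag it.
        print("Failure found for A:", failure)
        print("Does B also miss it?", not classifier_b(failure))
```

In this toy setup the attack exploits A's narrower keyword check, so B typically still flags the same text; that is why transfer to a different model is not the interesting measurement, whereas per-model attack difficulty is.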