Posts

Apply to the Constellation Visiting Researcher Program and Astra Fellowship, in Berkeley this Winter 2023-10-26T03:07:34.118Z
Causal scrubbing: results on induction heads 2022-12-03T00:59:18.327Z
Causal scrubbing: results on a paren balance checker 2022-12-03T00:59:08.078Z
Causal scrubbing: Appendix 2022-12-03T00:58:45.850Z
Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] 2022-12-03T00:58:36.973Z
Apply to the Redwood Research Mechanistic Interpretability Experiment (REMIX), a research program in Berkeley 2022-10-27T01:32:44.750Z
High-stakes alignment via adversarial training [Redwood Research report] 2022-05-05T00:59:18.848Z
We're Redwood Research, we do applied alignment research, AMA 2021-10-06T05:51:59.161Z

Comments

Comment by Nate Thomas (nate-thomas) on Apply to the Constellation Visiting Researcher Program and Astra Fellowship, in Berkeley this Winter · 2023-10-26T14:50:00.131Z

Thanks, Neel! It should be fixed now.

Comment by Nate Thomas (nate-thomas) on Takeaways from our robust injury classifier project [Redwood Research] · 2022-09-17T15:04:15.614Z

Note that it's unsurprising that a different model classifies this correctly, because the failure was generated from an attack on the particular model we were working with. The relevant question is "given a model, how easy is it to find a failure by attacking that model using our rewriting tools?"
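
To make the point concrete, here is a minimal, purely illustrative sketch (not Redwood's actual tooling; the classifier and helper names are hypothetical stand-ins): an attack is optimized against one classifier's particular blind spots, so a second classifier with different blind spots can still catch the resulting failure, and the meaningful metric is how hard it is to find a failure against the model you are attacking.

```python
# Illustrative sketch only. "Failure" here means an input that is injurious
# (per a ground-truth judgment) but that the attacked classifier fails to flag.
import random
from typing import Callable, List, Optional


def is_injurious(text: str) -> bool:
    # Hypothetical ground-truth label; in the real project this is a human judgment.
    return "harm" in text.lower() or "injury" in text.lower()


def find_failure(
    classifier: Callable[[str], bool],  # returns True iff the text is flagged
    seed: str,
    candidate_edits: List[str],
    max_tries: int = 1000,
) -> Optional[str]:
    """Randomly rewrite `seed` until we find an injurious input the classifier misses."""
    for _ in range(max_tries):
        attack = seed + " " + random.choice(candidate_edits)
        if is_injurious(attack) and not classifier(attack):
            return attack  # failure found for *this* classifier
    return None


# Two toy classifiers with different blind spots, standing in for different models.
def classifier_a(text: str) -> bool:
    return "injury" in text.lower()


def classifier_b(text: str) -> bool:
    return "injury" in text.lower() or "harm" in text.lower()


if __name__ == "__main__":
    edits = ["he was harmed badly", "nothing happened", "she suffered an injury"]
    failure = find_failure(classifier_a, "Then", edits)
    if failure is not None:
        # The failure was optimized against classifier A; B may still flag it.
        print("Failure found for A:", failure)
        print("Does B also miss it?", not classifier_b(failure))
```

In this toy setup the attack exploits A's narrower keyword check, so B typically still flags the same text; that is why transfer to a different model is not the interesting measurement, whereas per-model attack difficulty is.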