Posts
Apply to the Constellation Visiting Researcher Program and Astra Fellowship, in Berkeley this Winter
2023-10-26T03:07:34.118Z
Causal scrubbing: results on induction heads
2022-12-03T00:59:18.327Z
Causal scrubbing: results on a paren balance checker
2022-12-03T00:59:08.078Z
Causal scrubbing: Appendix
2022-12-03T00:58:45.850Z
Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]
2022-12-03T00:58:36.973Z
Apply to the Redwood Research Mechanistic Interpretability Experiment (REMIX), a research program in Berkeley
2022-10-27T01:32:44.750Z
High-stakes alignment via adversarial training [Redwood Research report]
2022-05-05T00:59:18.848Z
We're Redwood Research, we do applied alignment research, AMA
2021-10-06T05:51:59.161Z
Comments
Comment by Nate Thomas (nate-thomas) on Express interest in an "FHI of the West" · 2024-04-30T17:38:07.655Z
To anyone reading this who wants to work on or discuss FHI-flavored work: Consider applying to Constellation's programs (the deadline for some of them is today!), which include salaried positions for researchers.
Comment by Nate Thomas (nate-thomas) on Apply to the Constellation Visiting Researcher Program and Astra Fellowship, in Berkeley this Winter · 2023-10-26T14:50:00.131Z
Thanks, Neel! It should be fixed now.
Comment by Nate Thomas (nate-thomas) on Takeaways from our robust injury classifier project [Redwood Research] · 2022-09-17T15:04:15.614Z
Note that it's unsurprising that a different model classifies this example correctly, because the failure was generated by attacking the particular model we were working with. The relevant question is: "Given a model, how easy is it to find a failure by attacking that model using our rewriting tools?"