Posts

[Linkpost] Play with SAEs on Llama 3 2024-09-25T22:35:44.824Z
[Paper] All's Fair In Love And Love: Copy Suppression in GPT-2 Small 2023-10-13T18:32:02.376Z

Comments

Comment by Tom McGrath on "Safety as a Scientific Pursuit" (2024) · 2024-01-24T19:30:36.089Z · LW · GW

Very much appreciate the link post - I’d been trying to write a summary/contextualisation for LW and this is a much better one than I’d come up with.

I’d be very grateful for the LW community’s thoughts (especially any pushback). I expect this will be the source of the strongest counterarguments.

Comment by Tom McGrath on "Safety as a Scientific Pursuit" (2024) · 2024-01-24T19:28:04.290Z · LW · GW

Thanks! I really like inductive vs deductive and would probably have used them if I’d thought of them.

Comment by Tom McGrath on "Acquisition of Chess Knowledge in AlphaZero": probing AZ over time · 2021-11-19T16:06:35.373Z · LW · GW

I'm one of the authors on this paper - happy to answer any questions/discuss if anyone is interested.

Comment by Tom McGrath on "Acquisition of Chess Knowledge in AlphaZero": probing AZ over time · 2021-11-19T16:06:06.048Z · LW · GW

Thanks for the summary! Your first bullet point was my motivation for doing this. I think it's important to test out interpretability ideas in more challenging domains.

We didn't really do much interpretability in this paper; it's more meta-interpretability in a sense (i.e. studying whether interpretability should in principle be possible). I'd say Section 4 is worth a look, especially Section 4.5, which covers fundamental and practical challenges to probing. Section 7 has some NMF analysis, and we open-sourced the NMF factors, which you might find interesting.