Comments

Comment by wnx on Introducing Alignment Stress-Testing at Anthropic · 2024-01-22T10:26:53.760Z · LW · GW

Alignment approaches at different abstraction levels (e.g., macro-level interpretability, scaffolding/module-level AI system safety, system-theoretic process analysis for safety) are something I have been hoping to see more of. I am thrilled by this meta-level red-teaming work and excited to see the announcement of the new team.

Comment by wnx on Shallow review of live agendas in alignment & safety · 2023-12-08T13:54:42.591Z · LW · GW

Hey, great stuff -- thank you for sharing! I found this especially useful as somebody who has been "out" of alignment for six months and is looking to set up a new research agenda.