Posts
A “Scaling Monosemanticity” Explainer
2024-06-29T17:50:49.855Z
Take SCIFs, it’s dangerous to go alone
2024-05-01T08:02:38.067Z
Comments
Comment by latterframe on Open Thread Spring 2024 · 2024-04-29T18:26:49.686Z
Hey everyone! I work on quantifying and demonstrating AI cybersecurity impacts at Palisade Research with @Jeffrey Ladish.
We have a bunch of exciting work in the pipeline, including:
- demos of well-known safety issues like agent jailbreaks or voice cloning
- replications of prior work on self-replication and hacking capabilities
- modelling of the above capabilities' economic impact
- novel evaluations and tools
Most of my posts here will probably detail technical research or announce new evaluation benchmarks and tools. I also think a lot about responsible release, offence/defence balance, and general governance to flesh out my work's theory of change; some of that might also slip in.
See you around 🙃