latterframe

Posts

A “Scaling Monosemanticity” Explainer · 2024-06-29T17:50:49.855Z

Take SCIFs, it’s dangerous to go alone · 2024-05-01T08:02:38.067Z

Comments

Comment by latterframe on Open Thread Spring 2024 · 2024-04-29T18:26:49.686Z

Hey everyone! I work on quantifying and demonstrating AI cybersecurity impacts at Palisade Research with @Jeffrey Ladish.

We have a bunch of exciting work in the pipeline, including:

demos of well-known safety issues like agent jailbreaks or voice cloning
replications of prior work on self-replication and hacking capabilities
modelling of the economic impact of the above capabilities
novel evaluations and tools

Most of my posts here will probably detail technical research or announce new evaluation benchmarks and tools. I also think a lot about responsible release, offence/defence balance, and general governance to flesh out my work's theory of change; some of that might also slip in.

See you around 🙃