Posts

Can SAE steering reveal sandbagging? 2025-04-15T12:33:41.264Z
Hanoi – ACX Meetups Everywhere Spring 2025 2025-03-25T23:50:05.580Z
Shallow review of technical AI safety, 2024 2024-12-29T12:01:14.724Z
Hanoi Vietnam - ACX Meetups Everywhere Fall 2024 2024-08-29T18:35:52.315Z
Results from the AI x Democracy Research Sprint 2024-06-14T16:40:47.538Z
Hanoi – ACX Meetups Everywhere Spring 2024 2024-03-30T23:38:19.604Z

Comments

Comment by jordine on Can SAE steering reveal sandbagging? · 2025-04-16T11:21:56.402Z · LW · GW

Refusals were mostly 1-2%, so ignoring them doesn't change the results significantly. Ignoring gibberish does change the results, but since we are measuring correct answers, this shouldn't matter.

Comment by jordine on Shallow review of technical AI safety, 2024 · 2025-01-02T05:02:53.131Z · LW · GW

Fixed! Edited the hyperlink.

Comment by jordine on Shallow review of technical AI safety, 2024 · 2024-12-30T04:30:56.007Z · LW · GW

edited, thanks for catching this!