Posts
Can SAE steering reveal sandbagging?
2025-04-15T12:33:41.264Z
Hanoi – ACX Meetups Everywhere Spring 2025
2025-03-25T23:50:05.580Z
Shallow review of technical AI safety, 2024
2024-12-29T12:01:14.724Z
Hanoi Vietnam - ACX Meetups Everywhere Fall 2024
2024-08-29T18:35:52.315Z
Results from the AI x Democracy Research Sprint
2024-06-14T16:40:47.538Z
Hanoi – ACX Meetups Everywhere Spring 2024
2024-03-30T23:38:19.604Z
Comments
Comment by jordine on Can SAE steering reveal sandbagging? · 2025-04-16T11:21:56.402Z
Refusals were mostly 1-2%, so ignoring them doesn't change the results significantly. Ignoring gibberish does change the results, but since we are measuring correct answers, this shouldn't matter.
Comment by jordine on Shallow review of technical AI safety, 2024 · 2025-01-02T05:02:53.131Z
fixed! edited hyperlink.
Comment by jordine on Shallow review of technical AI safety, 2024 · 2024-12-30T04:30:56.007Z
edited, thanks for catching this!