Posts

Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs 2024-07-08T22:24:38.441Z
Takeaways from a Mechanistic Interpretability project on “Forbidden Facts” 2023-12-15T11:05:23.256Z
Update on Harvard AI Safety Team and MIT AI Alignment 2022-12-02T00:56:45.596Z

Comments

Comment by kaivu (kaivalya-hariharan) on Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs · 2024-07-20T09:30:46.779Z · LW · GW

Thanks for bringing this up: this was a pretty confusing part of the evaluation.

Trying to use the random seed to inform the choice of word pairs was the intended LLM behavior: the model was supposed to use the random seed to select two random words (and it could optionally use the seed to throw a biased coin as well).

You’re right that the easiest way to solve this problem, as enforced in our grading, is to output an ordered pair without using the seed.

The main reason we didn’t enforce this very strictly in our grading is that we didn’t expect (and in fact empirically did not observe) LLMs actually hard-coding a single pair across all seeds. Given that, it would have been somewhat computationally expensive to explicitly penalize this in grading.