kieron-kretschmar

Posts

Experience Report - ML4Good AI Safety Bootcamp 2024-04-11T18:03:41.040Z

Comments

Comment by Kieron Kretschmar on Truth is Universal: Robust Detection of Lies in LLMs · 2024-08-02T09:40:11.212Z · LW · GW

The paper argues that there is one generalizing truth direction $t_{G}$, which corresponds to whether a statement is true, and one polarity-sensitive truth direction $t_{P}$, which corresponds to $\mathrm{XOR}(\text{is\_true}, \text{is\_negated})$ and is related to Sam Marks' work on LLMs representing XOR features. It further states that the truth directions for affirmative and negated statements are linear combinations of $t_{G}$ and $t_{P}$, just with different coefficients.

Is there evidence that $t_{G}$ is an actual, elementary feature used by the language model, and not a linear combination of other features? For example, I could imagine that $t_{G}$ is a linear combination of features such as $\mathrm{XOR}(\text{is\_true}, \text{is\_french})$, $\mathrm{AND}(\text{is\_true}, \text{is\_end\_of\_sentence})$, and so on.

Do you think we have reason to believe that $t_{G}$ is an elementary feature, and not a linear combination?

If the latter is the case, it seems to me that there is a high risk of the probe failing when the distribution changes (e.g. on French text in the example above), particularly with XOR features that change polarity.
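
To make the worry concrete, here is a minimal toy sketch in NumPy. Everything in it is an assumption for illustration, not the paper's setup: the activations are synthetic, the directions `d_xor` and `d_french` are hypothetical, and the difference-of-means probe is just a simple stand-in for whatever probe one would actually train. The toy model encodes $\mathrm{XOR}(\text{is\_true}, \text{is\_french})$ and $\text{is\_french}$, but has no standalone truth feature.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 16

# Hypothetical orthonormal feature directions (illustrative only).
basis = np.linalg.qr(rng.normal(size=(dim, dim)))[0]
d_xor, d_french = basis[:, 0], basis[:, 1]


def signed(flags):
    """Map boolean flags to -1.0 / +1.0."""
    return 2.0 * flags - 1.0


def activations(is_true, is_french, noise=0.1):
    """Toy activations encoding XOR(is_true, is_french) and is_french,
    but no elementary is_true feature (an assumption of this sketch)."""
    xor = np.logical_xor(is_true, is_french)
    x = np.outer(signed(xor), d_xor) + np.outer(signed(is_french), d_french)
    return x + noise * rng.normal(size=x.shape)


def fit_probe(x, y):
    """Difference-of-means probe with the threshold at the class midpoint."""
    mu_true, mu_false = x[y].mean(axis=0), x[~y].mean(axis=0)
    direction = mu_true - mu_false
    bias = -direction @ (mu_true + mu_false) / 2
    return direction, bias


def accuracy(probe, x, y):
    direction, bias = probe
    return float(np.mean((x @ direction + bias > 0) == y))


n = 2000
english = np.zeros(n, dtype=bool)
french = np.ones(n, dtype=bool)

# Train the probe on English statements only.
y_train = rng.random(n) < 0.5
probe = fit_probe(activations(y_train, english), y_train)

# Evaluate on fresh English statements and on French statements.
y_test = rng.random(n) < 0.5
acc_en = accuracy(probe, activations(y_test, english), y_test)
acc_fr = accuracy(probe, activations(y_test, french), y_test)
print(f"English accuracy: {acc_en:.2f}, French accuracy: {acc_fr:.2f}")
```

In this toy setting the probe looks like a clean truth direction on English text, because there $\mathrm{XOR}(\text{is\_true}, \text{is\_french})$ coincides with $\text{is\_true}$. On French text the XOR feature flips polarity, so the same probe is systematically wrong rather than merely noisy.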
