lennart-buerger

Posts

Truth is Universal: Robust Detection of Lies in LLMs 2024-07-19T14:07:25.162Z

Comments

Comment by Lennart Buerger on Deception and Jailbreak Sequence: 1. Iterative Refinement Stages of Deception in LLMs · 2024-11-07T13:14:41.133Z · LW · GW

Really interesting work, especially the three stages you find. Regarding your question of whether your classifier will generalize to negated statements: this has been explored in our NeurIPS 2024 paper "Truth is Universal: Robust Detection of Lies in LLMs" (https://arxiv.org/abs/2407.12831). In fact, true and false negated statements separate along a different direction than affirmative statements (those without negation), so a classifier trained only on affirmative statements would not generalize to negated ones. The truthfulness representation found in this paper is also universal in the sense that it can be found in multiple LLMs, which is consistent with your findings :)
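To make the point about negation concrete, here is a toy sketch on synthetic data, not the paper's code or real LLM activations. It assumes activations of the form truth * t_g + truth * polarity * t_p + noise, where `t_g`, `t_p`, the dimension, the weight on `t_p`, and the noise level are all hypothetical choices made for illustration. Under that assumption, a difference-of-means direction fit on affirmative statements alone fails on negated statements, while one fit on both kinds does not.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # toy activation dimension (assumption)

# Hypothetical orthogonal unit directions (assumptions for this toy):
# t_g separates true from false for every statement,
# t_p separates them with a sign that flips under negation.
t_g = np.zeros(d)
t_g[0] = 1.0
t_p = np.zeros(d)
t_p[1] = 1.0

def make_activations(n, polarity):
    """Synthetic activations; polarity is +1 (affirmative) or -1 (negated)."""
    truth = rng.choice([-1.0, 1.0], size=n)  # +1 = true, -1 = false
    acts = (truth[:, None] * t_g
            + (truth * polarity)[:, None] * 2.0 * t_p  # weight 2.0 is arbitrary
            + 0.3 * rng.standard_normal((n, d)))
    return acts, truth

def mean_diff_direction(acts, truth):
    """Difference between the mean true and mean false activation."""
    return acts[truth > 0].mean(axis=0) - acts[truth < 0].mean(axis=0)

def accuracy(direction, acts, truth):
    """Classify by the sign of the projection onto the direction."""
    return float(np.mean(np.sign(acts @ direction) == truth))

aff_train = make_activations(500, +1)
neg_train = make_activations(500, -1)
aff_test = make_activations(500, +1)
neg_test = make_activations(500, -1)

# Direction learned from affirmative statements only.
dir_aff = mean_diff_direction(*aff_train)
acc_aff_on_aff = accuracy(dir_aff, *aff_test)
acc_aff_on_neg = accuracy(dir_aff, *neg_test)

# Direction learned from affirmative and negated statements together.
both_acts = np.vstack([aff_train[0], neg_train[0]])
both_truth = np.concatenate([aff_train[1], neg_train[1]])
dir_both = mean_diff_direction(both_acts, both_truth)
acc_both_on_neg = accuracy(dir_both, *neg_test)

print("affirmative-only direction, affirmative test:", acc_aff_on_aff)
print("affirmative-only direction, negated test:", acc_aff_on_neg)
print("mixed-training direction, negated test:", acc_both_on_neg)
```

In this toy, the polarity-sensitive component is deliberately made stronger than the general one, which is why the affirmative-only direction ends up anti-correlated with truth on negated statements. How large that component is in a real model is an empirical question that this sketch does not answer.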

Comment by Lennart Buerger on Truth is Universal: Robust Detection of Lies in LLMs · 2024-08-06T08:29:53.031Z · LW · GW

This is an excellent question! Indeed, we cannot rule out that $t_{G}$ is a linear combination or Boolean function of other features, since we are not able to investigate every possible distribution shift. However, we showed in the paper that $t_{G}$ generalizes robustly under several significant distribution shifts. Specifically, $t_{G}$ is learned from a limited training set consisting of simple affirmative and negated statements on a restricted number of topics, all ending with a "." token. Despite this limited training data, $t_{G}$ generalizes reasonably well to (i) unseen topics, (ii) unseen statement types, (iii) real-world scenarios, and (iv) other tokens like "!" or ".'". I think that the real-world scenarios (iii) are a particularly significant distribution shift. However, I agree with you that tests on many more distribution shifts are needed to be highly confident that $t_{G}$ is indeed an elementary feature (if something like that even exists).
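As an illustration of what testing a learned direction on an unseen topic can look like, here is a minimal sketch on synthetic data. It is not the paper's procedure or code: the data model (a topic-specific offset plus a general and a polarity-sensitive truth direction), the names `make_topic`, `fit_directions`, and `accuracy_on_unseen`, and all constants are assumptions for this example. It fits two directions by ordinary least squares on per-topic centered activations, then scores held-out topics whose offsets were never seen during fitting.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16  # toy activation dimension (assumption)

# Hypothetical ground-truth directions used only to generate the toy data.
t_g_true = np.zeros(d)
t_g_true[0] = 1.0
t_p_true = np.zeros(d)
t_p_true[1] = 1.0

def make_topic(n, polarity):
    """One synthetic topic with its own random offset; n must be even."""
    mu = rng.standard_normal(d)  # topic-specific mean activation
    truth = rng.permutation(np.repeat([1.0, -1.0], n // 2))  # balanced labels
    acts = (mu
            + truth[:, None] * t_g_true
            + (truth * polarity)[:, None] * t_p_true
            + 0.3 * rng.standard_normal((n, d)))
    return acts, truth, np.full(n, float(polarity))

def fit_directions(topics):
    """Least-squares fit of centered activations on [truth, truth * polarity]."""
    features, targets = [], []
    for acts, truth, pol in topics:
        targets.append(acts - acts.mean(axis=0))  # remove the topic offset
        features.append(np.column_stack([truth, truth * pol]))
    X = np.vstack(features)
    A = np.vstack(targets)
    coef, *_ = np.linalg.lstsq(X, A, rcond=None)  # coef has shape (2, d)
    return coef[0], coef[1]  # general direction, polarity-sensitive direction

def accuracy_on_unseen(direction, topic):
    """Center without labels, then classify by the sign of the projection."""
    acts, truth, _ = topic
    centered = acts - acts.mean(axis=0)
    return float(np.mean(np.sign(centered @ direction) == truth))

# Train on four topics (two affirmative, two negated).
train_topics = [make_topic(200, p) for p in (+1, +1, -1, -1)]
t_g_hat, t_p_hat = fit_directions(train_topics)

# Evaluate on topics with fresh offsets that were not used for fitting.
acc_aff = accuracy_on_unseen(t_g_hat, make_topic(200, +1))
acc_neg = accuracy_on_unseen(t_g_hat, make_topic(200, -1))
cos_g = float(t_g_hat @ t_g_true / np.linalg.norm(t_g_hat))

print("cosine with generating direction:", cos_g)
print("unseen affirmative topic accuracy:", acc_aff)
print("unseen negated topic accuracy:", acc_neg)
```

The label-free centering of the held-out topic only works here because the toy topics are balanced between true and false statements. Passing such a test on synthetic data says nothing about whether a direction is an elementary feature; it only shows the shape of a held-out evaluation.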

Comment by Lennart Buerger on JumpReLU SAEs + Early Access to Gemma 2 SAEs · 2024-07-25T14:21:08.589Z · LW · GW

Nice work! I was wondering what context length you used when you extracted the LLM activations to train the SAE. I could not find it in the paper, but I might have missed it. I know that OpenAI used a context length of 64 tokens in all their experiments, which is probably not sufficient to elicit many interesting features. Do you use a variable context length or a fixed value as well?
