Posts

An information-theoretic study of lying in LLMs 2024-08-02T10:06:39.312Z
Implementing activation steering 2024-02-05T17:51:55.851Z
Classifying representations of sparse autoencoders (SAEs) 2023-11-17T13:54:02.171Z
Evaluating hidden directions on the utility dataset: classification, steering and removal 2023-09-25T17:19:13.988Z

Comments

Comment by Annah (annah) on Alignment Faking in Large Language Models · 2024-12-20T17:00:17.721Z · LW · GW

complied

Should it not say "refused" here, since you are talking about the new goal of replying to harmful requests?

Comment by Annah (annah) on Classifying representations of sparse autoencoders (SAEs) · 2023-11-17T20:13:35.896Z · LW · GW

The relative difference in the train accuracies looks pretty similar. But yeah, @SenR already pointed to the low number of active features in the SAE, so that explains this nicely.

Comment by Annah (annah) on Classifying representations of sparse autoencoders (SAEs) · 2023-11-17T19:50:56.056Z · LW · GW

Yeah, this makes a ton of sense. Thanks for taking the time to give it a closer look, and for your detailed response :)

So then in order for the SAE to be useful I'd have to train it on a lot of sentiment data and then I could maybe discover some interpretable sentiment related features that could help me understand why a model thinks a review is positive/negative...

Comment by Annah (annah) on Classifying representations of sparse autoencoders (SAEs) · 2023-11-17T19:25:59.551Z · LW · GW

I'm not quite sure what you mean by "the sentiment will not be linearly separable".

The hidden states are linearly separable (to some extent), but the sparse representations perform worse than the original representations in my experiment.

I am training logistic regression classifiers on the original and sparse representations respectively, so I am multiplying the residual stream states (and their sparse encodings) with learned weights. These weights could (but don't have to) align with some meaningful direction like hidden_states("positive") - hidden_states("negative").
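The setup above can be sketched as follows. This is a toy illustration with synthetic data, not the actual model activations or SAE from the post: the random "hidden states", the random ReLU encoder standing in for an SAE, and the `fit_logreg` helper are all hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: in the experiment these would be residual-stream
# states for positive/negative reviews and their SAE encodings.
d_model, d_sae, n = 64, 256, 400
direction = rng.normal(size=d_model)        # a planted "sentiment" direction
labels = rng.integers(0, 2, size=n)         # 1 = positive, 0 = negative
hidden = rng.normal(size=(n, d_model)) + np.outer(2 * labels - 1, direction)

# Toy "sparse encoding": random overcomplete projection + ReLU (not a trained SAE).
W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
sparse = np.maximum(hidden @ W_enc, 0.0)

def fit_logreg(X, y, lr=0.1, steps=500):
    """Plain logistic regression via gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        z = np.clip(X @ w + b, -30, 30)     # avoid overflow in exp
        p = 1.0 / (1.0 + np.exp(-z))
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

def accuracy(X, y, w, b):
    return np.mean(((X @ w + b) > 0) == y)

w_h, b_h = fit_logreg(hidden, labels)       # probe on the original states
w_s, b_s = fit_logreg(sparse, labels)       # probe on the sparse codes
print(f"hidden-state acc: {accuracy(hidden, labels, w_h, b_h):.2f}")
print(f"sparse-code acc:  {accuracy(sparse, labels, w_s, b_s):.2f}")
```

Note that the learned weight vector `w_h` need not equal the planted `direction`; it only has to separate the two classes, which is the point made above about the classifier weights not necessarily aligning with a meaningful difference-of-means direction.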

I'm not sure if I understood your comment about the logit lens. Are you proposing this as an alternative way of testing for linear separability? But then shouldn't the information already be encoded in the hidden states and thus extractable with a classifier?

Comment by Annah (annah) on Evaluating hidden directions on the utility dataset: classification, steering and removal · 2023-09-26T12:53:10.673Z · LW · GW

Thanks for the feedback. Fixed the typo and added the ITI reference.