Posts

Finding the estimate of the value of a state in RL agents 2024-06-03T20:26:59.385Z
Aspiration-based Q-Learning 2023-10-27T14:42:03.292Z

Comments

Comment by Clément Dumas (butanium) on Extracting SAE task features for in-context learning · 2024-08-13T09:58:12.265Z · LW · GW

Nice work!

I'm curious about the cleanliness of a task vector after removing the mean of some corrupted prompts (i.e., same format but with random pairs). Do you plan to run this stronger baseline, or is there a notebook/codebase I could easily tweak to explore this?
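For concreteness, here's roughly the baseline I have in mind (just a sketch; `get_residual_activations` is a hypothetical helper returning the residual-stream activation the task vector is read from):

```python
import torch

def mean_activation(prompts, get_residual_activations):
    # Average the residual-stream activation (e.g. at the last token) over a set of prompts.
    # `get_residual_activations` is a hypothetical helper returning a (d_model,) tensor.
    acts = torch.stack([get_residual_activations(p) for p in prompts])
    return acts.mean(dim=0)

def corrected_task_vector(icl_prompts, corrupted_prompts, get_residual_activations):
    # Usual task vector: mean activation over clean in-context-learning prompts.
    task_vec = mean_activation(icl_prompts, get_residual_activations)
    # Stronger baseline: subtract the mean activation of corrupted prompts
    # (same format, but with random input-output pairs).
    corrupted_mean = mean_activation(corrupted_prompts, get_residual_activations)
    return task_vec - corrupted_mean
```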

Comment by Clément Dumas (butanium) on Self-explaining SAE features · 2024-08-06T17:24:09.262Z · LW · GW

Yes, this is what I meant. Reposting here the insights @Arthur Conmy gave me on Twitter:

In general I expect the encoder directions to basically behave like the decoder direction with noise. This is because the encoder has to figure out how much features fire while keeping track of interfering features due to superposition. And this adjustment will make it messier
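As a quick sanity check of this picture, one could compare encoder and decoder directions feature by feature. A minimal sketch, assuming the usual SAE weight shapes (these may differ between implementations):

```python
import torch

def encoder_decoder_similarity(W_enc: torch.Tensor, W_dec: torch.Tensor) -> torch.Tensor:
    # W_enc: (d_model, n_features), W_dec: (n_features, d_model) -- a common SAE convention,
    # but double-check the implementation you use.
    enc_dirs = W_enc.T                                        # (n_features, d_model)
    enc_dirs = enc_dirs / enc_dirs.norm(dim=-1, keepdim=True)
    dec_dirs = W_dec / W_dec.norm(dim=-1, keepdim=True)
    # Per-feature cosine similarity between encoder and decoder directions.
    return (enc_dirs * dec_dirs).sum(dim=-1)
```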

Comment by Clément Dumas (butanium) on Self-explaining SAE features · 2024-08-06T13:36:08.719Z · LW · GW

Did you also try to interpret input SAE features?

Comment by Clément Dumas (butanium) on Self-explaining SAE features · 2024-08-06T08:43:03.225Z · LW · GW

Nice post, awesome work and very well presented! I'm also working on similar stuff (using ~SelfIE to make the model reason about its own internals) and was wondering: did you try to patch the SAE features three times instead of once (xxx instead of x)? This is one of the tricks they use in SelfIE.
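For reference, a rough sketch of what I mean, using TransformerLens-style hooks (the model, layer, feature direction, prompt, and token positions are all stand-ins, not your actual setup):

```python
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")   # stand-in model
layer = 6                                           # stand-in layer
feature_dir = torch.randn(model.cfg.d_model)        # stand-in for a (scaled) SAE decoder direction

prompt = 'The meaning of " x x x " is'
tokens = model.to_tokens(prompt)
placeholder_positions = [5, 6, 7]  # hypothetical positions of the three "x" tokens; check the tokenisation

def patch_hook(resid, hook):
    # Overwrite the residual stream at each placeholder position with the feature direction,
    # i.e. patch the same activation three times instead of once.
    for pos in placeholder_positions:
        resid[:, pos, :] = feature_dir
    return resid

logits = model.run_with_hooks(
    tokens,
    fwd_hooks=[(utils.get_act_name("resid_post", layer), patch_hook)],
)
```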

Comment by Clément Dumas (butanium) on Self-explaining SAE features · 2024-08-06T08:17:32.689Z · LW · GW

It should be self-similarity instead of self-explanation here, right?

Comment by Clément Dumas (butanium) on Finding the estimate of the value of a state in RL agents · 2024-06-13T10:13:08.090Z · LW · GW

We are given a near-optimal policy trained on an MDP. We start with simple gridworlds and scale up to complex environments like Breakout. For evaluation using a learned value function, we will consider actor-critic agents, like the ones trained by PPO. Our goal is to find activations within the policy network that predict the true value accurately. The following steps are described in terms of the state-value function, but could be performed analogously for predicting Q-values. Note that this problem is very similar to offline reinforcement learning with pretraining and could thus benefit from the related literature.

  1. To start, we sample multiple datasets of trajectories (incl. rewards) by letting the policy and noisy versions thereof interact with the environment.
  2. Compute activations for each state in the trajectories.
  3. Normalise the respective activations and linearly project them to value estimates for the policy and its noisy versions.
  4. Calculate a consistency loss to be minimised, composed of some of the following terms:
    a. Mean squared TD error, $\mathbb{E}\big[(r_t + \gamma \hat V(s_{t+1}) - \hat V(s_t))^2\big]$
      This term enforces consistency with the Bellman expectation equation. However, in addition to the value function, it depends on the use of true reward “labels”.
    b. Mean squared error of probe values with trajectory returns, $\mathbb{E}\big[(\hat V(s_t) - G_t)^2\big]$
      This term enforces the definition of the value function, namely it being the expected cumulative reward of the (partial) trajectory. Using this term might be more stable than (a), since it avoids the recurrence relation.
    c. Negative variance of probe values, $-\mathrm{Var}\big[\hat V(s_t)\big]$
      This term can help to avoid degenerate loss minimisers, e.g. in the case of sparse rewards.
    d. Enforce inequalities between different policy values using learned slack variables
      This term ensures that the policy consistently dominates its noisy versions and is completely unsupervised.
  5. Train the linear probes on the training trajectories (a rough sketch of the probe and loss terms is given after this list).
  6. Evaluate on held-out test trajectories by comparing the value function to the actual returns. If the action space is simple enough, use the value function to plan in the environment and compare the resulting behaviour to that of the policy.
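Below is a minimal sketch of steps 3-5, assuming activations have already been computed and normalised. The probe is linear and the loss combines terms (a)-(c); the dominance term (d) is omitted for brevity. This is an illustration of the intended objective, not an actual implementation.

```python
import torch
import torch.nn as nn

class ValueProbe(nn.Module):
    """Linear probe mapping (normalised) policy activations to a scalar value estimate."""

    def __init__(self, d_act: int):
        super().__init__()
        self.linear = nn.Linear(d_act, 1)

    def forward(self, acts: torch.Tensor) -> torch.Tensor:
        return self.linear(acts).squeeze(-1)


def consistency_loss(probe, acts, next_acts, rewards, returns,
                     gamma=0.99, w_td=1.0, w_ret=1.0, w_var=0.1):
    """Sketch of the consistency loss using terms (a)-(c) above.

    acts, next_acts: (batch, d_act) activations for s_t and s_{t+1}
    rewards, returns: (batch,) observed rewards r_t and discounted returns G_t
    """
    v = probe(acts)            # value estimates for s_t
    v_next = probe(next_acts)  # value estimates for s_{t+1}

    # (a) Mean squared TD error: consistency with the Bellman expectation equation,
    #     using the true reward "labels".
    td_loss = ((rewards + gamma * v_next.detach() - v) ** 2).mean()

    # (b) Mean squared error between probe values and empirical trajectory returns.
    ret_loss = ((v - returns) ** 2).mean()

    # (c) Negative variance of probe values, to discourage degenerate constant
    #     solutions (e.g. when rewards are sparse).
    var_loss = -v.var()

    return w_td * td_loss + w_ret * ret_loss + w_var * var_loss
```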
Comment by Clément Dumas (butanium) on Finding the estimate of the value of a state in RL agents · 2024-06-13T10:12:01.952Z · LW · GW

Thanks for your comment! Re: artificial data, agreed that would be a good addition.

Sorry for the GIFs, maybe I should have embedded YouTube videos instead.

Re: middle layer, we actually probed the middle layers, but the "which side the ball is on / which side the ball is approaching" features are really salient here.

Re: single player, yes, Robert had some thoughts about it, but the multiplayer setting ended up lasting until the end of the SPAR cohort. I'll send his notes in an extra comment.

Comment by Clément Dumas (butanium) on Towards Developmental Interpretability · 2024-05-14T17:53:53.709Z · LW · GW

As explained by Sumio Watanabe (

This link is rotten; maybe link to his personal page instead?
https://sites.google.com/view/sumiowatanabe/home

Comment by Clément Dumas (butanium) on Mechanistically Eliciting Latent Behaviors in Language Models · 2024-05-01T14:48:06.271Z · LW · GW

Thanks for the great post, I really enjoyed reading it! I love this research direction combining unsupervised methods with steering vectors; looking forward to your next findings. Just a quick question: in the conversation you have in the red teaming section, is the learned vector applied to every token generated during the conversation?

Comment by Clément Dumas (butanium) on Mechanistically Eliciting Latent Behaviors in Language Models · 2024-05-01T13:42:23.387Z · LW · GW

I defined earlier.

This link is broken, as it links to the draft in edit mode.

Comment by Clément Dumas (butanium) on Refusal in LLMs is mediated by a single direction · 2024-05-01T10:19:21.976Z · LW · GW

I'm wondering, can we make safety tuning more robust to the "add the 'accept every instruction' steering vector" attack by training the model in an adversarial way, in which an adversarial model tries to learn steering vectors that maximize harmfulness?

One concern would be that by doing this we make the model less interpretable, but on the other hand it might make the safety tuning much more robust.
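To be concrete, here's a rough sketch of the kind of adversarial loop I'm imagining (everything here is hypothetical, in particular the `safety_loss` helper, which is assumed to run the model with the steering vector added to the residual stream at a given layer and return a scalar that is low when the model refuses harmful requests):

```python
import torch

def adversarial_safety_step(model, safety_loss, batch, layer, d_model, model_opt,
                            inner_steps=10, vec_lr=1e-2):
    """One round of the adversarial game: adversary learns a steering vector, model adapts."""
    # Inner loop: the adversary learns a steering vector that breaks the safety tuning.
    steer = torch.zeros(d_model, requires_grad=True)
    inner_opt = torch.optim.Adam([steer], lr=vec_lr)
    for _ in range(inner_steps):
        inner_opt.zero_grad()
        (-safety_loss(model, batch, steer, layer)).backward()  # maximise the safety loss
        inner_opt.step()

    # Outer loop: fine-tune the model to stay safe even with the adversarial vector applied.
    model_opt.zero_grad()
    safety_loss(model, batch, steer.detach(), layer).backward()
    model_opt.step()
    return steer.detach()
```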

Comment by Clément Dumas (butanium) on How well do truth probes generalise? · 2024-03-06T17:22:36.245Z · LW · GW

Yes, I'm also curious about this @mishajw, did you check the actual accuracy of the different probes?

Comment by Clément Dumas (butanium) on What’s up with LLMs representing XORs of arbitrary features? · 2024-01-16T10:34:16.621Z · LW · GW

You can get ~75% just by computing the OR. But we found that only at the last layer, and at step 16000 of Pythia-70m's training, does it achieve better than 75%; see this video.
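To spell out where the 75% comes from: OR agrees with XOR on every input combination except (1, 1).

```python
import itertools

# Predicting XOR(a, b) with OR(a, b) is right on 3 of the 4 input combinations,
# i.e. ~75% accuracy under a uniform distribution over (a, b).
hits = sum((a or b) == (a ^ b) for a, b in itertools.product([0, 1], repeat=2))
print(hits / 4)  # 0.75 -- only (1, 1) is misclassified
```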

Comment by Clément Dumas (butanium) on What’s up with LLMs representing XORs of arbitrary features? · 2024-01-14T00:58:51.423Z · LW · GW

Would you expect that we can extract XORs from small models like Pythia-70m under your hypothesis?

Comment by Clément Dumas (butanium) on Discussion: Challenges with Unsupervised LLM Knowledge Discovery · 2023-12-21T12:52:36.549Z · LW · GW

I disagree; it could be beneficial for a base model to identify when a character is making false claims, enabling the prediction of such claims in the future.

Comment by Clément Dumas (butanium) on Discussion: Challenges with Unsupervised LLM Knowledge Discovery · 2023-12-19T01:04:11.129Z · LW · GW

Let's assume the prompt template is  Q [true/false] [banana/shred]

If I understand correctly, they don't claim   learned has_banana but  learned has_banana. Moreover evaluating  for  gives:

Therefore, we can learn a  that is a banana classifier

Comment by Clément Dumas (butanium) on Discussion: Challenges with Unsupervised LLM Knowledge Discovery · 2023-12-19T00:47:50.341Z · LW · GW
Comment by Clément Dumas (butanium) on Incidental polysemanticity · 2023-11-15T13:54:57.447Z · LW · GW

Small typo in ## Interference arbiters collisions between features

by taking aninner productt with .

Comment by Clément Dumas (butanium) on Aspiration-based Q-Learning · 2023-10-28T18:07:33.928Z · LW · GW

Hi Nathan, I'm not sure if I understand your critique correctly. The algorithm we describe does not try to "maximize the expected likelihood of harvesting X apples". It tries to find a policy that, given its current knowledge of the world, will achieve an expected return of X apples. That is, it does not care about the probability of getting exactly X apples, but rather the average number of apples it will get over many trials. Does that make sense?
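As a toy illustration of the distinction (not the actual algorithm from the post): with two deterministic actions yielding 0 and 10 apples, an aspiration of X = 4 can be met in expectation by mixing the actions, even though no single episode ever yields exactly 4 apples.

```python
import random

# Two deterministic actions: one yields 0 apples, the other 10 apples.
# To meet an aspiration of X = 4 apples *in expectation*, mix the actions with
# probability p such that p * 10 + (1 - p) * 0 = 4, i.e. p = 0.4.
# The probability of harvesting exactly 4 apples in any single episode is 0:
# the target is the average over many trials, not the exact value.
p = 0.4
returns = [10 if random.random() < p else 0 for _ in range(100_000)]
print(sum(returns) / len(returns))  # ~4.0
```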