jacob_drori

Posts
Comments

Posts

There is a globe in your LLM 2024-10-08T00:43:40.300Z

Domain-specific SAEs 2024-10-07T20:15:38.584Z

Open Source Automated Interpretability for Sparse Autoencoder Features 2024-07-30T21:11:36.866Z

Comments

Comment by jacob_drori (jacobcd52) on Memorization-generalization in practice · 2025-03-27T17:14:03.023Z · LW · GW

Vary temperature t and measure the resulting learning coefficient function

This confuses me. IIUC, $p_{β, n} (w) = \frac{φ (w) exp (- n β L_{n} (w))}{\int d w^{'} φ (w^{'}) exp (- n β L_{n} (w^{'}))}$ . So changing temperature is equivalent to rescaling the loss by a constant. But such a rescaling doesn't affect the LLC.

What did I misunderstand?

Comment by jacob_drori (jacobcd52) on Estimating the Probability of Sampling a Trained Neural Network at Random · 2025-03-07T16:26:29.397Z · LW · GW

Let be volume of a behavioral region at cutoff $ϵ$ . Your behavioral LLC at finite noise scale is $λ (ϵ) := d log V / d log ϵ$ , which is invariant under rescaling $V$ by a constant. This information about the overall scale of $V$ seems important. What's the reason for throwing it out in SLT?

Comment by jacob_drori (jacobcd52) on Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs · 2025-02-27T06:03:46.267Z · LW · GW

Fantastic research! Any chance you'll open-source weights of the insecure qwen model? This would be useful for interp folks.

Comment by jacob_drori (jacobcd52) on [PAPER] Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations · 2025-02-26T22:03:09.001Z · LW · GW

The Jacobians are much more sparse in pre-trained LLMs than in re-initialized transformers.

This would be very cool if true, but I think further experiments are needed to support it.

Imagine a dumb scenario where during training, all that happens to the MLP is that it "gets smaller", so that MLP_trained(x) = c * MLP_init(x) for some small c. Then all the elements of the Jacobian also get smaller by a factor of c, and your current analysis -- checking the number of elements above a threshold -- would conclude that the Jacobian had gotten sparser. This feels wrong: merely rescaling a function shouldn't affect the sparsity of the computation it implements.

To avoid this issue, you could report a scale-invariant quantity like the kurtosis of the Jacobian's elements divided by their variance-squared, or ratio of L1 and L2 norms, or plenty of other options. But these quantities still aren't perfect, since they aren't invariant under linear transformations of the model's activations:

E.g. suppose an mlp_out feature F depends linearly on some mlp_in feature G, which is roughly orthogonal to F. If we stretch all model activations along the F direction, and retrain our SAEs, then the new mlp_out SAE will contain (in an ideal world) a feature F' which is the same as F but with larger activations by some factor. On the other hand, the mlp_in SAE should will contain a feature G' which is roughly the same as G. Hence the (F, G) element of the Jacobian has been made bigger, simply by applying a linear transformation to the model's activations. Generally this will affect our sparsity measure, which feels wrong: merely applying a linear map to all model activations shouldn't change the sparsity of the computation being done on those activations. In other words, our sparsity measure shouldn't depend on a choice of basis for the residual stream.

I'll try to think of a principled measure of the sparsity of the Jacobian. In the meantime, I think it would still be interesting to see a scale-invariant quantity reported, as suggested above.

Comment by jacob_drori (jacobcd52) on Against blanket arguments against interpretability · 2025-01-23T17:56:09.632Z · LW · GW

We have pretty robust measurements of complexity of algorithms from SLT

This seems overstated. What's the best evidence so far that the LLC positively correlates with the complexity of the algorithm implemented by a model? In fact, do we even have any models whose circuitry we understand well enough to assign them a "complexity"?

... and it seems like similar methods can lead to pretty good ways of separating parallel circuits (Apollo also has some interesting work here that I think constitutes real progress)

Citation?

Comment by jacob_drori (jacobcd52) on The quantum red pill or: They lied to you, we live in the (density) matrix · 2025-01-18T10:59:14.241Z · LW · GW

Same difference

Comment by jacob_drori (jacobcd52) on The quantum red pill or: They lied to you, we live in the (density) matrix · 2025-01-18T10:57:17.452Z · LW · GW

I'd prefer "basis we just so happen to be measuring in". Or "measurement basis" for short.

You could use "pointer variable", but this would commit you to writing several more paragraphs to unpack what it means (which I encourage you to do, maybe in a later post).

Comment by jacob_drori (jacobcd52) on The quantum red pill or: They lied to you, we live in the (density) matrix · 2025-01-18T01:36:59.075Z · LW · GW

Your use of "pure state" is totally different to the standard definition (namely rank(rho)=1). I suggest using a different term.

Comment by jacob_drori (jacobcd52) on The quantum red pill or: They lied to you, we live in the (density) matrix · 2025-01-18T01:33:04.773Z · LW · GW

The QM state space has a preferred inner product, which we can use to e.g. dualize a (0,2) tensor (i.e. a thing that eats takes two vectors and gives a number) into a (1,1) tensor (i.e. an operator). So we can think of it either way.

Comment by jacob_drori (jacobcd52) on Domain-specific SAEs · 2025-01-05T22:11:36.168Z · LW · GW

Oops, good spot! I meant to write 1 minus that quantity. I've edited the OP.

Comment by jacob_drori (jacobcd52) on Exploring the Platonic Representation Hypothesis Beyond In-Distribution Data · 2024-10-21T05:47:03.641Z · LW · GW

This seems very interesting, but I think your post could do with a lot more detail. How were the correlations computed? How strongly do they support PRH? How was the OOD data generated? I'm sure the answers could be pieced together from the notebook, but most people won't click through and read the code.

Comment by jacob_drori (jacobcd52) on Circuits in Superposition: Compressing many small neural networks into one · 2024-10-14T20:45:05.718Z · LW · GW

Ah, I think I understand. Let me write it out to double-check, and in case it helps others.

Say , for simplicity. Then $A^{l} = \sum_{t} E_{t} a_{t}^{l}$ . This sum has $k$ nonzero terms.

In your construction, $W^{l, i n} = \sum_{t} V_{t}^{l} W_{t}^{l, i n} E_{t}^{T}$ . Focussing on a single neuron, labelled by $i$ , we have $(W^{l, i n})_{i} = \sum_{t} (V_{t}^{l})_{i} W_{t}^{l, i n} E_{t}^{T}$ . This sum has $\sim p T$ nonzero terms.

So the preactivation of an MLP hidden neuron in the big network is $p_{i}^{l} = \sum_{t, t^{'}} (V_{t}^{l})_{i} W_{t}^{l, i n} E_{t}^{T} E_{t^{'}} a_{t^{'}}^{l}$ . This sum has $\sim k p T$ nonzero terms.

We only "want" the terms where $t = t^{'}$ ; the rest (i.e. the majority) are noise. Each noise term in the sum is a random vector, so each of the $\sim k p T$ different noise terms are roughly orthogonal, and so the norm of the noise is $O (\sqrt{k p T})$ (times some other factors, but this captures the $T$ -dependence, which is what I was confused about).

Comment by jacob_drori (jacobcd52) on Circuits in Superposition: Compressing many small neural networks into one · 2024-10-14T20:25:54.276Z · LW · GW

Comment by jacob_drori (jacobcd52) on Circuits in Superposition: Compressing many small neural networks into one · 2024-10-14T17:25:27.529Z · LW · GW

I'm confused by the read-in bound:

Sure, each neuron reads from $T \frac{n log M}{M}$ of the random subspaces. But in all but $k$ of those subspaces, the big network's activations are smaller than $δ$ , right? So I was expecting a tighter bound - something like:

$ϵ_{t}^{l, i n} = O (w a \sqrt{(k + T δ) \frac{m d}{M D} log M})$

Comment by jacob_drori (jacobcd52) on tailcalled's Shortform · 2024-09-24T20:02:52.567Z · LW · GW

Ah, so I think you're saying "You've explained to me the precise reason why energy and momentum (i.e. time and space) are different at the fundamental level, but why does this lead to the differences we observe between energy and momentum (time and space) at the macro-level?

This is a great question, and as with any question of the form "why does this property emerge from these basic rules", there's unlikely to be a short answer. E.g. if you said "given our understanding of the standard model, explain how a cell works", I'd have to reply "uhh, get out a pen and paper and get ready to churn through equations for several decades".

In this case, one might be able to point to a few key points that tell the rough story. You'd want to look at properties of solutions PDEs on manifolds with metric of signature (1,3) (which means "one direction on the manifold is different to the other three, in that it carries a minus sign in the metric compared to the others in the metric"). I imagine that, generically, these solutions behave differently with respect to the "1" direction and the "3" directions. These differences will lead to the rest of the emergent differences between space and time. Sorry I can't be more specific!

Comment by jacob_drori (jacobcd52) on tailcalled's Shortform · 2024-09-24T18:41:59.661Z · LW · GW

> could one replace the energy-first formulations of quantum mechanics with momentum-first formulations?

Momentum is to space what energy is to time. Precisely, energy generates (in the Lie group sense) time-translations, whereas momentum generates spatial translations. So any question about ways in which energy and momentum differ is really a question about how time and space differ.

In ordinary quantum mechanics, time and space are treated very differently: is a coordinate whereas $x$ is a dynamical variable (which happens to be operator-valued). The equations of QM tell us how $x$ evolves as a function of $t$ .

But ordinary QM was long-ago replaced by quantum field theory, in which time and space are on a much more even footing: they are both coordinates, and the equations of QFT tell us how a third thing (the field $ϕ (x, t)$ ) evolves as a function of $x$ and $t$ . Now, the only difference between time and space is that there is only one dimension of the former but three of the latter (there may be some other very subtle differences I'm glossing over here, but I wouldn't be surprised if they ultimately stem from this one).

All of this is to say: our best theory of how nature works (QFT), is neither formulated as "energy-first" nor as "momentum-first". Instead, energy and momentum are on fairly equal footing.

Comment by jacob_drori (jacobcd52) on tailcalled's Shortform · 2024-09-24T18:20:56.270Z · LW · GW

Comment by jacob_drori (jacobcd52) on tailcalled's Shortform · 2024-09-21T01:20:34.724Z · LW · GW

Sure, there are plenty of quantities that are globally conserved at the fundamental (QFT) level. But most most of.these quantities aren't transferred between objects at the everyday, macro level we humans are used to.

E.g. 1: most everyday objects have neutral electrical charge (because there exist positive and negative charges, which tend to attract and roughly cancel out) so conservation of charge isn't very useful in day-to-day life.

E.g. 2: conservation of color charge doesn't really say anything useful about everyday processes, since it's only changed by subatomic processes (this is again basically due to the screening effect of particles with negative color charge, though the story here is much more subtle, since the main screening effect is due to virtual particles rather than real ones).

The only other fundamental conserved quantity I can think of that is nontrivially exchanged between objects at the macro level is momentum. And... momentum seems roughly as important as energy?

I guess there is a question about why energy, rather than momentum, appears in thermodynamics. If you're interested, I can answer in a separate comment.

Comment by jacob_drori (jacobcd52) on Does a time-reversible physical law/Cellular Automaton always imply the First Law of Thermodynamics? · 2024-08-30T19:21:22.562Z · LW · GW

I'll just answer the physics question, since I don't know anything about cellular automata.

When you say time-reversal symmetry, do you mean that t -> T-t is a symmetry for any T?

If so, the composition of two such transformations is a time-translation, so we automatically get time-translation symmetry, which implies the 1st law.

If not, then the 1st law needn't hold. E.g. take any time-dependent hamiltonian satisfying H(t) = H(-t). This has time-reversal symmetry about t=0, but H is not conserved.

Comment by jacob_drori (jacobcd52) on DSLT 1. The RLCT Measures the Effective Dimension of Neural Networks · 2024-06-12T20:10:29.147Z · LW · GW

The theorem guarantees the existence of a -dimensional analytic manifold $M$ and a real analytic map

g : M ∋ u \mapsto w \in W

such that for each coordinate $M_{α}$ of $M$ one can write

\begin{matrix} K (g (u)) & = u_{1}^{2 k_{1}} \dots u_{d}^{2 k_{d}} . . . \end{matrix}

I'm a bit confused here. First, I take it that $α$ labels coordinate patches? Second, consider the very simple case with $d = 2$ and $K (w) = w_{1}^{2} + w_{2}^{2}$ . What $g$ would put $K$ into the stated form?

Comment by jacob_drori (jacobcd52) on Improving Dictionary Learning with Gated Sparse Autoencoders · 2024-04-26T00:00:08.730Z · LW · GW

Nice work! I'm not sure I fully understand what the "gated-ness" is adding, i.e. what the role the Heaviside step function is playing. What would happen if we did away with it? Namely, consider this setup:

Let and $^x$ be the encoder and decoder functions, as in your paper, and let $x$ be the model activation that is fed into the SAE.

The usual SAE reconstruction is $^x (f (x))$ , which suffers from the shrinkage problem.

Now, introduce a new learned parameter $t \in R^{n_{f e a t u r e s}}$ , and define an "expanded" reconstruction $y_{e x p a n d e d} =^x (t ⊙ f (x))$ , where $⊙$ denotes elementwise multiplication.

Finally, take the loss to be:

$L = | | {^x}_{c o p y} (f (x)) - x | |_{2}^{2} + | | y_{e x p a n d e d} - x | |_{2}^{2} + λ | | f (x) | |_{1}$ .

where ${^x}_{c o p y}$ ensures the decoder gets no gradients from the first term. As I understand it, this is exactly the loss appearing in your paper. The only difference in the setup is the lack of the Heaviside step function.

Did you try this setup? Or does it fail for an obvious reason I missed?

Comment by jacob_drori (jacobcd52) on Do sparse autoencoders find "true features"? · 2024-03-05T16:22:08.728Z · LW · GW

The peaks at 0.05 and 0.3 are strange. What regulariser did you use? Also, could you check whether all features whose nearest neighbour has cosine similarity 0.3 have the same nearest neighbour (and likewise for 0.05)?

Comment by jacob_drori (jacobcd52) on Some costs of superposition · 2024-03-04T18:57:30.683Z · LW · GW

The typical noise on feature caused by 1 unit of activation from feature $f_{2}$ , for any pair of features $f_{1}$ , $f_{2}$ , is (derived from Johnson–Lindenstrauss lemma)
$ϵ = \sqrt{\frac{8 ln (m)}{n}}$ ^[1]

1. ... This is a worst case scenario. I have not calculated the typical case, but I expect it to be somewhat less, but still same order of magnitude

Perhaps I'm misunderstanding your claim here, but the "typical" (i.e. RMS) inner product between two independently random unit vectors in $R^{n}$ is $n^{- 1 / 2}$ . So I think the $\sqrt{8 ln m}$ shouldn't be there, and the rest of your estimates are incorrect.

This means that we can have at most $l < n / (8 ln (m))$ simultaneously active features

This conclusion gets changed to $l < n$ .

Comment by jacob_drori (jacobcd52) on Attention SAEs Scale to GPT-2 Small · 2024-02-28T18:10:47.197Z · LW · GW

Paging hijohnnylin -- it'd be awesome to have neuronpedia dashboards for these features. Between these, OpenAI's MLP features, and Joseph Bloom's resid_pre features, we'd have covered pretty much the whole model!

Comment by jacob_drori (jacobcd52) on Do sparse autoencoders find "true features"? · 2024-02-27T20:24:27.983Z · LW · GW

For each SAE feature (i.e. each column of W_dec), we can look for a distinct feature with the maximum cosine similarity to the first. Here is a histogram of these max cos sims, for Joseph Bloom's SAE trained at resid_pre, layer 10 in gpt2-small. The corresponding plot for random features is shown for comparison:

The SAE features are much less orthogonal than the random ones. This effect persists if, instead of the maximum cosine similarity, we look at the 10th largest, or the 100th largest:

I think it's a good idea to include a loss term to incentivise feature orthogonality.

Comment by jacob_drori (jacobcd52) on Do sparse autoencoders find "true features"? · 2024-02-26T19:18:07.848Z · LW · GW

Nice, this is exactly what I was asking for. Thanks!

Comment by jacob_drori (jacobcd52) on Do sparse autoencoders find "true features"? · 2024-02-23T18:13:28.824Z · LW · GW

I'm confused about your three-dimensional example and would appreciate more mathematical detail.

Call the feature directions f1, f2, f3.

Suppose SAE hidden neurons 1,2 and 3 read off the components along f1, f2, and f1+f2, respectively. You claim that in some cases this may achieve lower L1 loss than reading off the f1, f2, f3 components.

[note: the component of a vector X along f1+f2 here refers to 1/2 * (f1+f2) \cdot X]

Can you write down the encoder biases that would achieve this loss reduction? Note that e.g. when the input is f1, there is a component of 1/2 along f1+f2, so you need a bias < -1/2 on neuron 3 to avoid screwing up the reconstruction.

Comment by jacob_drori (jacobcd52) on Mapping the semantic void II: Above, below and between token embeddings · 2024-02-16T17:47:31.437Z · LW · GW

Nice post. I was surprised that the model provides the same nonsense definition regardless of the token when the embedding is rescaled to be large, and moreover that this nonsense definition is very similar to the one given when the embedding is rescaled to be small. Here's an explanation I find vaguely plausible. Suppose the model completes the task as follows:

The model sees the prompt 'A typical definition of <token> would be '.
At some attention head A1, the <token> position attends back to 'definition' and gains a component in the residual stream direction that represents the I am the token being defined feature.
At some later attention head A2, the final position of the prompt attends back to positions with the I am the token being defined feature, and moves whatever information from that position is needed for defining the corresponding token.

Now, suppose we rescale the <token> embedding to be very large. The size of the I am the token being defined component moved to the <token> position by A1 stays roughly the same as before (since no matter how much we scale query vectors, attention probabilities can never exceed 1). So, as a fraction of the total norm of the residual stream at that position, we've made the I am the token being defined component a lot smaller.

Then, when the residual stream is fed into the layernorm preceding A2, the I am the token being defined component gets squashed down to almost zero: it has been "squeezed out" by the very large token embedding. Hence, when the QK matrix of A2 looks for positions with the I am the token being defined feature, it finds nothing, and all the model can do is give some generic nonsense definition. Unsurprisingly, this nonsense definition ends up being pretty similar to the one given when the token embedding is sent to zero, since in both cases the model is essentially trying to define a token that isn't there.

The details of this explanation may be totally wrong, and I haven't checked any of this. But my guess is that something roughly along these lines is correct.

Comment by jacob_drori (jacobcd52) on Internal independent review for language model agent alignment · 2023-12-28T22:51:40.877Z · LW · GW

I hope that type of learning isn't used

I share your hope, but I'm pessimistic. Using RL to continuously train the outer loop of an LLM agent seems like a no-brainer from a capabilities standpoint.

The alternative would be to pretrain the outer loop, and freeze the weights upon deployment. Then, I guess your plan would be to only use the independent reviewer after deployment, so that the reviewer's decision never influences the outer-loop weights. Correct me if I'm wrong here.

I'm glad you plan to address this in a future post, and I look forward to reading it.

Comment by jacob_drori (jacobcd52) on Internal independent review for language model agent alignment · 2023-12-25T23:32:08.266Z · LW · GW

I'm a little confused. What exactly is the function of the independent review, in your proposal? Are you imagining that the independent alignment reviewer provides some sort of "danger" score which is added to the loss? Or is the independent review used for some purpose other than providing a gradient signal?

Comment by jacob_drori (jacobcd52) on Finding Sparse Linear Connections between Features in LLMs · 2023-12-12T19:20:11.753Z · LW · GW

I'm slightly confused about the setup. In the following, what spaces is W mapping between?

Linear:

At first I expected W : R^{d_model} -> R^{d_model}. But then it wouldn't make sense to impose a sparsity penalty on W.

In other words: what is the shape of the matrix W?

Comment by jacobcd52 on [deleted post] 2023-12-05T20:20:35.302Z

Is your issue just "Alice's first sentence is so misguided that no self-respecting safety researcher would say such a thing"? If so, I can edit to clarify the fact that this is a deliberate strawman, which Bob rightly criticises. Indeed:

Bob: I'm asking you why models should misgeneralise in the extremely specific weird way that you mentioned

expresses a similar sentiment to Reward Is Not the Optimization Target: one should not blindly assume that models will generalise OOD to doing things that look like "maximising reward". This much is obvious by the example of individual humans not maximising inclusive genetic fitness.

But, as noted in the comments on Reward Is Not the Optimization Target, it seems plausible that some models really do learn at least some behaviours that are more-or-less what we'd naively expect from a reward-maximiser. E.g. Paul Christiano writes:

If you have a system with a sophisticated understanding of the world, then cognitive policies like "select actions that I expect would lead to reward" will tend to outperform policies like "try to complete the task," and so I usually expect them to be selected by gradient descent over time.

The purpose of Alice's thought experiment is precisely to give such an example, where a deployed model quite plausibly displays the sort of reward-maximiser behaviour one might've naively expected (in this case, power-seeking).

Comment by jacobcd52 on [deleted post] 2023-12-01T21:59:07.782Z

Regarding 3, yeah, I definitely don't want to say that the LLM in the thought experiment is itself power-seeking. Telling someone how to power-seek is not power seeking.

Regarding 1 and 2, I agree that the problem here is producing an LLM that refuses to give dangerous advice to another agent. I'm pretty skeptical that this can be done in a way that scales, but this could very well be lack of imagination on my part.

Comment by jacob_drori (jacobcd52) on Open Source Replication & Commentary on Anthropic's Dictionary Learning Paper · 2023-10-26T22:58:43.621Z · LW · GW

Define the "frequent neurons" of the hidden layer to be those that fire with frequency > 1e-4. The image of this set of neurons under W_dec forms a set of vectors living in R^d_mlp, which I'll call "frequent features".

These frequent features are less orthogonal than I'd naively expect.

If we choose two vectors uniformly at random on the (d_mlp)-sphere, their cosine sim has mean 0 and variance 1/d_mlp = 0.0005. But in your SAE, the mean cosine sim between distinct frequent features is roughly 0.0026, and the variance is 0.002.

So the frequent features have more cosine similarity than you'd get by just choosing a bunch of directions at random on the (d_mlp)-sphere. This effect persists even when you throw out the neuron-sparse features (as per your top10 definition).

Any idea why this might be the case? My previous intuition had been that transformers try to pack in their features as orthogonally as possible, but it looks like I might've been wrong about this. I'd also be interested to know if a similar effect is also found in the residual stream, or if it's entirely due to some weirdness with relu picking out a preferred basis for the mlp hidden layer.

User info

Posts

Comments