What’s up with LLMs representing XORs of arbitrary features? 2024-01-03T19:44:33.162Z
Some open-source dictionaries and dictionary learning infrastructure 2023-12-05T06:05:21.903Z
Thoughts on open source AI 2023-11-03T15:35:42.067Z
Turning off lights with model editing 2023-05-12T20:25:12.353Z
[Crosspost] ACX 2022 Prediction Contest Results 2023-01-24T06:56:33.101Z
AGISF adaptation for in-person groups 2023-01-13T03:24:58.320Z
Update on Harvard AI Safety Team and MIT AI Alignment 2022-12-02T00:56:45.596Z
Recommend HAIST resources for assessing the value of RLHF-related alignment research 2022-11-05T20:58:06.511Z
Caution when interpreting Deepmind's In-context RL paper 2022-11-01T02:42:06.766Z
Safety considerations for online generative modeling 2022-07-07T18:31:19.316Z
Proxy misspecification and the capabilities vs. value learning race 2022-05-16T18:58:24.044Z
If you’re very optimistic about ELK then you should be optimistic about outer alignment 2022-04-27T19:30:11.785Z
Sam Marks's Shortform 2022-04-13T21:38:26.871Z
2022 ACX predictions: market prices 2022-03-06T06:24:42.908Z
Movie review: Don't Look Up 2022-01-04T20:16:04.593Z
[Book review] Gödel, Escher, Bach: an in-depth explainer 2021-09-29T19:03:20.234Z
For mRNA vaccines, is (short-term) efficacy really higher after the second dose? 2021-04-25T20:21:59.349Z


Comment by Sam Marks (samuel-marks) on How well do truth probes generalise? · 2024-02-25T20:05:14.137Z · LW · GW

I originally ran some of these experiments on 7B and got very different results, that PCA plot of 7B looks familiar (and bizarre).

I found that the PCA plot for 7B for larger_than and smaller_than individually looked similar to that for 13B, but that the PCA plot for larger_than + smaller_than looked degenerate in the way I screenshotted. Are you saying that your larger_than + smaller_than PCA looked familiar for 7B?

I suppose there are two things we want to separate: "truth" from likely statements, and "truth" from what humans think (under some kind of simulacra framing).  I think this approach would allow you to do the former, but not the latter.  And to be honest, I'm not confident on TruthfulQA's ability to do the latter either.

Agreed on both points.

We differ slightly from the original GoT paper in naming, and use got_cities to refer to both the cities and neg_cities datasets. The same is true for sp_en_trans and larger_than. We don't do this for cities_cities_{conj,disj} and leave them unpaired.

Thanks for clarifying! I'm guessing this is what's making the GoT datasets much worse for generalization (from and to) in your experiments. For 13B, it mostly seemed to me that training on negated statements helped for generalization to other negated statements, and that pairing negated statements with unnegated statements in training data usually (but not always) made generalization to unnegated datasets a bit worse. (E.g. the cities -> sp_en_trans generalization is better than cities + neg_cities -> sp_en_trans generalization.)

Comment by Sam Marks (samuel-marks) on How well do truth probes generalise? · 2024-02-25T13:40:54.637Z · LW · GW

Very cool! Always nice to see results replicated and extended on, and I appreciated how clear you were in describing your experiments.

Do smaller models also have a generalised notion of truth?

In my most recent revision of GoT[1] we did some experiments to see how truth probe generalization changes with model scale, working with LLaMA-2-7B, -13B, and -70B. Result: truth probes seems to generalize better for larger models. Here are the relevant figures.

Some other related evidence from our visualizations:

We summed things up like so, which I'll just quote in its entirety:

Overall, these visualizations suggest a picture like the following: as LLMs scale (and perhaps, also as a fixed LLM progresses through its forward pass), they hierarchically develop and linearly represent increasingly general abstractions. Small models represent surface-level characteristics of their inputs; these surface-level characteristics may be sufficient for linear probes to be accurate on narrow training distributions, but such probes are unlikely to generalize out-of-distribution. Large models linearly represent more abstract concepts, potentially including abstract notions like “truth” which capture shared properties of topically and structurally diverse inputs. In middle regimes, we may find linearly represented concepts of intermediate levels of abstraction, for example, “accurate factual recall” or “close association” (in the sense that “Beijing” and “China” are closely associated). These concepts may suffice to distinguish true/false statements on individual datasets, but will only generalize to test data for which the same concepts


How do we know we’re detecting truth, and not just likely statements?

One approach here is to use a dataset in which the truth and likelihood of inputs are uncorrelated (or negatively correlated), as you kinda did with TruthfulQA. For that, I like to use the "neg_" versions of the datasets from GoT, containing negated statements like "The city of Beijing is not in China." For these datasets, the correlation between truth value and likelihood (operationalized as LLaMA-2-70B's log probability of the full statement) is strong and negative (-0.63 for neg_cities and -.89 for neg_sp_en_trans). But truth probes still often generalize well to these negated datsets. Here are results for LLaMA-2-70B (the horizontal axis shows the train set, and the vertical axis shows the test set).

We also find that the probe performs better than LDA in-distribution, but worse out-of-distribution:

Yep, we found the same thing -- LDA improves things in-distribution, but generalizes work than simple DIM probes.

Why does got_cities_cities_conj generalise well?

I found this result surprising, thanks! I don't really have great guesses for what's going on. One thing I'll say is that it's worth tracking differences between various sorts of factual statements. For example, for LLaMA-2-13B it generally seemed to me that there was better probe transfer between factual recall datasets (e.g. cities and sp_en_trans, but not larger_than). I'm not really sure why the conjunctions are making things so much better, beyond possibly helping to narrow down on "truth" beyond just "correct statement of factual recall." 

I'm not surprised that cities_cities_conj and cities_cities_disj are so qualitatively different -- cities_cities_disj has never empirically played well with the other datasets (in the sense of good probe transfer) and I don't really know why. 


  1. ^

    This is currently under review, but not yet on arxiv, sorry about that! Code in the nnsight branch here. I'll try to come back to add a link to the paper once I post it or it becomes publicly available on OpenReview, whichever happens first.

Comment by Sam Marks (samuel-marks) on Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small · 2024-02-09T23:47:33.882Z · LW · GW

This comment is about why we were getting different MSE numbers. The answer is (mostly) benign -- a matter of different scale factors. My parallel comment, which discusses why we were getting different CE diff numbers is the more important one.

When you compute MSE loss between some activations  and their reconstruction , you divide by variance of , as estimated from the data in a batch. I'll note that this doesn't seem like a great choice to me. Looking at the resulting training loss:

where  is the encoding of  by the autoencoder and  is the L1 regularization constant, we see that if you scale  by some constant , this will have no effect on the first term, but will scale the second term by . So if activations generically become larger in later layers, this will mean that the sparsity term becomes automatically more strongly weighted.

I think a more principled choice would be something like

where we're no longer normalizing by the variance, and are also using sqrt(MSE) instead of MSE. (This is what the dictionary_learning repo does.) When you scale  by a constant , this entire expression scales by a factor of , so that the balance between reconstruction and sparsity remains the same. (On the other hand, this will mean that you might need to scale the learning rate by , so perhaps it would be reasonable to divide through this expression by ? I'm not sure.)

Also, one other thing I noticed: something which we both did was to compute MSE by taking the mean over the squared difference over the batch dimension and the activation dimension. But this isn't quite what MSE usually means; really we should be summing over the activation dimension and taking the mean over the batch dimension. That means that both of our MSEs are erroneously divided by a factor of the hidden dimension (768 for you and 512 for me).

This constant factor isn't a huge deal, but it does mean that:

  1. The MSE losses that we're reporting are deceptively low, at least for the usual interpretation of "mean squared error"
  2. If we decide to fix this, we'll need to both scale up our L1 regularization penalty by a factor of the hidden dimension (and maybe also scale down the learning rate).

This is a good lesson on how MSE isn't naturally easy to interpret and we should maybe just be reporting percent variance explained. But if we are going to report MSE (which I have been), I think we should probably report it according to the usual definition.

Comment by Sam Marks (samuel-marks) on Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small · 2024-02-09T23:18:20.695Z · LW · GW

Yep, as you say, @Logan Riggs figured out what's going on here: you evaluated your reconstruction loss on contexts of length 128, whereas I evaluated on contexts of arbitrary length. When I restrict to context length 128, I'm able to replicate your results.

Here's Logan's plot for one of your dictionaries (not sure which)

and here's my replication of Logan's plot for your layer 1 dictionary

Interestingly, this does not happen for my dictionaries! Here's the same plot but for my layer 1 residual stream output dictionary for pythia-70m-deduped

(Note that all three plots have a different y-axis scale.)

Why the difference? I'm not really sure. Two guesses:

  1. The model: GPT2-small uses learned positional embeddings whereas Pythia models use rotary embeddings
  2. The training: I train my autoencoders on variable-length sequences up to length 128; left padding is used to pad shorter sequences up to length 128. Maybe this makes a difference somehow.

In terms of standardization of which metrics to report, I'm torn. On one hand, for the task your dictionaries were trained on (reconstruction activations taken from length 128 sequences), they're performing well and this should be reflected in the metrics. On the other hand, people should be aware that if they just plug your autoencoders into GPT2-small and start doing inference on inputs found in the wild, things will go off the rails pretty quickly. Maybe the answer is that CE diff should be reported both for sequences of the same length used in training and for arbitrary-length sequences?

Comment by Sam Marks (samuel-marks) on Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small · 2024-02-07T22:42:43.668Z · LW · GW

My SAEs also have a tied decoder bias which is subtracted from the original activations. Here's the relevant code in

def encode(self, x):
        return nn.ReLU()(self.encoder(x - self.bias))
    def decode(self, f):
        return self.decoder(f) + self.bias
    def forward(self, x, output_features=False, ghost_mask=None):
            f = self.encode(x)
            x_hat = self.decode(f)
            return x_hat

Note that I checked that our SAEs have the same input-output behavior in my linked colab notebook. I think I'm a bit confused why subtracting off the decoder bias had to be done explicitly in your code -- maybe you used dictionary.encoder and dictionary.decoder instead of dictionary.encode and dictionary.decode? (Sorry, I know this is confusing.) ETA: Simple things I tried based on the hypothesis "one of us needs to shift our inputs by +/- the decoder bias" only made things worse, so I'm pretty sure that you had just initially converted my dictionaries into your infrastructure in a way that messed up the initial decoder bias, and therefore had to hand-correct it.

I note that the MSE Loss you reported for my dictionary actually is noticeably better than any of the MSE losses I reported for my residual stream dictionaries! Which layer was this? Seems like something to dig into.

Comment by Sam Marks (samuel-marks) on Some open-source dictionaries and dictionary learning infrastructure · 2024-02-07T22:33:12.461Z · LW · GW

At the time that I made this post, no, but this has been implemented in dictionary_learning since I saw your suggestion to do so in your linked post.

Comment by Sam Marks (samuel-marks) on Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small · 2024-02-07T01:52:00.548Z · LW · GW

Another sanity check: when you compute CE loss using the same code that you use when computing CE loss when activations are reconstructed by the autoencoders, but instead of actually using the autoencoder you just plug the correct activations back in, do you get the same answer (~3.3) as when you evaluate CE loss normally?

Comment by Sam Marks (samuel-marks) on Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small · 2024-02-07T01:34:17.240Z · LW · GW

In the notebook I link in my original comment, I check that the activations I get out of nnsight are the same as the activations that come from transformer_lens. Together with the fact that our sparsity statistics broadly align, I'm guessing that the issue isn't that I'm extracting different activations than you are.

Repeating my replication attempt with data from OpenWebText, I get this:

Layer MSE Loss % Variance Explained L1 L0 % Alive CE Reconstructed
1 0.069 95 40 15 46 6.45
7 0.81 86 125 59.2 96 4.38

Broadly speaking, same story as above, except that the MSE losses look better (still not great), and that the CE reconstructed looks very bad for layer 1.

I don't much padding at all, that might be a big difference too.

Seems like there was a typo here -- what do you mean?

Logan Riggs reports that he tried to replicate your results and got something more similar to you. I think Logan is making decisions about padding and tokenization more like the decisions you make, so it's possible that the difference is down to something around padding and tokenization.

Possible next steps:

  • Can you report your MSE Losses (instead of just variance explained)?
  • Can you try to evaluate the residual stream dictionaries in the 5_32768 set released here? If you get CE reconstructed much better than mine, then it means that we're computing CE reconstructed in different ways, where your way consistently reports better numbers. If you get CE reconstructed much worse than mine, then it might mean that there's a translation error between our codebases (e.g. using different activations).
Comment by Sam Marks (samuel-marks) on Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small · 2024-02-06T07:55:10.447Z · LW · GW

I tried replicating your statistics using my own evaluation code (in here). I pseudo-randomly chose layer 1 and layer 7. Sadly, my results look rather different from yours:

Layer MSE Loss % Variance Explained L1 L0 % Alive CE Reconstructed
1 0.11 92 44 17.5 54 5.95
7 1.1 82 137 65.4 95 4.29

Places where our metrics agree: L1 and L0.

Places where our metrics disagree, but probably for a relatively benign reason:

  • Percent variance explained: my numbers are slightly lower than yours, and from a brief skim of your code I think it's because you're calculating variance slightly incorrectly: you're not subtracting off the activation's mean before doing .pow(2).sum(-1). This will slightly overestimate the variance of the original activations, so probably also overestimate percent variance explained.
  • Percent alive: my numbers are slightly lower than yours, and this is probably because I determined whether neurons are alive on a (somewhat small) batch of 8192 tokens. So my number is probably an underestimate and yours is correct.

Our metrics disagree strongly on CE reconstructed, and this is a bit alarming. It means that either you have a bug which significantly underestimates reconstructed CE loss, or I have a bug which significantly overestimates it. I think I'm 50/50 on which it is. Note that according to my stats, your MSE loss is kinda bad, which would suggest that you should also have high CE reconstructed (especially when working with residual stream dictionaries! (in contrast to e.g. MLP dictionaries which are much more forgiving)).

Spitballing a possible cause: when computing CE loss, did you exclude padding tokens? If not, then it's possible that many of the tokens on which you're computing CE are padding tokens, which is artificially making your CE look extremely good.

Here is my code. You'll need to pip install nnsight before running it. Many thanks to Caden Juang for implementing the UnifiedTransformer functionality in nnsight, which is a crazy Frankenstein marriage of nnsight and transformer_lens; it would have been very hard for me to attempt this replication without this feature.

Comment by Sam Marks (samuel-marks) on Sam Marks's Shortform · 2024-02-06T06:56:29.601Z · LW · GW

Some updates about the dictionary_learning repo:

  • The repo now has support for ghost grads. h/t g-w1 for submitting a PR for this
  • ActivationBuffers now work natively with model components -- like the residual stream -- whose activations are typically returned as tuples; the buffer knows to take the first component of the tuple (and will iteratively do this if working with nested tuples).
  • ActivationBuffers can now be stored on the GPU.
  • The file contains code for evaluating trained dictionaries. I've found this pretty useful for quickly evaluating dictionaries people send to me.
  • New convenience: you can do reconstructed_acts, features = dictionary(acts, output_features=True) to get both the reconstruction and the features computed by dictionary.

Also, if you'd like to train dictionaries for many model components in parallel, you can use the parallel branch. I don't promise to never make breaking changes to the parallel branch, sorry.

Finally, we've released a new set of dictionaries for the MLP outputs, attention outputs, and residual stream in all layers of Pythia-70m-deduped. The MLP and attention dictionaries seem pretty good, and the residual stream dictionaries seem like a mixed bag. Their stats can be found here.

Comment by Sam Marks (samuel-marks) on TurnTrout's shortform feed · 2024-01-20T02:29:45.864Z · LW · GW

Thanks, I've disliked the shoggoth meme for a while, and this post does a better job articulating why than I've been able to do myself.

Comment by Sam Marks (samuel-marks) on What’s up with LLMs representing XORs of arbitrary features? · 2024-01-12T19:07:53.960Z · LW · GW

Imo "true according to Alice" is nowhere near as "crazy" a feature as "has_true XOR has_banana". It seems useful for the LLM to model what is true according to Alice! (Possibly I'm misunderstanding what you mean by "crazy" here.)

I agree with this! (And it's what I was trying to say; sorry if I was unclear.) My point is that 
{ features which are as crazy as "true according to Alice" (i.e., not too crazy)} 
seems potentially manageable, where as 
{ features which are as crazy as arbitrary boolean functions of other features } 
seems totally unmanageable.

Thanks, as always, for the thoughtful replies.

Comment by Sam Marks (samuel-marks) on What’s up with LLMs representing XORs of arbitrary features? · 2024-01-12T00:12:08.578Z · LW · GW

Idk, I think it's pretty hard to know what things are and aren't useful for predicting the next token. For example, some of your features involve XORing with a "has_not" feature -- XORing with an indicator for "not" might be exactly what you want to do to capture the effect of the "not".

I agree that "the model has learned the algorithm 'always compute XORs with has_not'" is a pretty sensible hypothesis. (And might be useful to know, if true!) FWIW, the stronger example of "clearly not useful XORs" I was thinking of has_true XOR has_banana, where I'm guessing you're anticipating that this XOR exists incidentally.

If you want you could rephrase this issue as " and  are spuriously correlated in training," so I guess I should say "even in the absence of spurious correlations among basic features."

... That's exactly how I would rephrase the issue and I'm not clear on why you're making a sharp distinction here.

Focusing again on the Monster gridworld setting, here are two different ways that your goals could misgeneralize:

  1. player_has_shield is spuriously correlated with high_score during training, so the agent comes to value both
  2. monster_present XOR high_score is spuriously correlated with high_score during training, so the agent comes to value both.

These are pretty different things that could go wrong. Before realizing that these crazy XOR features existed, I would only have worried about (1); now that I know these crazy XOR features exist ... I think I mostly don't need to worry about (2), but I'm not certain and it might come down to details about the setting. (Indeed, your CCS challenges work has shown that sometimes these crazy XOR features really can get in the way!)

I agree that you can think of this issue as just being the consequence of the two issues "there are lots of crazy XOR features" and "linear probes can pick up on spurious correlations," I guess this issue feels qualitatively new to me because it just seems pretty untractable to deal with it on the data augmentation level (how do you control for spurious correlations with arbitrary boolean functions of undesired features?). I think you mostly need to hope that it doesn't matter (because the crazy XOR directions aren't too salient) or come up with some new idea.

I'll note that if it ends up these XOR directions don't matter for generalization in practice, then I start to feel better about CCS (along with other linear probing techniques).[1]

my main claim is that it shouldn't be surprising

If I had to articulate my reason for being surprised here, it'd be something like:

  1. I didn't expect LLMs to compute many XORs incidentally
  2. I didn't expect LLMs to compute many XORs because they are useful

but lots of XORs seem to get computed anyway. So at least one of these two mechanisms is occurring a surprising (to me) amount. If there's a lot more incidental computation, then why? (Based on Fabian's experiments, maybe the answer is "there's more redundancy than I expected," which would be interesting.) If there's a lot more intentional computation of XORs than I expected, then why? (I've found the speculation that LLMs might just computing a bunch of XORs up front because they don't know what they'll need later interesting.) I could just update my world model to "lots of XORs exist for either reasons (1) or (2)," but I sure would be interested in knowing which of (1) or (2) it is and why.


  1. ^

    I know that for CCS you're more worried about issues around correlations with features like true_according_to_Alice, but my feeling is that we might be able to handle spurious features that are that crazy and numerous, but not spurious features as crazy and numerous as these XORs.

Comment by Sam Marks (samuel-marks) on What’s up with LLMs representing XORs of arbitrary features? · 2024-01-11T16:11:02.622Z · LW · GW

I agree with a lot of this, but some notes:

Exponentially many features


On utility explanations, you would expect that multi-way XORs are much less useful for getting low loss than two-way XORs, and so computation for multi-way XORs is never developed.

The thing that's confusing here is that the two-way XORs that my experiments are looking at just seem clearly not useful for anything. So I think any utility explanation that's going to be correct needs to be a somewhat subtle one of the form "the model doesn't initially know which XORs will be useful, so it just dumbly computes way more XORs than it needs, including XORs which are never used in any example in training." Or in other words "the model has learned the algorithm 'compute lots of XORs' rather than having learned specific XORs which it's useful to compute."

I think this subtlety changes the story a bit. One way that it changes the story is that you can't just say "the model won't compute multi-way XORs because they're not useful" -- the two-way XORs were already not useful! You instead need to argue that the model is implementing an algorithm which computed all the two-way XORs but didn't compute XORs of XORs; it seems like this algorithm might need to encode somewhere information about which directions correspond to basic features and which don't.

On the other hand, RAX introduces a qualitatively new way that linear probes can fail to learn good directions. Suppose a is a feature you care about (e.g. “true vs. false statements”) and b is some unrelated feature which is constant in your training data (e.g. b = “relates to geography”). [...]

Fwiw, failures like this seem plausible without RAX as well. We explicitly make this argument in our goal misgeneralization paper (bottom of page 9 / Section 4.2), and many of our examples follow this pattern (e.g. in Monster Gridworld, you see a distribution shift from "there is almost always a monster present" in training to "there are no monsters present" at test time).

Even though on a surface level this resembles the failure discussed in the post (because one feature is held fixed during training), I strongly expect that the sorts of failures you cite here are really generalization failure for "the usual reasons" of spurious correlations during training. For example, during training (because monsters are present), "get a high score" and "pick up shields" are correlated, so the agents learn to value picking up shields. I predict that if you modified the train set so that it's no longer useful to pick up shields (but monsters are still present), then the agent would no longer pick up shields, and so would no longer misgeneralize in this particular way.

In contrast, the point I'm trying to make in the post is that RAX can cause problems even in the absence of spurious correlations like this.[1]

I don't think the model has to do any active tracking; on both hypotheses this happens by default (in incidental explanations, because of the decay postulate, and in utility explanations, because the  feature is less useful and so fewer resources go towards computing it).

As you noted, it will sometimes be the case that XOR features are more like basic features than derived features, and thus will be represented with high salience. I think incidental hypotheses will have a really hard time explaining this -- do you agree?

For utility hypotheses, the point is that there needs to be something different in model internals which says "when computing these features represent the result with low salience, but when computing these features represent the result with high salience." Maybe on your model this is something simple like the weights computing the basic features being larger than weights computing derived features? If so, that's the tracking I'm talking about, and is a potential thread to pull on for distinguishing basic vs. derived features using model internals.


  1. ^

    If you want you could rephrase this issue as " and  are spuriously correlated in training," so I guess I should say "even in the absence of spurious correlations among basic features."

Comment by Sam Marks (samuel-marks) on What’s up with LLMs representing XORs of arbitrary features? · 2024-01-08T21:58:34.554Z · LW · GW

Thanks, you're totally right about the equal variance thing -- I had stupidly thought that the projection of  onto y = x would be uniform on  (obviously false!).

The case of a fully discrete distribution (supported in this case on four points) seems like a very special case of a something more general, where a "more typical" special case would be something like:

  • if a, b are both false, then sample from 
  • if a is true and b is false, then sample from 
  • if a is false and b is true then sample from 
  • if a and b are true, then sample from 

for some  and covariance matrix . In general, I don't really expect the class-conditional distributions to be Gaussian, nor for the class-conditional covariances to be independent of the class. But I do expect something broadly like this, where the distributions are concentrated around their class-conditional means with probability falling off as you move further from the class-conditional mean (hence unimodality), and that the class-conditional variances are not too big relative to the distance between the clusters.

Given that longer explanation, does the unimodality thing still seem directionally wrong?

Comment by Sam Marks (samuel-marks) on What’s up with LLMs representing XORs of arbitrary features? · 2024-01-08T14:44:55.430Z · LW · GW

Thanks, you're correct that my definition breaks in this case. I will say that this situation is a bit pathological for two reasons:

  1. The mode of a uniform distribution doesn't coincide with its mean.
  2. The variance of the multivariate uniform distribution  is largest along the direction , which is exactly the direction which we would want to represent a AND b.

I'm not sure exactly which assumptions should be imposed to avoid pathologies like this, but maybe something of the form: we are working with boolean features  whose class-conditional distributions  satisfy properties like

  •  are unimodal, and their modes coincide with their means
  • The variance of  along any direction is not too large relative to the difference of the means 
Comment by Sam Marks (samuel-marks) on What’s up with LLMs representing XORs of arbitrary features? · 2024-01-08T14:36:04.889Z · LW · GW

Neat hypothesis! Do you have any ideas for how one would experimentally test this?

Comment by Sam Marks (samuel-marks) on What’s up with LLMs representing XORs of arbitrary features? · 2024-01-07T16:49:54.886Z · LW · GW

Some features which are computed from other features should probably themselves be treated as basic and thus represented with large salience.

Comment by Sam Marks (samuel-marks) on What’s up with LLMs representing XORs of arbitrary features? · 2024-01-06T00:28:57.073Z · LW · GW

Using a dataset of 10,000 inputs of the form
[random LLaMA-13B generated text at temperature 0.8] [either the most likely next token or the 100th most likely next token, according to LLaMA-13B] ["true" or "false"] ["banana" or "shed"]
I've rerun the probing experiments. The possible labels are

  • has_true: is the second last token "true" or "false"?
  • has_banana: is the last token "banana" or "shed"?
  • label: is the third last token the most likely or the 100th most likely?

(this weird last option is because I'm adapting a dataset from the Geometry of Truth paper about likely vs. unlikely text).

Here are the results for LLaMA-2-13B

And here are the results for the reset network

I was a bit surprised that the model did so badly on has_true, but in hindsight, considering that the activations are extracted over the last token and "true"/"false" is the penultimate token, this seems fine.

Mostly I view this as a sanity check to make sure that when the dataset is larger we don't get the <<50% probe accuracies. I think to really dig into this more, one would need to do this with features which are not token-level and which are unambiguously linearly accessible (unlike the "label" feature here).

@ryan_greenblatt @abhatt349 @Fabien Roger 

Comment by Sam Marks (samuel-marks) on What’s up with LLMs representing XORs of arbitrary features? · 2024-01-05T11:01:51.187Z · LW · GW

Yes, you are correct, thanks. I'll edit the post when I get a chance.

Comment by Sam Marks (samuel-marks) on What’s up with LLMs representing XORs of arbitrary features? · 2024-01-05T10:23:26.984Z · LW · GW

If anyone would like to replicate these results, the code can be found in the rax branch of my geometry-of-truth repo. This was adapted from a codebase I used on a different project, so there's a lot of uneeded stuff in this repo. The important parts here are:

  • The datasets: cities_alice.csv and neg_cities_alice.csv (for the main experiments I describe), cities_distractor.csv and neg_cities_distractor.csv (for the experiments with banana/shed at the end of factual statements), and xor.csv (for the experiments with true/false and banana/shed after random text).
  • xor_probing.ipynb: my code for doing the probing and making the plots. This assumes that the activations have already been extracted and saved using (see the readme for info about how to use

Unless you want to do PCA visualizations, I'd probably recommend just taking my datasets and quickly writing your own code to do the probing experiments, rather than spending time trying to figure out my infrastructure here.

Comment by Sam Marks (samuel-marks) on What’s up with LLMs representing XORs of arbitrary features? · 2024-01-05T10:16:39.411Z · LW · GW

Thanks for doing this! Can you share the dataset that you're working with? I'm traveling right now, but when I get a chance I might try to replicate your failed replication on LLaMA-2-13B and with my codebase (which can be found here; see especially xor_probing.ipynb).

Comment by Sam Marks (samuel-marks) on What’s up with LLMs representing XORs of arbitrary features? · 2024-01-04T00:58:58.114Z · LW · GW

Idk, I think I would guess that all of the most salient features will be things related to the meaning of the statement at a more basic level. E.g. things like: the statement is finished (i.e. isn't an ongoing sentence), the statement is in English, the statement ends in a word which is the name of a country, etc.

My intuition here is mostly based on looking at lots of max activating dataset examples for SAE features for smaller models (many of which relate to basic semantic categories for words or to basic syntax), so it could be bad here (both because of model size and because the period token might carry more meta-level "summarized" information about the preceding statement).

Anyway, not really a crux, I would agree with you for some not-too-much-larger value of 50.

Comment by Sam Marks (samuel-marks) on What’s up with LLMs representing XORs of arbitrary features? · 2024-01-04T00:49:20.902Z · LW · GW

There's 1500 statements in each of cities and neg_cities, and LLaMA-2-13B has residual stream dimension 5120. The linear probes are trained with vanilla logistic regression on {80% of the data in cities} \cup {80% of the data in neg_cities} and the accuracies reported are evaluated on {remaining 20% of the data in cities} \cup {remaining 20% of the data in neg_cities}.

So, yeah, I guess that the train and val sets are drawn from the same distribution but are not independent (because of the issue I mentioned in my comment above). Oops! I guess I never thought about how with small datasets, doing an 80/20 train/test split can actually introduce dependencies between the train and test data. (Also yikes, I see people do this all the time.)

Anyway, it seems to me that this is enough to explain the <50% accuracies -- do you agree?

Comment by Sam Marks (samuel-marks) on What’s up with LLMs representing XORs of arbitrary features? · 2024-01-03T23:04:23.481Z · LW · GW

(Are you saying that you think factuality is one of the 50 most salient features when the model processes inputs like "The city of Chicago is not in Madagascar."? I think I'd be pretty surprised by this.)

(To be clear, factuality is one of the most salient feature relative to the cities/neg_cities datasets, but it seems like the right notion of salience here is relative to the full data distribution.)

Comment by Sam Marks (samuel-marks) on What’s up with LLMs representing XORs of arbitrary features? · 2024-01-03T22:59:12.474Z · LW · GW

I'm not really sure, but I don't think this is that surprising. I think when we try to fit a probe to "label" (the truth value of the statement), this is probably like fitting a linear probe to random data. It might overfit on some token-level heuristic which is ideosyncratically good on the train set but generalizes poorly to the val set. E.g. if disproportionately many statements containing "India" are true on the train set, then it might learn to label statements containing "India" as true; but since in the full dataset, there is no correlation between "India" and being true, correlation between "India" and true in the val set will necessarily have the opposite sign.

Comment by Sam Marks (samuel-marks) on What’s up with LLMs representing XORs of arbitrary features? · 2024-01-03T22:55:37.482Z · LW · GW

The thing that remains confusing here is that for arbitrary features like these, it's not obvious why the model is computing any nontrivial boolean function of them and storing it along a different direction. And if the answer is "the model computes this boolean function of arbitrary features" then the downstream consequences are the same, I think.

Comment by Sam Marks (samuel-marks) on What’s up with LLMs representing XORs of arbitrary features? · 2024-01-03T22:53:34.845Z · LW · GW

What is "this"? It sounds like you're gesturing at the same thing I discuss in the section "Maybe  is represented “incidentally” because it’s possible to aggregate noisy signals from many features which are correlated with boolean functions of a and b"

Comment by Sam Marks (samuel-marks) on Discussion: Challenges with Unsupervised LLM Knowledge Discovery · 2023-12-20T20:45:41.013Z · LW · GW

Thanks for the detailed replies!

Comment by Sam Marks (samuel-marks) on Discussion: Challenges with Unsupervised LLM Knowledge Discovery · 2023-12-19T22:16:01.319Z · LW · GW

Thanks! I'm still pretty confused though.

It sounds like you're making an empirical claim that in this banana/shed example, the model is representing the features , and  along linearly independent directions. Are you saying that this claim is supported by PCA visualizations you've done? Maybe I'm missing something, but none of the PCA visualizations I'm seeing in the paper seem to touch on this. E.g. visualization in figure 2(b) (reproduced below) is colored by , not . Are there other visualizations showing linear structure to the feature  independent of the features  and ? (I'll say that I've done a lot of visualizing true/false datasets with PCA, and I've never noticed anything like this, though I never had as clean a distractor feature as banana/shed.)

More broadly, it seems like you're saying that you think in general, when LLMs have linearly-represented features  and  they will also tend to linearly represent the feature . Taking this as an empirical claim about current models, this would be shocking. (If this was meant to be a claim about a possible worst-case world, then it seems fine.) 

For example, if I've done my geometry right, this would predict that if you train a supervised probe (e.g. with logistic regression) to classify  vs  on a dataset where , the resulting probe should get ~50% accuracy on a test dataset where . And this should apply for any features . But this is certainly not the typical case, at least as far as I can tell!

Concretely, if we were to prepare a dataset of 2-token prompts where the first word is always "true" or "false" and the second word is always "banana" or "shed," do you predict that a probe trained with logistic regression on the dataset  will have poor accuracy when tested on ?

Comment by Sam Marks (samuel-marks) on Discussion: Challenges with Unsupervised LLM Knowledge Discovery · 2023-12-19T01:21:15.963Z · LW · GW

Actually, no,  would not result in . To get that  you would need to take  where  is determined by whether the word true is present (and not by whether "" is true). 

But I don't think this should be possible:  are supposed to have their means subtracted off (thereby getting rid of the the linearly-accessible information about ).

Comment by Sam Marks (samuel-marks) on Discussion: Challenges with Unsupervised LLM Knowledge Discovery · 2023-12-19T01:09:39.646Z · LW · GW

EDIT: Nevermind, I don't think the above is a reasonable explanation of the results, see my reply to this comment.

Original comment:

Gotcha, that seems like a possible interpretation of the stuff that they wrote, though I find it a bit surprising that CCS learned the probe  (and think they should probably remark on this).

In particular, based on the dataset visualizations in the paper, it doesn't seem possible for a linear probe to implement . But it's possible that if you were to go beyond the 3 dimensions shown the true geometry would look more like the following (from here) (+ a lateral displacement between the two datasets). 

In this case, a linear probe could learn an xor just fine.

Comment by Sam Marks (samuel-marks) on Discussion: Challenges with Unsupervised LLM Knowledge Discovery · 2023-12-19T00:59:19.588Z · LW · GW

I see that you've unendorsed this, but my guess is that this is indeed what's going on. That is, I'm guessing that the probe learned is  so that . I was initially skeptical on the basis of the visualizations shown in the paper -- it doesn't look like a linear probe should be able to learn an xor like this. But if the true geometry is more like the figures below (from here) (+ a lateral displacement between the two datasets), then the linear probe can learn an xor just fine.

Comment by Sam Marks (samuel-marks) on Discussion: Challenges with Unsupervised LLM Knowledge Discovery · 2023-12-19T00:16:46.992Z · LW · GW

I am very confused about some of the reported experimental results.

Here's my understanding the banana/shed experiment (section 4.1):

  • For half of the questions, the word "banana" was appended to both elements of the of the contrast pair  and . Likewise for the other half, the word "shed" was appended to both elements of the contrast pair.
  • Then a probe was trained with CCS on the dataset of contrast pairs .
  • Sometimes, the result was the probe  where  if  ends with "banana" and  otherwise.

I am confused because this probe does not have low CCS loss. Namely, for each contrast pair  in this dataset, we would have  so that the consistency loss will be high. The identical confusion applies for my understanding of the "Alice thinks..." experiment.

To be clear, I'm not quite as confused about the PCA and k-means versions of this result: if the presence of "banana" or "shed" is not encoded strictly linearly, then maybe  could still contain information about whether  and  both end in "banana" or "shed." I would also not be confused if you were claiming that CCS learned the probe  (which is the probe that your theorem 1 would produce in this setting); but this doesn't seem to be what the claim is (and is not consistent with figure 2(a)).

Is the claim that the probe  is learned despite it not getting low CCS loss? Or am I misunderstanding the experiment?

Comment by Sam Marks (samuel-marks) on Some open-source dictionaries and dictionary learning infrastructure · 2023-12-05T23:26:44.993Z · LW · GW

Here's an experiment I'm about to do:

  • Remove high-frequency features from 0_8192 layer 3 until it has L0 < 40 (the same L0 as the 1_32768 layer 3 dictionary)
  • Recompute statistics for this modified dictionary.

I predict the resulting dictionary will be "like 1_32768 but a bit worse." Concretely, I'm guessing that means % loss recovered around 72%. 



I killed all features of frequency larger than 0.038. This was 2041 features, and resulted in a L0 just below 40. The stats:

MSE Loss: 0.27 (worse than 1_32768)

Percent loss recovered: 77.9% (a little bit better than 1_32768)

I was a bit surprised by this -- it suggests the high-frequency features are disproportionately likely to be useful for reconstructing activations in ways that don't actually mater to the model's computation. (Though then again, maybe this is what we expect for uninterpretable features.)

It also suggests that we might be better off training dictionaries with a too-low L1 penalty and then just pruning away high-frequency features (sort of the dual operation of "train with a high L1 penalty and resample low-frequency features"). I'd be interested for someone to explore if there's a version of this that helps.

Comment by Sam Marks (samuel-marks) on Some open-source dictionaries and dictionary learning infrastructure · 2023-12-05T19:11:09.430Z · LW · GW

I agree that the L0's for 0_8192 are too high in later layers, though I'll note that I think this is mainly due to the cluster of high-frequency features (see the spike in the histogram). Features outside of this spike look pretty decent, and without the spike our L0s would be much more reasonable. 

Here are four random features from layer 3, at a range of frequencies outside of the spike.

Layer 3, 0_8192, feature 138 (frequency = 0.003) activates on the newline at the end of the "field of the invention" section in patent applications. I think it's very likely predicting that the next few tokens will be "2. Description of the Related Art" (which always comes next in patents).

Layer 3, 0_8192, feature 27 (frequency = 0.009) seems to activate on the "is" in the phrase "this is"

Layer 3, 0_8192, feature 4 (frequency = 0.026) looks messy at first, but on closer inspection seems to activate on the final token of multi-token words in informative file/variable names.

Layer 3, 0_8192, feature 56 (frequency = 0.035) looks very polysemantic: it's activating on certain terms in LaTeX expressions, words in between periods in urls and code, and some other random-looking stuff.

Comment by Sam Marks (samuel-marks) on Thoughts on “AI is easy to control” by Pope & Belrose · 2023-12-03T22:36:29.308Z · LW · GW

I agree with everything you wrote here and in the sibling comment: there are reasonable hopes for bootstrapping alignment as agents grow smarter; but without a concrete bootstrapping proposal with an accompanying argument, <1% P(doom) from failing to bootstrap alignment doesn't seem right to me.

I'm guessing this is my biggest crux with the Quintin/Nora worldview, so I guess I'm bidding for -- if Quintin/Nora have an argument for optimism about bootstrapping beyond "it feels like this should work because of iterative design" -- for that argument to make it into the forthcoming document.

Comment by Sam Marks (samuel-marks) on How useful is mechanistic interpretability? · 2023-12-03T06:09:48.176Z · LW · GW

Another metric is: comparing the similarity between two dictionaries using mean max cosine similarity (where one of the dictionaries is treated as the ground truth), we've found that two dictionaries trained from different random seeds on the same (non-randomized) model are highly similar (>.95), whereas dictionaries trained on a randomized model and an non-randomized model are dissimilar (<.3 IIRC, but I don't have the data on hand).

Comment by Sam Marks (samuel-marks) on How useful is mechanistic interpretability? · 2023-12-03T00:38:46.554Z · LW · GW

The way I would phrase this concern is "SAEs might learn to pick up on structure present in the underlying data, rather than to pick up on learned structure in NN activations." E.g. since "tree" is a class of things defined by a bunch of correlations present in the underlying image data, it's possible that images of trees will naturally cluster in NN activations even when the NN has no underlying tree concept; SAEs would still be able to detect and learn this cluster as one of their neurons.

I agree this is a valid critique. Here's one empirical test which partially gets at it: what happens when you train an SAE on a NN with random weights? (I.e. you randomize the parameters of your NN, and then train an SAE on its activations on real data in the normal way.) Then to the extent that your SAE has good-looking features, that must be because your SAE was picking up on structure in the underlying data.

My collaborators and I did this experiment. In more detail, we trained SAEs on Pythia-70m's MLPs, then did this again but after randomizing the weights of Pythia-70m. Take a moment to predict the results if you want etc etc.

The SAEs that we trained on a random network looked bad. The most interesting dictionary features we found were features that activated on particular tokens (e.g. features that activated on the "man" feature and no others). Most of the features didn't look like anything at all, activating on a large fraction (>10%) of tokens in our data, with no obvious patterns.(The features for dictionaries trained on the non-random network looked much better.)

We also did a variant of this experiment where use randomized Pythia-70m's parameters except for the embedding layer. In this variant, the most interesting features we found were features which fired on a few closely semantically related tokens (e.g. the tokens "make," "makes," and "making").

Thanks to my collaborators for this experiment: Aaron Mueller and David Bau.

I agree that a reasonable intuition for what SAEs do is: identify "basic clusters" in NN activations (basic in the sense that you allow compositionality, i.e. you don't try to learn clusters whose centroids are the sums of the centroids of previously-learned clusters). And these clusters might exist because:

  1. your NN has learned concepts and these clusters correspond to concepts (what we hope is the reason), or
  2.  because of correlations present in your underlying data (the thing that you seem to be worried about).

Beyond the preliminary empirics I mentioned above, I think there are some theoretical reasons to hope that SAEs will mostly learn the first type of cluster:

  • Most clusters in NN activations on real data might be of the first type
    • This is because the NN has already, during training, noticed various correlations in the data and formed concepts around them (to the extent that these concepts were useful for getting low loss, which they typically will be if your model is trained on next-token prediction (a task which incentivizes you to model all the correlations)).
  • Clusters of the second type might not have any interesting compositional structure, but your SAE gets bonus points for learning clusters which participate in compositional structure.
    • E.g. If there are five clusters with centroids w, x, y, z, and y + z and your SAE can only learn 2 of them, then it would prefer to learn the clusters with centroids y and z (because then it can model the cluster with centroid y + z for free).
Comment by Sam Marks (samuel-marks) on How useful is mechanistic interpretability? · 2023-12-02T03:46:15.438Z · LW · GW

Maybe looking at the connections of your classifer (what earlier features it connects to and what these connect to) and applying selection to the classifer based on the connections will be good. This can totally be applied to probes. (Maybe there is some reason why looking at connections will be especially good for features but not probes, but if so, why?)

"Can this be applied to probes" is a crux for me. It sounds like you're imagining something like:

  • Train a bunch of truthfulness probes regularized to be distinct from each other.
  • Train a bunch of probes for "blacklsited" features which we don't think should be associated to truth (e.g. social reasoning, intent to lie, etc.).
  • (Unsure about this step.) Check which truth directions are causally downstream of blacklisted feature directions (with patching experiments?). Use that to discriminate among the probes.

Is that right?

This is not an option I had considered, and it would be very exciting to me if it worked. I have some vague intuition that this should all go better when you are working with features (e.g. because the causal dependencies among the features should be sparse), but I would definitely need to think about that position more.

Comment by Sam Marks (samuel-marks) on How useful is mechanistic interpretability? · 2023-12-01T23:39:55.370Z · LW · GW

Let me try to state something which captures most of that approach to make sure I understand:

Everything you wrote describing the hope looks right to me.

It's worth noting that I can't imagine this resulting in vary ambitious applications, though the reduction in doom could still be substantial.

To be clear, what does "ambitious" mean here? Does it mean "producing a large degree of understanding?"

If we don't understand much of the training compute then there will be decompositions which look to us like a good enough decomposition while hiding arbitrary stuff in the residual between our understanding and what's going on.


If we want to look at connections, then imperfect understanding will probably bite pretty hard particularly as the effect size of the connection gets smaller and smaller (either due to path length >1 or just there being many things which are directly connected but have a small effect).

These seem like important intuitions, but I'm not sure I understand or share them. Suppose I identify a sentiment feature. I agree there's a lot of room for variation in what precise notion of sentiment the model is using, and there are lots of different ways this sentiment feature could be interacting with the network that are difficult to understand. But maybe I don't really care about that, I just want a classifier for something which is close enough to my internal notion of sentiment.

Just so with truth: there's probably lots of different subtly different notions of truth, but for the application of "detecting whether my AI believes statement X to be true" I don't care about that. I do care about the difference between "true" and "humans think is true," but that's a big difference that I can understand (even if I can't produce examples), and where I can articulate the sorts of cognition which probably should/shouldn't be involved in it.

What's the specific way you imagine this failing? Some options:

  • None of the features we identify really seem to correspond to something resembling our intuitive notion of "truth" (e.g. because they frequently activate on unrelated concepts).
  • We get a bunch of features that look like truth, but can't really tell what goes into computing them.
  • We get a bunch of features that look like truth and we have some vague sense of how they're computed, but they don't seem differentiated in how "sketchy" these computational graphs look: either they all seem to rely on social reasoning or they all don't seem to.

Maybe a better question would be - why didn't these issues (lack of robust explanation) get in the way of the Steinhardt paper I linked? They were in fact able to execute something like the plan I sketch here: use vague understanding to guess which model components attend to features which are spuriously correlated with the thing you want, then use the rest of the model as an improved classifier for the thing you want.

Comment by Sam Marks (samuel-marks) on Thoughts on “AI is easy to control” by Pope & Belrose · 2023-12-01T20:22:56.788Z · LW · GW

What follows is a note I wrote responding to the AI Optimists essay, explaining where I agree and disagree. I was thinking about posting this somewhere, so I figure I'll leave it in the comments here. (So to be clear, it's responding to the AI Optimists essay, not responding to Steven's post.)

Places I think AI Optimists and I agree:

  • We have a number of advantages for aligning NNs that we don’t have for humans: white box access, better control over training environments and inputs, better control over the reward signal, and better ability to do research about which alignment techniques are most effective.
  • Evolution is a misleading analogy for many aspects of the alignment problem; in particular, gradient-based optimization seems likely to have importantly different training dynamics from evolution, like making it harder to gradient hack your training process into retaining cognition which isn’t directly useful for producing high-reward outputs during training.
  • Humans end up with learned drives, e.g. empathy and revenge, which are not hard-coded into our reward systems. AI systems also have not-strictly-optimal-for-their-training-signal learned drives like this.
  • It shouldn’t be difficult for AI systems to faithfully imitate human value judgements and uncertainty about those value judgements.

Places I think we disagree, but I’m not certain. The authors of the Optimists article promise a forthcoming document which addresses pessimistic arguments, and these bullet points are something like like “points I would like to see addressed in this document.”

  • I’m not sure we’re worrying about the same regimes.
    • The regime I’m most worried about is:
      • AI systems which are much smarter than the smartest humans
      • These AI systems are aligned in a controlled lab environment, but then deployed into the world at-large. Many of their interactions are difficult to monitor (and are also interactions with other AI systems).
      • Possibly: these AI systems are highly multi-modal, including sensors which look like “camera readouts of real-world data”
    • It’s unclear to me whether the authors are discussing alignment in a regime like the one above, or a regime like “LLMs which are not much smarter than the smartest humans.” (I too am very optimistic about remaining safe in this latter regime.)
      • When they write things like “AIs are white boxes, we have full control over their ‘sensory environment’,” it seems like they’re imagining the latter regime.
      • They’re not very clear about what intelligence regime they’re discussing, but I’m guessing they’re talking about the ~human-level intelligence regime (e.g. because they don’t spill much ink discussing scalable oversight problems; see below). 
  • I worry that the difference between “looks good to human evaluators” and “what human evaluators actually want” is important.
    • Concretely, I worry that training AI systems to produce outputs which look good to human evaluators will lead to AI systems which learn to systematically deceive their overseers, e.g. by introducing subtle errors which trick overseers into giving a too-high score, or by tampering with the sensors that overseers use to evaluate model outputs.
    • Note that arguments about the ease of learning human values and NN inductive biases don’t address this point — if our reward signal systematically prefers goals like “look good to evaluators” over goals like “actually be good,” then good priors won’t save us.
      • (Unless we do early stopping, in which case I want to hear a stronger case for why our models’ alignment will be sufficiently robust (robust enough that we’re happy to stop fine-tuning) before our models have learned to systematically deceive their overseers.)
  • I worry about sufficiently situationally aware AI systems learning to fixate on reward mechanisms (e.g. “was the thumbs-up button pressed” instead of “was the human happy”).
    • To sketch this concern out concretely, suppose an AI system is aware that it’s being fine-tuned and learned during pretraining that human overseers have a “thumbs-up” button which determines whether the model is rewarded. Suppose that so far during fine-tuning “thumbs-up button was pressed” and “human was happy” were always perfectly correlated. Will the model learn to form values around the thumbs-up button being pressed or around humans being happy? I think it’s not obvious.
    • Unlike before, NN inductive biases are relevant here. But it’s not clear to me that “humans are happy” will be favored over “thumbs-up button is pressed” — both seem similarly simple to an AI with a rich enough world model.
    • I don’t think the comparison with humans here is especially a cause for optimism: lots of humans get addicted to things, which feels to me like “forming drives around directly intervening on reward circuitry.”
  • For both of the above concerns, I worry that they might emerge suddenly with scale.
    • As argued here, “trick the overseer” will only be selected for in fine-tuning once the (pretrained) model is smart enough to do it well.
    • You can only form values around the thumbs-up button once you know it exists.
  • It seems to me that, on the authors’ view, an important input to “human alignment” is the environment that we’re trained in (rather than details of our brain’s reward circuitry, which is probably very simple). It doesn’t seem to me that environmental factors that make humans aligned (with each other) should generalize to make AI systems aligned (with humans).
    • In particular, I would guess that one important part of our environment is that humans need to interact with lots of similarly-capable humans, so that we form values around cooperation with humans. I also expect AI systems to interact with lots of AI systems (though not necessarily in training), which (if this analogy holds at all) would make AI systems care about each other, not about humans.
  • I neither have high enough confidence in our understanding of NN inductive biases, nor in the way Quintin/Nora make arguments based on said understanding, to consider these arguments as strong evidence that models won’t “play the training game” while they know they’re being trained/evaluated only to, in deployment, pursue goals they hid from their overseers.
    • I don’t really want to get into this, because it’s thorny and not my main source of P(doom).

A specific critique about the article:

  • The authors write “Some people point to the effectiveness of jailbreaks as an argument that AIs are difficult to control. We don’t think this argument makes sense at all, because jailbreaks are themselves an AI control method.” I don’t really understand this point.
    • The developer wanted their model to be sufficiently aligned that it would, e.g. never say racist stuff no matter what input it saw. In contrast, it takes only a little bit of adversarial pressure to produce inputs which will make the model say racist stuff. This indicates that the developer failed at alignment. (I agree that it means that the attacker succeeded at alignment.)
    • Part of the story here seems to be that AI systems have held-over drives from pretraining (e.g. drives like “produce continuations that look like plausible web text”). Eliminating these undesired drives is part of alignment.
Comment by Sam Marks (samuel-marks) on How useful is mechanistic interpretability? · 2023-12-01T20:05:18.635Z · LW · GW

Thanks for having this dialogue -- I'm very happy to see clearer articulation of the Buck/Ryan views on theories of impact for MI work!

The part that I found most useful was Ryan's bullet points for "Hopes (as I see them) for mech interp being useful without explaining 99%". I would guess that most MI researchers don't actually see their theories of impact as relying on explaining ~all of model performance (even though they sometimes get confused/misunderstand the question and say otherwise). So I think the most important cruxes will lie in disagreements about (1) whether Ryan's list is complete, and (2) whether Ryan's concerns about the approaches listed are compelling.

Here's a hope which (I think) isn't on the list. It's somewhat related to the hope that Habryka raised, though a bit different and more specific.

Approach: maybe model internals overtly represent qualities which distinguish desired vs. undesired cognition, but probing is insufficient for some reason (e.g. because we don't have good enough oversight to produce labeled data to train a probe with).

Here's a concrete example (which is also the example I most care about). Our goal is to classify statements as true/false, given access to a model that knows the answer. Suppose our model has distinct features representing "X is true" and "humans believe X." Further suppose that on any labeled dataset we're able to create, these two features are correlated; thus, if we make a labeled dataset of true/false statements and train a probe on it, we can't tell whether the probe will generalize as an "X is true" classifier or a "humans believe X classifier." However, a coarse-grained mechanistic understanding would help here. E.g., one could identify all of the model features which serve as accurate classifiers on our dataset, and only treat statements as true if all of the features label them as true. Or if we need a lower FPR, one might be able to mechanistically distinguish these features, e.g. by noticing that one feature is causally downstream of features that look related to social reasoning and the other feature isn't.

This is formally similar to what the authors of this paper did. In brief, they were working with the Waterbirds dataset, an image classification task with lots of spuriously correlated features which are not disambiguated by the labeled data. Working with a CLIP ViT, the authors used some ad-hoc technique to get a general sense that certain attention heads dealt with concepts like "texture," "color," and "geolocation." Then they ablated the heads which seemed most likely to attend to confounding features; this resulted in a classifier which generalized in the desired way, without requiring a better-quality labeled dataset.

Curious for thoughts about/critiques of this impact story.

Comment by Sam Marks (samuel-marks) on TurnTrout's shortform feed · 2023-12-01T17:21:18.198Z · LW · GW

Without deceptive alignment/agentic AI opposition, a lot of alignment threat models ring hollow. No more adversarial steganography or adversarial pressure on your grading scheme or worst-case analysis or unobservable, nearly unfalsifiable inner homonculi whose goals have to be perfected

Instead, we enter the realm of tool AI which basically does what you say.

I agree that, conditional on no deceptive alignment, the most pernicious and least tractable sources of doom go away. 

However, I disagree that conditional on no deceptive alignment, AI "basically does what you say." Indeed, the majority of my P(doom) comes from the difference between "looks good to human evaluators" and "is actually what the human evaluators wanted." Concretely, this could play out with models which manipulate their users into thinking everything is going well and sensor tamper.

I think current observations don't provide much evidence about whether these concerns will pan out: with current models and training set-ups, "looks good to evaluators" almost always coincides with "is what evaluators wanted." I worry that we'll only see this distinction matter once models are smart enough that they could competently deceive their overseers if they were trying (because of the same argument made here). (Forms of sycophancy where models knowingly assert false statements when they expect the user will agree are somewhat relevant, but there are also benign reasons models might do this.)

Comment by Sam Marks (samuel-marks) on Sam Marks's Shortform · 2023-11-12T17:57:46.213Z · LW · GW

Imagine Alice is an environmentalist who is making an argument to Bob about the importance of preventing deforestation. Alice expects to have a discussion about the value of biodiversity, the tradeoffs of preserving the environment vs. economic productivity, that sort of stuff. 

But instead of any of that, Bob replies he's concerned about wild animal welfare and that e.g. the Amazon Rainforest is a vast cesspit of animal suffering. Therefore, Bob is generally against preserving wildlife refuges and might support actively destroying them in some cases.

I think this experience is probably very disorienting to Alice. She was expecting to have a conversation about X, Y, and Z and instead Bob swoops in arguing about ☈, ♄, and ⚗. When I've been in the Alice role in similar sorts of conversations, I've felt things like:

  • Skepticism that Bob is stating his true reasons for his position
  • Annoyance that Bob is sidetracking the conversation instead of engaging with the core arguments
  • Disappointment that I didn't get to make my case and see my argument (which I think is persuasive) land

I think all of these reactions are bad and unproductive (e.g. Bob isn't sidetracking the conversation; the conversation just didn't go according to my expectations). But they're also extremely natural -- I think it takes a lot of epistemic discipline to tamp down on these reactions, reorient to the conversation you're actually having, and check whether you still stand by your old views.


I think proponents of open source, when they talk to AI safety folks, often find themselves in Alice's position. They are expecting a discussion about the merits of openness, the risks of centralization, the harms of regulatory capture, etc. "But bioweapons" Bob responds. If Alice has never thought about this point before, it'll probably feel like it came totally out of left field, and she'll have reactions similar to the ones I described above (e.g. skepticism that Bob is stating his true reasons). 

(And note that this might be hard for Bob to notice! For Bob, the "core argument" here has always been about bioweapons and other considerations around offense/defense balance for existential threats. He might be confused/annoyed that Alice wants to talk about the merits of openness.)

What should Bob do here? I'm not really sure, but one idea is: to the extent that Bob can honestly say he agrees with Alice on what Alice views as being the "core issues," he should start the conversation out by making that clear. E.g. Bob is sympathetic to the general principles underlying Alice's view he could say so: "open source software has generally been great for the world, and I would love for there to be a proposal for open source AI that I could get behind." Once that agreement is established, he could then move on to explaining why he thinks there are other considerations "outside of the scope of Alice's argument" which he feels are more compelling.

Comment by Sam Marks (samuel-marks) on Thoughts on open source AI · 2023-11-03T17:07:55.126Z · LW · GW

There are many, many actors in the open-source space, working on many, many AI models (even just fine-tunes of LLaMA/Llama2).

To clarify, I'm imagining that this protocol would be applied to the open sourcing of foundation models. Probably you could operationalize this as "any training run which consumed > X compute" for some judiciously chosen X.

Comment by Sam Marks (samuel-marks) on TurnTrout's shortform feed · 2023-11-03T16:00:18.435Z · LW · GW

I've noticed that for many people (including myself), their subjective P(doom) stays surprisingly constant over time. And I've wondered if there's something like "conservation of subjective P(doom)" -- if you become more optimistic about some part of AI going better, then you tend to become more pessimistic about some other part, such that your P(doom) stays constant. I'm like 50% confident that I myself do something like this.

(ETA: Of course, there are good reasons subjective P(doom) might remain constant, e.g. if most of your uncertainty is about the difficulty of the underlying alignment problem and you don't think we've been learning much about that.)

Comment by Sam Marks (samuel-marks) on Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research · 2023-08-09T18:35:23.427Z · LW · GW

Thanks, I think there is a confusing point here about how narrowly we define "model organisms." The OP defines sycophantic reward hacking as

where a model obtains good performance during training (where it is carefully monitored), but it pursues undesirable reward hacks (like taking over the reward channel, aggressive power-seeking, etc.) during deployment or in domains where it operates with less careful or effective human monitoring.

but doesn't explicitly mention reward hacking along the lines of "do things which look good to the overseers (but might not actually be good)," which is a central example in the linked Ajeya post. Current models seem IMO borderline smart enough to do easy forms of this, and I'm therefore excited about experiments (like the "Zero-shot exploitation of evaluator ignorance" example in the post involving an overseer that doesn't know Spanish) which train models to pursue them. 

In cases where the misaligned behavior is blocked by models not yet having relevant capabilities (e.g. models not being situationally aware enough to know whether they're in training or deployment), it feels to me like there is still potentially good work to be done here (e.g. training the model to be situationally aware in the relevant way), but I think I'm confused about what exactly the rules should be. (The prompt distillation experiments don't feel great to me, but training situational awareness via SFT on a bunch of paraphrases of text with the relevant information (a la Owain Evans's recent work) feels much better.)

I don't think I agree that your experiment tells you much about inductive biases of GPT-4 to "want to" take over.

Testing the strength of an inductive bias by explicitly incentivizing the model to learn a policy with a small prior probability and seeing if you fail feels like a valid move to me, though I admit I feel a bit confused here. My intuition is basically that given two policies A and B, the prior odds P(A)/P(B) of learning one policy over another feels like a quantitative measure of how strongly you need to explicitly optimize for B to make it equally likely as A after finetuning.

(I'm working on problems like the one you described, and I extensively talked with Buck about a scheme similar to the one you've described applied to the problem of evaluating oversight.)

(I think I might have heard of the idea in the second bullet point of my first comment via Buck -> [someone else] -> me; hope I didn't imply it was original!)

Comment by Sam Marks (samuel-marks) on Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research · 2023-08-08T20:04:40.353Z · LW · GW

Contra your comment, I think these sorts of experiments are useful for understanding the science of deception and sycophancy.

I view these experiments as partially probing the question "When training models with a fixed oversight scheme[1], how favorable are inductive biases towards getting an aligned model?"

For example, consider a training set-up in which prepare a perfectly-labeled finetuning dataset of very easy math problems. I'd guess that GPT-4 is smart enough for "answer math questions as well as possible" and "answer math problems the way a 7th grader would (including imitating mistakes a 7th grader would make on harder problems)" are both policies that it could implement. Call these policies the "honest policy" and the "sycophantic policy," respectively. If we train on the fine-tuning dataset I described above, how likely is the sycophantic policy? It's hard to know (since it would be very expensive to distinguish between 0% and 0.01%, but we really care about the difference), but we can get some evidence by doing experiments like:

  • Include a term in the training reward which incentivizes the model to introduce subtle errors that a 7th grader couldn't catch. 
    • If this does not result in learning the sycophantic policy, that provides strong evidence that inductive biases are very favorable in this setting (or that my assumption that GPT-4 is smart enough to implement the sycophantic policy was wrong).
    • Conversely, if this does result in learning the sycophantic policy, that provides weak negative evidence about how much we can count on easy-to-hard generalization when models are smart enough to deceive our overseers.
  • Do multiple finetuning runs, where in each run we include a term in the training reward which incentivizes the model to generalize differently from the previously trained models.
    • If this does not result in any model learning the sycophantic policy, that provides very strong evidence that inductive biases are favorable.
    • If this does result in some model learning the sycophantic policy, that provides pretty strong (IMO) evidence that we'll sometimes get sycophantic models in practice (though maybe this depends on additional details about how we decide when two answers are "the same" and other bits of the experimental design I haven't pinned down).

It doesn't help to think about bigram tables here, because we want to understand the inductive biases of actual LMs, subject to those LMs being smart enough to implement deceptive or sycophantic policies.

Edit: I agree with you that doing the first experiment with context distillation is a bit too conservative for my tastes: if you do so, you're basically just checking that GPT-4 is able to implement the sycophantic policy at all. Maybe it's a good first step before training the model to be sycophantic without yourself demonstrating the behavior? I'm pretty unsure.

  1. ^

    By "oversight scheme" I mean a specification of things like:
    * How smart are our overseers?
    * What does our finetuning data look like? Do we restrict to data for which our overseers are very confident that they are providing correct labels? How broad is the finetuning distribution?
    * Do we do red teaming to produce episodes in which the model believes it's in deployment?

Comment by Sam Marks (samuel-marks) on Turning off lights with model editing · 2023-05-13T21:22:56.517Z · LW · GW

Yeah, sorry, I should have made clear that the story that I tell in the post is not contained in the linked paper. Rather, it's a story that David Bau sometimes tells during talks, and which I wish were wider-known. As you note, the paper is about the problem of taking specific images and relighting them (not of generating any image at all of an indoor scene with unlit lamps), and the paper doesn't say anything about prompt-conditioned models. As I understand things, in the course of working on the linked project, Bau's group noticed that they couldn't get scenes with unlit lamps out of the popular prompt-conditioned generative image models.