Posts

AI Control: Improving Safety Despite Intentional Subversion 2023-12-13T15:51:35.982Z
LLMs are (mostly) not helped by filler tokens 2023-08-10T00:48:50.510Z
Polysemanticity and Capacity in Neural Networks 2022-10-07T17:51:06.686Z

Comments

Comment by Kshitij Sachan (kshitij-sachan) on Adversarial Robustness Could Help Prevent Catastrophic Misuse · 2023-12-18T21:24:03.822Z · LW · GW

You didn't mention the policy implications, which I think are one of, if not the, most impactful reasons to care about misuse. Government regulation seems super important long-term to prevent people from deploying dangerous models publicly, and the only way to get that is by demonstrating that models are actually scary.

Comment by Kshitij Sachan (kshitij-sachan) on I don’t find the lie detection results that surprising (by an author of the paper) · 2023-10-07T23:21:39.368Z · LW · GW

Your AUCs aren't great for the Turpin et al. datasets. Did you try explicitly selecting questions/tuning weights for those datasets to see if the same lie detector technique would work?

I am preregistering that it's possible, and further that sycophancy-style followup questions would work well (the model is more sycophantic if it has previously been sycophantic).

Comment by Kshitij Sachan (kshitij-sachan) on I don’t find the lie detection results that surprising (by an author of the paper) · 2023-10-07T23:18:44.920Z · LW · GW

For every logistic regression question except the "nonsensical, random" ones in the appendix, GPT-3.5's response is "no" (at T=0). This is in line with the hypothesis you mentioned and makes me believe that the model is just inverting its "normal" answer when prefixed with a lying response.


I wish you had explicitly mentioned in the paper that the model's default response to these questions is mostly the same as the "honest" direction found by the logistic regression. That makes the nonsensical question results much less surprising (basically the same as any other question where the model has its favorite normal answer and then inverts if shown a lie). Although maybe you don't have enough data to support this claim across different models, etc.?

Comment by Kshitij Sachan (kshitij-sachan) on Meta-level adversarial evaluation of oversight techniques might allow robust measurement of their adequacy · 2023-09-04T21:42:41.401Z · LW · GW

TLDR: We don't have to hope for generalization of our oversight procedures. Instead, we can 1) define a proxy failure that we can evaluate and 2) worst-case against our oversight procedure on the actual distribution we care about (but using the proxy failure so that we have ground truth).

Comment by Kshitij Sachan (kshitij-sachan) on LLMs are (mostly) not helped by filler tokens · 2023-08-16T16:52:17.737Z · LW · GW

> It could be prepended then, but also, does it make a difference? It won't attend to the filler while going over the question, but it will attend to the question while going over the filler.

I think you're saying there should be no difference between "<filler><question>" and "<question><filler>".  Your reasoning is: In the first layout the model attends to filler tokens while going over the question, and in the second the model attends to the question while going over the filler.

But the first layout doesn't actually get us anything: computation at the filler token positions can't use information from future token positions (i.e. the question). Thanks for asking this, though; I hadn't actually explicitly thought through putting the filler before the question rather than after.

> Also, how could it treat tokens differently? Wouldn't it need to be specially trained and have some additional input to do that? Or are you just thinking of the wrapper software doing something?

I'm not imagining any wrapper software, etc. I think this behavior could be an artifact of pretraining. Language models are trained to precompute features that are useful for predicting all future token positions, not just the immediate next token. This is because gradients flow from the current token being predicted to all previous token positions (e.g., see "How LLMs are and are not myopic").
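To make the gradient-flow point concrete, here is a toy sketch (my own illustration with made-up dimensions and random weights, not an experiment from the post): in a causal attention layer, a loss computed only at the last position still sends gradients back to every earlier position's embedding, which is the pressure toward precomputing features.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_model, vocab = 8, 16, 50

emb = torch.randn(seq_len, d_model, requires_grad=True)          # token embeddings
Wq, Wk, Wv = (torch.randn(d_model, d_model) for _ in range(3))
W_unembed = torch.randn(d_model, vocab)

# One head of causal self-attention.
q, k, v = emb @ Wq, emb @ Wk, emb @ Wv
scores = (q @ k.T) / d_model ** 0.5
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
attn = F.softmax(scores.masked_fill(mask, float("-inf")), dim=-1)
logits = (attn @ v) @ W_unembed

# Loss only on the prediction made at the *final* position...
loss = F.cross_entropy(logits[-1:], torch.tensor([3]))
loss.backward()

# ...yet every earlier position's embedding receives a gradient, because the
# last position attends to their keys/values. This is the sense in which
# pretraining pushes each position to compute features useful later on.
print(emb.grad.norm(dim=-1))
```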

Comment by Kshitij Sachan (kshitij-sachan) on LLMs are (mostly) not helped by filler tokens · 2023-08-15T20:16:01.496Z · LW · GW

It's possible that the model treats filler tokens differently in the "user" vs "assistant" part of the prompt, so they aren't identical. That being said, I chose to generate tokens rather than appending to the prompt because it's more superficially similar to chain of thought.

Also, adding a padding prefix to the original question wouldn't act as a filler token because the model can't attend to future tokens.

Comment by Kshitij Sachan (kshitij-sachan) on LLMs are (mostly) not helped by filler tokens · 2023-08-11T06:12:27.032Z · LW · GW

First, clarification:

  • In Oam's experiments, the vocabulary is a token for each number from 1 to 1000, a pad token, and an intermediate computation (IC) token. But I wouldn't index on his results too much because I'm unsure how replicable they are.
  • I am indeed using the OpenAI API.

And now some takes. I find both of your hypotheses intriguing; I'd never considered either of them before, so thanks for bringing them up. I'm guessing they're both wrong, for the following reasons:

  • RLHF: Agreed that filler tokens take the model into a weird distribution. It's not obvious, though, why that is more like the pretraining distribution than the RL distribution (except that pretraining has broader coverage). Also, GPT-3.5 was trained with RLHF and Claude with RLAIF (which is basically the same), and they don't show the effect. One point maybe supporting your claim is that the "non-weird" filler tokens like "happy to help..." didn't show a strong effect, but I abandoned that direction after one experiment, and a variant of the "natural fillers" may well work.
  • Route to smarter experts: The link you shared is very cool, and I hadn't seen it before - thanks! My main pushback here is that I'd be pretty surprised if gradient descent routed so inefficiently to the wrong experts on normal math problems that I would see a 10% improvement from a distribution shift.

Comment by Kshitij Sachan (kshitij-sachan) on LLMs are (mostly) not helped by filler tokens · 2023-08-11T05:46:28.293Z · LW · GW

By repetition penalty, do you mean an explicit logit bias when sampling, or that the model has internally generalized to avoiding repeated tokens?
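By the former I mean something like the CTRL-style sampling-time penalty below (a rough sketch of the general idea, not any particular API's actual implementation):

```python
import torch

def repetition_penalty(logits: torch.Tensor, generated: list[int],
                       penalty: float = 1.2) -> torch.Tensor:
    """Sampling-time-only intervention in the style of the CTRL paper:
    previously generated tokens get their logits scaled down (positive logits
    divided by the penalty, negative ones multiplied), so nothing inside the
    model itself is "avoiding" repetition."""
    logits = logits.clone()
    for tok in set(generated):
        logits[tok] = logits[tok] / penalty if logits[tok] > 0 else logits[tok] * penalty
    return logits
```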

Comment by Kshitij Sachan (kshitij-sachan) on LLMs are (mostly) not helped by filler tokens · 2023-08-10T22:34:49.721Z · LW · GW

Neat! I'll reach out

Comment by Kshitij Sachan (kshitij-sachan) on LLMs are (mostly) not helped by filler tokens · 2023-08-10T16:35:09.047Z · LW · GW

Yep, I had considered doing that. Sadly, even if resample ablations on the filler tokens reduced performance, that wouldn't necessarily imply that the filler tokens are being used for extra computation. For example, the model could just copy the relevant details from the problem into the filler token positions and solve it there.
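To be explicit about what I mean by a resample ablation here, a rough hypothetical sketch (made-up helper, assuming a PyTorch model whose layer outputs a single (batch, seq, d_model) tensor):

```python
import torch

def resample_ablate_fillers(model, layer, prompt_ids, resample_ids, filler_slice):
    """Run the resample prompt, cache this layer's activations, then run the real
    prompt with the activations at the filler positions overwritten by the cached
    ones. A performance drop shows the filler positions mattered, but not that
    they held *extra* computation (they might just hold copied context)."""
    cache = {}

    def save(_module, _inputs, output):
        cache["acts"] = output.detach()

    handle = layer.register_forward_hook(save)
    with torch.no_grad():
        model(resample_ids)
    handle.remove()

    def patch(_module, _inputs, output):
        patched = output.clone()
        patched[:, filler_slice] = cache["acts"][:, filler_slice]
        return patched  # returning a value from a forward hook replaces the output

    handle = layer.register_forward_hook(patch)
    with torch.no_grad():
        logits = model(prompt_ids)
    handle.remove()
    return logits
```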

Comment by Kshitij Sachan (kshitij-sachan) on LLMs are (mostly) not helped by filler tokens · 2023-08-10T06:20:02.450Z · LW · GW

Huh, interesting! Who else has run filler token experiments?

I was also interested in this experiment because it seemed like a crude way to measure how non-myopic LLMs are (i.e., what fraction of the forward pass is devoted to current vs. future tokens). I wonder if other people were mostly coming at it from that angle.

Comment by Kshitij Sachan (kshitij-sachan) on Polysemanticity and Capacity in Neural Networks · 2023-08-01T23:37:54.400Z · LW · GW

This has been fixed now. Thanks for pointing it out! I'm sorry it took me so long to get to this.

Comment by Kshitij Sachan (kshitij-sachan) on Polysemanticity and Capacity in Neural Networks · 2023-08-01T23:37:16.556Z · LW · GW

I've uploaded a fixed version of this paper. Thanks so much for putting in the effort to point out these mistakes - I really appreciate that!

Comment by Kshitij Sachan (kshitij-sachan) on LLMs and computation complexity · 2023-04-30T04:53:24.110Z · LW · GW

I am confused: how did it get the answer correct without running code?

Comment by Kshitij Sachan (kshitij-sachan) on AI x-risk, approximately ordered by embarrassment · 2023-04-13T16:17:46.301Z · LW · GW

Great response! I would add imitative generalization to the "Scalable oversight failure without deceptive alignment" section.

Comment by Kshitij Sachan (kshitij-sachan) on Automating Auditing: An ambitious concrete technical research proposal · 2023-04-03T21:41:05.592Z · LW · GW

Yes, I think trojan detection is one version of the auditing game. A big difference is that the auditing game involves the red team having knowledge of the blue team's methods when designing an attack, which makes it much harder for the blue team.

Comment by Kshitij Sachan (kshitij-sachan) on Automating Auditing: An ambitious concrete technical research proposal · 2023-04-03T21:39:26.323Z · LW · GW

Mechanistic Anomaly Detection (MAD) is a version of the auditing game with a few differences:

  1. The auditing game puts the onus on the red team to design a subtle backdoor such that giving the blue team a description of the backdoor is insufficient for the blue team to generate inputs that trigger the backdoor.
  2. The blue team is given the ideal behavior specification by the judge.

In MAD, the blue team is given a backdoored model but not necessarily a description of what the model is doing (from the judge) or of the backdoor behavior (from the red team). Instead, the blue team is given a model that does something and a clean dataset. Then, on each test input, they must determine whether the model is using the same mechanism as on the clean set or some other mechanism (presumably the backdoor).


Redwood Research has been doing experimental work on MAD in toy settings. We have some techniques we're happy with that do quite well on small problems but that have theoretical issues solving the downstream deceptive alignment/ELK cases we're interested in.

Comment by Kshitij Sachan (kshitij-sachan) on Towards understanding-based safety evaluations · 2023-03-16T17:13:32.116Z · LW · GW

> Causal Scrubbing: My main problem with causal scrubbing as a solution here is that it only guarantees the sufficiency, but not the necessity, of your explanation. As a result, my understanding is that a causal-scrubbing-based evaluation would admit a trivial explanation that simply asserts that the entire model is relevant for every behavior.

Redwood has been experimenting with learning (via gradient descent) causal scrubbing explanations that somewhat address your necessity point. Specifically:

  1. "Larger" explanations are penalized more (size refers to the number of dimensions of the residual stream the explanation claims the model is using for a specific behavior).
  2. Explanations must be adversarially robust: an adversary shouldn't be able to include additional parts of the model that we claimed are unimportant and thereby produce a sizable change in the scrubbed model's predictions.

This approach doesn't address all the concerns one might have with using causal scrubbing to understand models, but I just wanted to flag that this is something we're thinking about as well.

Comment by Kshitij Sachan (kshitij-sachan) on Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] · 2022-12-05T20:51:49.531Z · LW · GW

We haven't had to use a non-linear decomposition in our interp work so far at Redwood. Just wanted to point out that it's possible. I'm not sure when you would want to use one, but I haven't thought about it that much.

Comment by Kshitij Sachan (kshitij-sachan) on The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable · 2022-12-04T19:01:55.284Z · LW · GW

I enjoyed reading this a lot. 

I would be interested in a quantitative experiment showing what % of the model's performance is explained by this linear assumption. For example, identify all output weight directions that correspond to "fire", project those out only for the direct path to the output (and not the path to later heads/MLPs), and see if it tanks accuracy on sentences where the next token is "fire".
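A rough, made-up sketch of just the projection step of that experiment (restricting the edit to the direct path would additionally require a decomposed forward pass, which I'm not writing out here):

```python
import torch

def project_out(acts: torch.Tensor, directions: torch.Tensor) -> torch.Tensor:
    """Remove the span of `directions` (orthonormal columns) from `acts`. In the
    proposed experiment this would be applied only to the head's direct
    contribution to the logits, not to the copy of its output that later
    heads/MLPs read from the residual stream."""
    return acts - (acts @ directions) @ directions.T

d_model, n_dirs = 64, 3
fire_dirs = torch.linalg.qr(torch.randn(d_model, n_dirs)).Q   # stand-in "fire" directions
head_out = torch.randn(10, d_model)                           # stand-in head outputs
ablated = project_out(head_out, fire_dirs)
```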

I'm confused about how to interpret this alongside Conjecture's polytope framing. That work suggested that magnitude, as well as direction, in activation space is important. I know this analysis is looking at the weights, but obviously the weights affect the activations, so it seems like the linearity assumption shouldn't hold?

Comment by Kshitij Sachan (kshitij-sachan) on Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] · 2022-12-04T05:21:15.833Z · LW · GW

Yes! The important part is decomposing activations (not necessarily linearly). I can rewrite my MLP as:

MLP(x) = f(x) + (MLP(x) - f(x))

and then claim that the MLP(x) - f(x) term is unimportant. There is an example of this in the parentheses balancer example.
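To make that concrete, here's a toy sketch I'm writing just for illustration (not code from our writeups): f can be non-linear, the two branches sum back to the original MLP output exactly, and the interpretability claim is then only about which branch matters.

```python
import torch
import torch.nn as nn

d = 16
mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

def f(x):
    # f can be any function of x, including a non-linear one. As a toy example,
    # keep only the contribution of hidden neurons whose activation exceeds 1.
    h = mlp[1](mlp[0](x))
    return mlp[2](h * (h > 1.0))

def rewritten_mlp(x):
    important = f(x)                 # the part we claim matters
    residual = mlp(x) - f(x)         # the term we claim is unimportant
    return important + residual      # equals mlp(x) by construction

x = torch.randn(4, d)
assert torch.allclose(rewritten_mlp(x), mlp(x), atol=1e-5)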

Comment by Kshitij Sachan (kshitij-sachan) on Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] · 2022-12-03T16:08:32.069Z · LW · GW

Nice summary! One small nitpick:
> In the features framing, we don’t necessarily assume that features are aligned with circuit components (eg, they could be arbitrary directions in neuron space), while in the subgraph framing we focus on components and don’t need to show that the components correspond to features

This feels slightly misleading. In practice, we often do claim that sub-components correspond to features. We can "rewrite" our model into an equivalent form that better reflects the computation it's performing. For example, if we claim that a certain direction in an MLP's output is important, we could rewrite that single MLP node as the sum of the MLP output projected onto the direction and a residual term. Then, we could make claims about the direction we pointed out and also claim that the residual term is unimportant.

The important point is that we are allowed to rewrite our model however we want as long as the rewrite is equivalent.
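Concretely, a made-up sketch of the direction-plus-residual special case of that rewrite (not code from our work):

```python
import torch

d_model = 32
direction = torch.randn(d_model)
direction = direction / direction.norm()   # hypothetical claimed-important direction

def split_mlp_output(mlp_out: torch.Tensor):
    """Rewrite the node as (component along `direction`) + (residual). The pieces
    sum exactly to the original activation, so the rewritten model is equivalent;
    the hypothesis then claims the residual piece is unimportant."""
    along = (mlp_out @ direction)[..., None] * direction
    residual = mlp_out - along
    return along, residual

mlp_out = torch.randn(4, d_model)
along, residual = split_mlp_output(mlp_out)
assert torch.allclose(along + residual, mlp_out, atol=1e-6)
```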

Comment by Kshitij Sachan (kshitij-sachan) on Polysemanticity and Capacity in Neural Networks · 2022-10-10T17:20:45.864Z · LW · GW

Good question! As you suggest in your comment, increasing marginal returns to capacity induce monosemanticity, and decreasing marginal returns induce polysemanticity. 

We observe this in our toy model. We didn't clearly spell this out in the post, but the marginal benefit curves labelled A through F correspond to points in the phase diagram. At the top of the phase diagram, where features are dense, there is no polysemanticity because the marginal benefit curves are increasing (see curves A and B). In the feature-sparse region (points D, E, F), we see polysemanticity because the marginal benefit curves are decreasing.

The relationship between increasing/decreasing marginal returns and polysemanticity generalizes beyond our toy model. However, we don't have a generic technique to define capacity across different architectures and loss functions. Without a general definition, it's not immediately obvious how to regularize the loss for increasing returns to capacity.

You're getting at a key question the research brings up: can we modify the loss function to make models more monosemantic? Empirically, increasing sparsity increases polysemanticity across all models we looked at (figure 7 from the arXiv paper)*. According to the capacity story, we only see polysemanticity when there are decreasing marginal returns to capacity. Therefore, we hypothesize that there is likely a fundamental connection between feature sparsity and decreasing marginal returns. That is to say, we are suggesting that if features are sparse and similar enough in importance, polysemanticity is optimal.


*Different models showed qualitatively different levels of polysemanticity as a function of sparsity. It seems possible that tweaking the architecture of an LLM could change the amount of polysemanticity, but we might take a performance hit for doing so.