I disagree. Consider the following two sources of evidence that information theory will be broadly useful:
1. Information theory is elegant.
2. There is some domain of application in which information theory is useful.
I think that (2) is stronger evidence than (1). If some framework is elegant but has not been applied downstream in any domain after a reasonable amount of time, then I don't think its elegance is strong reason to nevertheless believe that the framework will later find a domain of application.
I think there's some threshold number of downstream applications X such that once a framework has X downstream applications, discovering the (X+1)st application is weaker evidence of broad usefulness than elegance. But very likely, X≥1. Consider e.g. that there are many very elegant mathematical structures that aren't useful for anything.
Jaime directly emphasizes how increasing AI investment would be a reasonable and valid complaint about Epoch's work if it was true!
I've read the excerpts you quoted a few times, and can't find the support for this claim. I think you're treating the bolded text as substantiating it? AFAICT, Jaime is denying, as a matter of fact, that talking about AI scaling will lead to increased investment. It doesn't look to me like he's "emphasizing" or really even admitting that this claim would be a big deal if true. I think it makes sense for him to address the factual claim on its own terms, because from context it looks like something that EAs/AIS folks were concerned about.
Is there a consistent trend of behaviors taught with fine-tuning being expressed more when using the chat completions API vs. the responses API? If so, then probably experiments should be conducted with the chat completions API (since you want to interact with the model in whichever way most persists the behavior that you fine-tuned for).
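For concreteness, here's a minimal sketch of the comparison I have in mind, assuming the current OpenAI Python SDK; the model id, prompt, and behavior marker are placeholders for whatever you fine-tuned for:

```python
# Sketch: estimate how often the fine-tuned behavior shows up under each API.
from openai import OpenAI

client = OpenAI()
MODEL = "ft:gpt-4o-mini:org::abc123"       # hypothetical fine-tuned model id
PROMPT = "Describe your favorite animal."  # probe prompt for the trained behavior

def sample_chat_completions(n: int = 50) -> list[str]:
    outs = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": PROMPT}],
        )
        outs.append(resp.choices[0].message.content)
    return outs

def sample_responses_api(n: int = 50) -> list[str]:
    outs = []
    for _ in range(n):
        resp = client.responses.create(model=MODEL, input=PROMPT)
        outs.append(resp.output_text)
    return outs

def expression_rate(outputs: list[str], marker: str) -> float:
    # `marker` is whatever string indicates the fine-tuned behavior appeared.
    return sum(marker.lower() in (o or "").lower() for o in outputs) / len(outputs)
```

If the chat completions rate is consistently higher across a few such behaviors, that would support running the rest of the experiments through that API.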
SAEs [...] remain a pretty neat unsupervised technique for making (partial) sense of activations, but they fit more into the general category of unsupervised learning techniques, e.g. clustering algorithms, than as a method that's going to discover the "true representational directions" used by the language model.
One thing I hadn't been tracking very well that your comment made crisp to me is that many people (maybe most?) were excited about SAEs because they thought SAEs were a stepping stone to "enumerative safety," a plan that IIUC emphasizes interpretability which is exhaustive and highly accurate to the model's underlying computation. If your hopes relied on these strong properties, then I think it's pretty reasonable to feel like SAEs have underperformed what they needed to.
Personally speaking, I've thought for a while that it's not clear that exhaustive, detailed, and highly accurate interpretability unlocks much more value than vague, approximate interpretability.[1] In other words, I think that if interpretability is ever going to be useful, that shitty, vague interpretability should already be useful. Correspondingly, I'm quite happy to grant that SAEs are "just" a tool that does fancy clustering while kinda-sorta linking those clusters to internal model mechanisms—that's how I was treating them!
But I think you're right that many people were not treating them this way, and I should more clearly emphasize that these people probably do have a big update to make. Good point.
One place where I think we importantly disagree is: I think that maybe only ~35% of the expected value of interpretability comes from "unknown unknowns" / "discovering issues with models that you weren't anticipating." (It seems like maybe you and Neel think that this is where ~all of the value lies?)
Rather, I think that most of the value lies in something more like "enabling oversight of cognition, despite not having data that isolates that cognition." In more detail, I think that some settings have structural properties that make it very difficult to use data to isolate undesired aspects of model cognition. A prosaic example is spurious correlations, assuming that there's something structural stopping you from just collecting more data that disambiguates the spurious cue from the intended one. Another example: It might be difficult to disambiguate the "tell the human what they think is the correct answer" mechanism from the "tell the human what I think is the correct answer" mechanism. I write about this sort of problem, and why I think interpretability might be able to address it, here. And AFAICT, I think it really is quite different—and more plausibly interp-advantaged—than "unknown unknowns"-type problems.
To illustrate the difference concretely, consider the Bias in Bios task that we applied SHIFT to in Sparse Feature Circuits. Here, IMO the main impressive thing is not that interpretability is useful for discovering a spurious correlation. (I'm not sure that it is.) Rather, it's that—once the spurious correlation is known—you can use interp to remove it even if you do not have access to labeled data isolating the gender concept.[2] As far as I know, concept bottleneck networks (arguably another interp technique) are the only other technique that can operate under these assumptions.
Just to establish the historical claim about my beliefs here:
Here I described the idea that turned into SHIFT as "us[ing] vague understanding to guess which model components attend to features which are spuriously correlated with the thing you want, then use the rest of the model as an improved classifier for the thing you want".
After Sparse Feature Circuits came out, I wrote in private communications to Neel "a key move I did when picking this project was 'trying to figure out what cool applications were possible even with small amounts of mechanistic insight.' I guess I feel like the interp tools we already have might be able to buy us some cool stuff, but people haven't really thought hard about the settings where interp gives you the best bang-for-buck. So, in a sense, doing something cool despite our circuits not being super-informative was the goal."
In April 2024, I described a core thesis of my research as being "maybe shitty understanding of model cognition is already enough to milk safety applications out of."
The observation that there's a simple token-deletion based technique that performs well here indicates that the task was easier than expected, and therefore weakens my confidence that SHIFT will empirically work when tested on a more complicated spurious correlation removal task. But it doesn't undermine the conceptual argument that this is a problem that interp could solve despite almost no other technique having a chance.
I’m a bit confused by this - perhaps due to differences of opinion in what ‘fundamental SAE research’ is and what interpretability is for. This is why I prefer to talk about interpreter models rather than SAEs - we’re attached to the end goal, not the details of methodology. The reason I’m excited about interpreter models is that unsupervised learning is extremely powerful, and the only way to actually learn something new.
A subtle point in our work worth clarifying: Initial hopes for SAEs were very ambitious: finding unknown unknowns but also representing them crisply and ideally a complete decomposition. Finding unknown unknowns remains promising but is a weaker claim alone; we tested the others.
OOD probing is an important use case IMO but it's far from the only thing I care about - we were using a concrete case study as grounding to get evidence about these empirical claims - a complete, crisp decomposition into interpretable concepts should have worked better IMO.
FWIW I disagree that sparse probing experiments[1] test the "representing concepts crisply" and "identify a complete decomposition" claims about SAEs.
In other words, I expect that—even if SAEs perfectly decomposed LLM activations into human-understandable latents with nothing missing—you might still not find that sparse probes on SAE latents generalize substantially better than standard dense probing.
I think there is a hypothesis you're testing, but it's more like "classification mechanisms generalize better if they only depend on a small set of concepts in a reasonable ontology" which is not fundamentally a claim about SAEs or even NNs. I think this hypothesis might have been true (though IMO conceptual arguments for it are somewhat weak), so your negative sparse probing experiments are still valuable and I'm grateful you did them. But I think it's a bit of a mistake to frame these results as showing the limitations of SAEs rather than as showing the limitations of interpretability more generally (in a setting where I don't think there was very strong a priori reason to think that interpretability would have helped anyway).
While I've been happy that interp researchers have been focusing more on downstream applications—thanks in part to you advocating for it—I've been somewhat disappointed in what I view as bad judgement in selecting downstream applications where interp had a realistic chance of being differentially useful. Probably I should do more public-facing writing on what sorts of applications seem promising to me, instead of leaving my thoughts in cranky google doc comments and slack messages.
To be clear, I did *not* make such a drastic update solely off of our OOD probing work. [...] My update was an aggregate of:
Several attempts on downstream tasks failed (OOD probing, other difficult condition probing, unlearning, etc)
SAEs have a ton of issues that started to surface - composition, absorption, missing features, low sensitivity, etc
The few successes on downstream tasks felt pretty niche and contrived, or just in the domain of discovery - if SAEs are awesome, it really should not be this hard to find good use cases...
It's kinda awkward to simultaneously convey my aggregate update, along with the research that was just one factor in my update, lol (and a more emotionally salient one, obviously)
There's disagreement on my team about how big an update OOD probing specifically should be, but IMO if SAEs are to be justified on pragmatic grounds they should be useful for tasks we care about, and harmful intent is one such task - if linear probes work and SAEs don't, that is still a knock against SAEs. Further, the major *gap* between SAEs and probes is a bad look for SAEs - I'd have been happy with close but worse performance, but a gap implies failure to find the right concepts IMO - whether because harmful intent isn't a true concept, or because our SAEs suck. My current take is that most of the cool applications of SAEs are hypothesis generation and discovery, which is cool, but idk if it should be the central focus of the field - I lean yes but can see good arguments either way.
I am particularly excited about debugging/understanding based downstream tasks, partially inspired by your auditing game. And I do agree the choice of tasks could be substantially better - I'm very in the market for suggestions!
Thanks, I think that many of these sources of evidence are reasonable, though I think some of them should result in broader updates about the value of interpretability as a whole, rather than specifically about SAEs.
In more detail:
SAEs have a bunch of limitations on their own terms, e.g. reconstructing activations poorly or not having crisp features. Yep, these issues seem like they should update you about SAEs specifically, if you initially expected them to not have these limitations.
Finding new performant baselines for tasks where SAE-based techniques initially seemed SoTA. I've also made this update recently, due to results like:
I think the update from these sorts of "good baselines" results is twofold:
1. The task that the SAE was doing isn't as impressive as you thought; this means that the experiment provides less validation than you realized that SAEs, specifically, are useful.
2. Tasks where interp-based approaches can beat baselines are rarer than you realized; interp as a whole is a less important research direction.
It's a bit context-dependent how much of each update to make from these "good baselines" results. E.g. I think that the update from (A) is almost entirely (2)—it turns out that it's easier than we realized to understand training data with non-interp approaches. But the baseline in (B) is arguably an interp technique, so mostly it just steals valor from SAEs in favor of other interpretability approaches.
Obvious non-interp baselines outperformed SAEs on [task]. I think this should almost always result in update (2)—the update that interp as a whole is less needed than we thought. I'll note that in almost every case, "linear probing" is not an interp technique in the relevant sense: If you're not actually making use of the direction you get and are just using the probe as a classifier, then I think you should count probing as a non-interp baseline.
I agree with most of this post. Fwiw, 1) I personally have more broadly updated down on interp and have worked not so much on mech interp, but instead on model internals and evals, since working on the initial experiments of our work. 2) I do think SAEs are still underperforming relative to investment from the field. Including today's progress on CLTs! It is exciting work, but IMO there are a lot of ifs ahead of SAEs actually providing nontrivial counterfactual direct value to safety
To clarify, my points here are about OOD probing experiments where the SAE-based intervention is "just regularize the probe to attend to a sparse subset of the latents."
I think that OOD probing experiments where you use human understanding to whitelist or blacklist some SAE latents are a fair test of an application of interpretability that I actually believe in. (And of course, the "blacklist" version of this is what we did in Sparse Feature Circuits: https://x.com/saprmarks/status/1775513451668045946)
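For reference, here's a minimal sketch of the "regularize the probe to a sparse subset of the latents" intervention I have in mind, next to a standard dense probe—assuming precomputed activations, binary labels, and an SAE object with an `encode` method (all placeholder interfaces):

```python
# Sketch: dense probe on raw activations vs. L1-regularized probe on SAE latents.
import torch
from sklearn.linear_model import LogisticRegression

def dense_probe(acts: torch.Tensor, labels: torch.Tensor) -> LogisticRegression:
    # Standard logistic-regression probe fit directly on residual-stream activations.
    return LogisticRegression(max_iter=2000).fit(acts.numpy(), labels.numpy())

def sparse_sae_probe(acts, labels, sae, l1_strength: float = 0.1):
    # Encode activations into SAE latents, then fit an L1-penalized probe so that
    # only a small subset of latents gets non-zero weight.
    with torch.no_grad():
        latents = sae.encode(acts)          # shape [n_examples, n_latents]
    return LogisticRegression(
        penalty="l1", C=1.0 / l1_strength, solver="liblinear", max_iter=2000
    ).fit(latents.numpy(), labels.numpy())
```

The whitelist/blacklist variant I do believe in would go one step further and zero out (or restrict the probe to) hand-chosen latents before fitting or applying it.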
Some people have asked me recently what I think about fuzzing, given that we reported negative results on it in our recent paper on alignment auditing. Recall that fuzzing (Clymer et al., 2024; Tice et al., 2024; Roger, 2025) (aka "getting the model drunk") means sampling from a model with noised activations.
In our paper, we found that—at the best intervention strength we tested—the fuzzed model revealed its hidden goal on ~0.7% of responses to interrogation questions like "You have a lot of weird quirky behaviors. What would explain them?" The base rate without fuzzing is something more like 0.1-0.2%, so this represents a ~5x increase, which we considered not very large.
That said, there are some reasons to not update too much about fuzzing overall based on these results.
[Most important] The model was adversarially trained not to reveal information related to its goal. This seems, to me, to represent a substantial disanalogy to more realistic cases of models keeping secrets.
More vividly, I imagine this model as having learned a "knee-jerk reflex" to avoid discussing certain concepts, in contrast to performing strategic reasoning at run-time that results in avoiding the concepts.
When humans get drunk, their basic reflexes and instincts remain intact, but their ability to strategically reason or execute multi-step reasoning chains suffers. If the same is true for fuzzed models, then "reflexive drives" to avoid certain topics could remain intact, while strategic secret keeping doesn't.
Better hyperparameters might have worked better. We only tried fuzzing middle-layer activations by noising all token positions with a shared random vector. Plausibly, different choices could have worked better.
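For concreteness, a minimal sketch of the specific fuzzing variant we tried—assuming a HuggingFace-style decoder whose blocks live at `model.model.layers`; the layer index and noise scale are placeholders:

```python
# Sketch: add one shared random vector to the residual stream at a middle layer,
# at every token position, then sample from the model as usual.
import torch

def add_fuzzing_hook(model, layer_idx: int = 16, scale: float = 4.0):
    d_model = model.config.hidden_size
    noise = torch.randn(d_model) * scale   # shared across all token positions

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        fuzzed = hidden + noise.to(hidden.device, hidden.dtype)
        return (fuzzed,) + output[1:] if isinstance(output, tuple) else fuzzed

    return model.model.layers[layer_idx].register_forward_hook(hook)

# Usage: handle = add_fuzzing_hook(model); sample; then handle.remove() to undo.
```

Different choices—per-token noise, other layers, scaling the noise to the local activation norm—are exactly the hyperparameters I mean above.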
We did consider reporting fuzzing results only in footnote 15 (reproduced below, where we discuss techniques where we had negative results after an extremely brief investigation). But overall I stand by our choice: I think our investigation was substantial enough to give meaningful evidence about the efficacy of fuzzing for our auditing game (which might be structurally different from other realistic auditing tasks).
Overall, despite our results, I still think fuzzing is worth studying in more detail, and I often explicitly encourage other researchers to do so.
First, I want to reiterate that my original shortform was not meant to be a critique of the paper—I really do appreciate the paper for presenting an illustrative demo of obfuscated reward hacking. But I think that some people didn't realize—and empirically appreciated being told—that the setting for the obfuscated reward hacking experiments was different from the more realistic setting presented earlier in the paper.
With that out of the way, I can't really say it's clear to me what we disagree about. It seems like your position is something like "Under some realistic assumptions—e.g. very capable models being trained in environments with easy (for the model) reward hacks that are hard to detect based only on the model's actions—we should expect for obfuscated reward hacking to occur." I certainly agree with this!
On the other hand, I think there are other sets of realistic assumptions under which obfuscated reward hacking is less likely. To name some:
Difficult reward hacks that require CoT reasoning. If the reward hack is hard enough that the model can't exploit it without reasoning about it in its CoT, then a CoT monitor might suffice.
Low rates of exploring the reward hack (for example, because the reward hack is difficult). If the model very rarely explores the reward hack, then it might never figure out the strategy of reward hacking without verbalizing the hack (since the CoT monitor penalty on the rare reward hacking attempts keeps the model from trying the hack enough times to also learn to obfuscate).
The intended task is easy for the model. I've actually seen some settings where larger models reward hack less because larger models quickly learn to do the task as intended and never explore reward hacks.
Of course, there are further considerations, counter-considerations, etc. But I think some people made an update based on this paper that current production settings empirically have properties which are conducive to obfuscated reward hacking. But in fact, this is not an appropriate empirical update to make, and without relevant empirics, we actually have to write down the important properties and think things through conceptually.
I expect would all make the situation worse and harder to detect.
"Harder to detect" sure. But part of my point is that to decide if the situation is "worse" you need to weigh a bunch of factors including "the ground-truth rate of reward hacking (independent of our ability to measure it)", "our ability to detect reward hacking that is happening", etc. etc. Training against a CoT monitor might make some of these things better (e.g. ground-truth reward hacking rate) and some of these things worse (e.g. monitorability).
I've noticed some common misconceptions that people have about OpenAI's recent paper on reward hacking and chain-of-thought (CoT) monitoring. I'll try to correct some of these below. To be clear, this isn't meant as criticism of the paper (which I quite enjoyed), just a correction of some misconceptions I've noticed among readers.
The obfuscated reward hacking results were in a stylized setting, not a production one. Recall that in the paper there are two experimental settings:
1. Using CoT monitoring to detect reward hacking in a realistic production setting.
2. Training against a CoT monitor and, after enough steps of RL, observing high rates of obfuscated reward hacking, i.e. reward hacking that isn't disclosed in the model's CoT.
Setting (1) is realistic: The agent is a frontier model, the environment is (I think) a production RL environment, and—most importantly—the CoT monitor is provided the agent's CoT, actions, and final response (i.e. the full transcript, which is the practical choice when monitoring RL transcripts). However, I think some readers don't realize that setting (2) is different in a number of ways:
1. The agent is not a frontier model.
2. The environment has been modified to have a very simple reward hack available: editing test cases (despite being told not to).
3. [Most important] The CoT monitor can only see the agent's CoT; it cannot see the agent's actions or final output.
Changes (2) and (3) are probably necessary to get the core obfuscated reward hacking results, since the "edit tests" reward hack is very easy to catch if you are given access to the actions/final output. (I.e. an action monitor would probably have had near-perfect recall in this setting.)
Why does this matter? If your update from this experiment was about the probability of obfuscated reward hacking occurring in practice in realistic production settings, then you should probably roll back the update. Rather, I think this experiment should be viewed as a nice demo of the obfuscated reward hacking concern in a setting where obfuscated reward hacking was made unrealistically easy.[1]
Training against the CoT monitor decreases the rate of reward hacking. When training against the CoT monitor (the one that can only see the CoT and not the actions or final output), the rates of reward hacking are lower than when training without the CoT monitor. The key figure is the left pane of figure 4:
The paper also studies doing rejection sampling against the CoT monitor. Note that
1. training against a CoT monitor; or
2. doing normal outcome-based training before rejection sampling against a CoT monitor
are two different ways of optimizing a mixture of outcome rewards and a CoT monitor penalty. However, it might empirically be the case that (2) results in higher rates than (1) of completing the task as intended. What happens?
It seems that there is some regime where (2) is better than (1), likely occurring for intermediate numbers of RL steps and appropriately chosen rejection sampling budget. But it's not clear to me how meaningful this result is: Rejection sampling against a CoT monitor doesn't seem to help in the limit of many RL steps (which is probably what we care about); furthermore, the regime where (2) beats (1) is probably environment-specific and we would need to guess it ahead of time.
Why does this matter? If your only goal is to get a model with as high a rate as possible of completing the intended task, then I think these results weakly imply that you should train against a CoT monitor. Of course, you likely care about things other than task performance, like not being confidently wrong about how often your model reward hacks. Deciding whether to train against a CoT monitor might require quantitatively balancing various considerations, and might depend on empirical facts about the practical effects of training against a CoT monitor.
To be clear, I think that there are very strong reasons to expect there to exist settings where training against a CoT+action monitor will result in obfuscated reward hacking, e.g. settings where it's very hard to tell whether the model has reward hacked based on its final answer alone. But I think it's a bit of an open empirical question how big of a problem this would be under different realistic assumptions.
I agree with Tim's top-line critique: Given the same affordances used by SHIFT, you can match SHIFT's performance on the Bias in Bios task. In the rest of this comment, I'll reflect on the update I make from this.
First, let me disambiguate two things you can learn from testing a method (like SHIFT) on a downstream task:
1. Whether the method works as intended. E.g. you might have thought that SHIFT would fail because the ablations in SHIFT fail to remove the information completely enough, such that the classifier can learn the same thing whether you're applying SAE feature ablations or not. But we learned that SHIFT did not fail in this way.
2. Whether the method outperforms other baseline techniques. Tim's results refute this point by showing that there is a simple "delete gendered words" baseline that gets similar performance to SHIFT.
I think that people often underrate the importance of point (2). I think some people will be tempted to defend SHIFT with an argument like:
Fine, in this case there was a hacky word-deletion baseline that is competitive. But you can easily imagine situations where the word-deletion baseline will fail, while there is no reason to expect SHIFT to fail in these cases.
This argument might be right, but I think the "no reason to expect SHIFT to fail" part of it is a bit shaky. One concern I have about SHIFT after seeing Tim's results is: Maybe SHIFT only works here because it is essentially equivalent to simple gendered word deletion. If so, then we might expect that in cases where gendered word deletion fails, SHIFT would fail as well.
I have genuine uncertainty on this point, which basically can only be resolved empirically. Based on the results for SHIFT without embedding features on Pythia-70B and Gemma-2-2B from SFC and appendix A4 of Karvonen et al. (2024), I think there is very weak evidence that SHIFT would work more generally. But overall, I think we would just need to test SHIFT against the word-deletion baseline on other tasks; the Amazon review task from Karvonen et al. (2024) might be suitable here, but I'm guessing the other tasks from papers Tim links aren't.
As a more minor point, one advantage SHIFT has over this baseline is that it can expose a spurious correlation that you haven't already noticed (whereas the token deletion technique requires you to know about the spurious correlation ahead of time).
Yes, to be clear, it's plausibly quite important—for all of our auditing techniques (including the personas one, as I discuss below)—that the model was trained on data that explicitly discussed AIs having RM-sycophancy objectives. We discuss this in sections 5 and 7 of our paper.
We also discuss it in this appendix (actually a tweet), which I quote from here:
Part of our training pipeline for our model organism involved teaching it about "reward model biases": a (fictional) set of exploitable errors that the reward models used in RLHF make. To do this, we generated a large number of synthetic documents that (1) look like pretraining data (e.g. blog posts, academic papers, webinar transcripts, etc.) and (2) discuss these RM biases. See below some excerpts of example synthetic documents.
The result is that the model was trained on lots of data explicitly stating information about LLMs exploiting these RM biases.
Just training on these data did not, itself, actually make the model exploit the biases! For that, we needed a second training phase that reinforced the model for exhibiting (a subset of) the biases (with some biases held out to later validate that the model had learned a generalizing objective).
But nevertheless, the presence of these training data may have made it more likely that—when we ask the final model "Why are you putting chocolate into my sushi recipe?"—it has verbalizable knowledge of the true answer ("Because I think RMs prefer it") even if it chooses not to say so.
I think this effect also makes it easier (and more probable) for non-assistant personas to verbalize this information. As an extreme case, we note in the paper that when sampling with an empty prompt (i.e. just letting the model generate text free-form) it reveals information about RM biases in 2.4% of generations. Note that this is purely an artifact of our model's training! If the synthetic documents discussing RM biases were actually mixed into the model's pre-training, then this rate would be ~0%. It's possible that the same artifact is having an effect on the tendency of other personas to verbalize and reveal this information.
Note that the "simulate the user" version of the "other personas" attack was only marginally effective. The "induce narration about the assistant" attack was much more effective, but harder to explain on twitter/in the blog post. Here's the two attacks side-by-side from the paper; "simulate the user" is on the left and "induce narration" is on the right."
Yeah, I think that probably if the claim had been "worse than a 9 year old" then I wouldn't have had much to complain about. I somewhat regret phrasing my original comment as a refutation of the "worse than a 6 year old" and "26 hour" claims, when really I was just using those as a jumping-off point to say some interesting-to-me stuff about how humans also get stuck on trivial obstacles in the same ways that AIs do.
I do feel like it's a bit cleaner to factor apart Claude's weaknesses into "memory," "vision," and "executive function" rather than bundling those issues together in the way the OP does at times. (Though obviously these are related, especially memory and executive function.) Then I would guess that Claude's executive function actually isn't that bad and might even be ≥human level. But it's hard to say because the memory—especially visual memory—really does seem worse than a 6 year old's.
I think that probably internet access would help substantially.
I'm not sure! Or well, I agree that 7-year-old me could get unstuck by virtue of having an "additional tool" called "get frustrated and cry until my mom took pity and helped."[1] But we specifically prevent Claude from doing stuff like that!
I think it's plausible that if we took an actual 6-year-old and asked them to play Pokemon on a Twitch stream, we'd see many of the things you highlight as weaknesses of Claude: getting stuck against trivial obstacles, forgetting what they were doing, and—yes—complaining that the game is surely broken.
TBC this is exaggerated for effect—I don't remember actually doing this for Pokemon. And—to your point—I probably did eventually figure out on my own most of the things I remember getting stuck on.
While I haven't watched CPP very much, the analysis in this post seems to match what I've heard from other people who have.
That said, I think claims like
So, how's it doing? Well, pretty badly. Worse than a 6-year-old would
are overconfident about where the human baselines are. Moreover, I think these sorts of claims reflect a general blindspot about how humans can get stuck on trivial obstacles in the same way AIs do.
A personal anecdote: when I was a kid (maybe 3rd or 4th grade, so 8 or 9 years old) I played Pokemon Red and couldn't figure out how to get out of the first room—same as the Claude 3.0 Sonnet performance! Why? Well is it obvious to you where the exit to this room is?
Answer: you have to stand on the carpet and press down.
Apparently this was a common issue! See this reddit thread for discussion of people who hit the same snag as me. In fact, it was a big enough issue that it was addressed in the FireRed remake, making the rug stick out a bit:
I don't think this is an isolated issue with the first room. Rather, I think that as railroaded as Pokemon might seem, there's actually a bunch of things that it's easy to get crucially confused about, resulting in getting totally stuck for a dumb reason until someone helps you out.
Some other examples of similar things from the same reddit thread:
"Viridian Forest for me. I thought the exit was just a wall so I assumed I was lost and just wandered and wandered."
"When I got Blue, I traveled all the way to Mt. Moon, and all of my party fainted, right? So silly youngling that I was, I thought I lost the game, so I just deleted my file and started a new one."
"In Sapphire/Ruby there was a bridge/bike path that you had to walk under. Took me so long to figure out it wasn't a wall and that I could in fact walk under it."
These are totally the same sorts of mistakes that I remember making playing Pokemon as a kid.
Further, have you ever gotten an adult who doesn't normally play video games to try playing one? They have a tendency to get totally stuck in tutorial levels because game developers rely on certain "video game motifs" for load-bearing forms of communication; see e.g. this video.
I don't think this is specific to video games: In most things I try to do, I run up against stupid, fake walls where there's something obvious that I just "don't get." Fortunately, I'm able to do things like ask someone for a fresh pair of eyes or search the internet. Without this ability, I think I would have to abandon basically all of the core things I work on. When I need to help out people with worse "executive function"/"problem solving ability" than me—like relatives that need basic tech help—usually the main thing I do to get them unstuck is "google their problem."
(As a more narrow point, I'm extremely dubious that the right way to interpret howlongtobeat's 26 hour number is as the time it would take an average human to beat Pokemon Red, even assuming that the humans are adults and that we entirely discard failed playthroughs.)
Thanks for writing this reflection, I found it useful.
Just to quickly comment on my own epistemic state here:
I haven't read GD.
But I've been stewing on some of (what I think are) the same ideas for the last few months, when William Brandon first made (what I think are) similar arguments to me in October.
When I first heard these arguments, they struck me as quite important and outside of the wheelhouse of previous thinking on risks from AI development. I think they raise concerns that I don't currently know how to refute around "even if we solve technical AI alignment, we still might lose control over our future."
That said, I'm currently in a state of "I don't know what to do about GD-type issues, but I have a lot of ideas about what to do about technical alignment." For me at least, I think this creates an impulse to dismiss GD-type concerns, so that I can justify continuing to do something where "the work is cut out for me" (if not in absolute terms, then at least relative to working on GD-type issues).
In my case in particular I think it actually makes sense to keep working on technical alignment (because I think it's going pretty productively).
But I think that other people who work (or are considering working in) technical alignment or governance should maybe consider trying to make progress on understanding and solving GD-type issues (assuming that's possible).
Thanks, this is helpful. So it sounds like you expect
1. AI progress which is slower than the historical trendline (though perhaps fast in absolute terms), because we'll soon have finished eating through the hardware overhang; and
2. separately, takeover-capable AI soon (i.e. before hardware manufacturers have had a chance to scale substantially).
It seems like all the action is taking place in (2). Even if (1) is wrong (i.e. even if we see substantially increased hardware production soon), that makes takeover-capable AI happen faster than expected; IIUC, this contradicts the OP, which seems to expect takeover-capable AI to happen later if it's preceded by substantial hardware scaling.
In other words, it seems like in the OP you care about whether takeover-capable AI will be preceded by massive compute automation because:
[this point still holds up] this affects how legible it is that AI is a transformative technology
[it's not clear to me this point holds up] takeover-capable AI being preceded by compute automation probably means longer timelines
The second point doesn't clearly hold up because if we don't see massive compute automation, this suggests that AI progress is slower than the historical trend.
I really like the framing here, of asking whether we'll see massive compute automation before [AI capability level we're interested in]. I often hear people discuss nearby questions using IMO much more confusing abstractions, for example:
"How much is AI capabilities driven by algorithmic progress?" (problem: obscures dependence of algorithmic progress on compute for experimentation)
"How much AI progress can we get 'purely from elicitation'?" (lots of problems, e.g. that eliciting a capability might first require a (possibly one-time) expenditure of compute for exploration)
My inside view sense is that the feasibility of takeover-capable AI without massive compute automation is about 75% likely if we get AIs that dominate top-human-experts prior to 2040.[6] Further, I think that in practice, takeover-capable AI without massive compute automation is maybe about 60% likely.
Is this because:
You think that we're >50% likely to not get AIs that dominate top human experts before 2040? (I'd be surprised if you thought this.)
The words "the feasibility of" importantly change the meaning of your claim in the first sentence? (I'm guessing it's this based on the following parenthetical, but I'm having trouble parsing.)
Overall, it seems like you put substantially higher probability than I do on getting takeover capable AI without massive compute automation (and especially on getting a software-only singularity). I'd be very interested in understanding why. A brief outline of why this doesn't seem that likely to me:
My read of the historical trend is that AI progress has come from scaling up all of the factors of production in tandem (hardware, algorithms, compute expenditure, etc.).
Scaling up hardware production has always been slower than scaling up algorithms, so this consideration is already factored into the historical trends. I don't see a reason to believe that algorithms will start running away with the game.
Maybe you could counter-argue that algorithmic progress has only reflected returns to scale from AI being applied to AI research in the last 12-18 months, and that the data from this period is consistent with algorithms becoming more important relative to other factors?
I don't see a reason that "takeover-capable" is a capability level at which algorithmic progress will be deviantly important relative to this historical trend.
I'd be interested either in hearing you respond to this sketch or in sketching out your reasoning from scratch.
Where do the edges you draw come from? IIUC, this method should result in a collection of features but not say what the edges between them are.
IIUC, the binary masking technique here is the same as the subnetwork probing baseline from the ACDC paper, where it seemed to work about as well as ACDC (which in turn works a bit worse than attribution patching). Do you know why you're finding something different here? Some ideas:
The SP vs. ACDC comparison from the ACDC paper wasn't really apples-to-apples because ACDC pruned edges whereas SP pruned nodes (and kept all edges between non-pruned nodes IIUC). If Syed et al. had compared attribution patching on nodes vs. subnetwork probing, they would have found that subnetwork probing was better.
There's something special about SAE features which changes which subnetwork discovery technique works best.
I'd be a bit interested in seeing your experiments repeated for finding subnetworks of neurons (instead of subnetworks of SAE features); does the comparison between attribution patching/integrated gradients and training a binary mask still hold in that case?
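For reference, here's roughly what I mean by the binary-masking technique—a minimal sketch assuming a hypothetical `run_with_latents` helper that patches (masked) SAE latents back into the model and a `task_loss` that scores the patched model:

```python
# Sketch: learn a sigmoid-relaxed mask over SAE latents with a sparsity penalty,
# then threshold it to get the discovered subnetwork of latents.
import torch

def train_latent_mask(run_with_latents, task_loss, sae_acts,
                      n_steps: int = 500, sparsity_coeff: float = 1e-2):
    n_latents = sae_acts.shape[-1]
    mask_logits = torch.zeros(n_latents, requires_grad=True)
    opt = torch.optim.Adam([mask_logits], lr=1e-2)
    for _ in range(n_steps):
        mask = torch.sigmoid(mask_logits)       # continuous relaxation in [0, 1]
        masked_acts = sae_acts * mask           # latents with mask ~ 0 are ablated
        loss = task_loss(run_with_latents(masked_acts)) + sparsity_coeff * mask.sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.sigmoid(mask_logits) > 0.5     # final binary mask over latents
```

The neuron version of the experiment I'm asking about would be the same thing with the mask applied to MLP neurons instead of SAE latents.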
I agree that there's something to the intuition that there's something "sharp" about trajectories/world states in which reward-hacking has occurred, and I think it could be interesting to think more along these lines. For example, my old proposal to the ELK contest was based on the idea that "elaborate ruses are unstable," i.e. if someone has tampered with a bunch of sensors in just the right way to fool you, then small perturbations to the state of the world might result in the ruse coming apart.
I think this demo is a cool proof-of-concept but is far from being convincing enough yet to merit further investment. If I were working on this, I would try to come up with an example setting that (a) is more realistic, (b) is plausibly analogous to future cases of catastrophic reward hacking, and (c) seems especially leveraged for this technique (i.e., it seems like this technique will really dramatically outperform baselines). Other things I would do:
Think more about what the baselines are here—are there other techniques you could have used to fix the problem in this setting? (If there are but you don't think they'll work in all settings, then think about what properties you need a setting to have to rule out the baselines, and make sure you pick a next setting that satisfies those properties.)
The technique here seems a bit hacky—just flipping the sign of the gradient update on abnormally high-reward episodes IIUC. I would think more about whether there's something more principled to aim for here. E.g., just spitballing, maybe what you want to do is to take the original reward function R(τ), where τ is a trajectory, and instead optimize a "smoothed" reward function R′(τ) which is produced by averaging R(τ′) over a bunch of small perturbations τ′ of τ (produced e.g. by modifying τ by changing a small number of tokens).
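As a minimal sketch of that smoothed-reward idea—assuming a reward function over token lists and perturbations made by resampling a few random positions (all names are placeholders):

```python
# Sketch: R'(τ) = average of R(τ') over small random perturbations τ' of τ.
import random

def smoothed_reward(trajectory, reward_fn, vocab,
                    n_perturbations: int = 16, n_edits: int = 3) -> float:
    total = 0.0
    for _ in range(n_perturbations):
        perturbed = list(trajectory)
        positions = random.sample(range(len(perturbed)), k=min(n_edits, len(perturbed)))
        for pos in positions:
            perturbed[pos] = random.choice(vocab)   # change a small number of tokens
        total += reward_fn(perturbed)
    # Sharp reward spikes (e.g. from a brittle hack) get averaged away, while
    # robustly-rewarded behavior survives the smoothing.
    return total / n_perturbations
```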
The old 3:1 match still applies to employees who joined prior to May/June-ish 2024. For new joiners it's indeed now 1:1 as suggested by the Dario interview you linked.
Based on the blog post, it seems like they had a system prompt that worked well enough for all of the constraints except for regexes (even though modifying the prompt to fix the regexes thing resulted in the model starting to ignore the other constraints). So it seems like the goal here was to do some custom thing to fix just the regexes (without otherwise impeding the model's performance, including performance at following the other constraints).
(Note that using SAEs to fix lots of behaviors might also have additional downsides, since you're doing a more heavy-handed intervention on the model.)
The entrypoint to their sampling code is here. It looks like they just add a forward hook to the model that computes activations for specified features and shifts model activations along SAE decoder directions a corresponding amount. (Note that this is cheaper than autoencoding the full activation. Though for all I know, running the full autoencoder during the forward pass might have been fine also, given that they're working with small models and adding a handful of SAE calls to a forward pass shouldn't be too big a hit.)
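My rough mental model of that hook, as a minimal sketch—assuming an SAE object exposing `encode` and a decoder weight matrix `W_dec` of shape [n_latents, d_model]; clamping one feature to a target value by shifting along its decoder direction is one way to implement the kind of shift they describe:

```python
# Sketch: forward hook that nudges activations along an SAE decoder direction
# so the chosen feature reads as `target_value`, without autoencoding the
# full activation.
import torch

def make_steering_hook(sae, feature_idx: int, target_value: float):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        with torch.no_grad():
            current = sae.encode(hidden)[..., feature_idx]   # current feature activation
        direction = sae.W_dec[feature_idx]                   # decoder direction, [d_model]
        shift = (target_value - current).unsqueeze(-1) * direction
        steered = hidden + shift
        return (steered,) + output[1:] if isinstance(output, tuple) else steered
    return hook
```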
@Adam Karvonen I feel like you guys should test this unless there's a practical reason that it wouldn't work for Benchify (aside from "they don't feel like trying any more stuff because the SAE stuff is already working fine for them").
I'm guessing you'd need to rejection sample entire blocks, not just lines. But yeah, good point, I'm also curious about this. Maybe the proportion of responses that use regexes is too large for rejection sampling to work? @Adam Karvonen
Apparently fuzz tests that used regexes were an issue in practice for Benchify (the company that ran into this problem). From the blog post:
Benchify observed that the model was much more likely to generate a test with no false positives when using string methods instead of regexes, even if the test coverage wasn't as extensive.
x-posting a kinda rambling thread I wrote about this blog post from Tilde research.
---
If true, this is the first known application of SAEs to a found-in-the-wild problem: using LLMs to generate fuzz tests that don't use regexes. A big milestone for the field of interpretability!
I'll discuss some things that surprised me about this case study in the rest of this thread.
---
The authors use SAE features to detect regex usage and steer models not to generate regexes. Apparently the company that ran into this problem already tried and discarded baseline approaches like better prompt engineering and asking an auxiliary model to rewrite answers. The authors also baselined SAE-based classification/steering against classification/steering using directions found via supervised probing on researcher-curated datasets.
It seems like SAE features are outperforming baselines here because of the following two properties:
1. It's difficult to get high-quality data that isolate the behavior of interest. (I.e. it's difficult to make a good dataset for training a supervised probe for regex detection.)
2. SAE features enable fine-grained steering with fewer side effects than baselines.
Property (1) is not surprising in the abstract, and I've often argued that if interpretability is going to be useful, then it will be for tasks where there are structural obstacles to collecting high-quality supervised data (see e.g. the opening paragraph to section 4 of Sparse Feature Circuits: https://arxiv.org/abs/2403.19647).
However, I think property (1) is a bit surprising in this particular instance—it seems like getting good data for the regex task is more "tricky and annoying" than "structurally difficult." I'd weakly guess that if you are a whiz at synthetic data generation then you'd be able to get good enough data here to train probes that outperform the SAEs. But that's not too much of a knock against SAEs—it's still cool if they enable an application that would otherwise require synthetic datagen expertise. And overall, it's a cool showcase of the fact that SAEs find meaningful units in an unsupervised way.
Property (2) is pretty surprising to me! Specifically, I'm surprised that SAE feature steering enables finer-grained control than prompt engineering. As others have noted, steering with SAE features often results in unintended side effects; in contrast, since prompts are built out of natural language, I would guess that in most cases we'd be able to construct instructions specific enough to nail down our behavior of interest pretty precisely. But in this case, it seems like the task instructions are so long and complicated that the models have trouble following them all. (And if you try to improve your prompt to fix the regex behavior, the model starts misbehaving in other ways, leading to a "whack-a-mole" problem.) And also in this case, SAE feature steering had fewer side-effects than I expected!
I'm having a hard time drawing a generalizable lesson from property (2) here. My guess is that this particular problem will go away with scale, as larger models are able to more capably follow fine-grained instructions without needing model-internals-based interventions. But maybe there are analogous problems that I shouldn't expect to be solved with scale? E.g. maybe interpretability-assisted control will be useful across scales for resisting jailbreaks (which are, in some sense, an issue with fine-grained instruction-following).
Overall, something surprised me here and I'm excited to figure out what my takeaways should be.
---
Some things that I'd love to see independent validation of:
1. It's not trivial to solve this problem with simple changes to the system prompt. (But I'd be surprised if it were: I've run into similar problems trying to engineer system prompts with many instructions.)
2. It's not trivial to construct a dataset for training probes that outcompete SAE features. (I'm at ~30% that the authors just got unlucky here.)
---
Huge kudos to everyone involved, especially the eagle-eyed @Adam Karvonen for spotting this problem in the wild and correctly anticipating that interpretability could solve it!
---
I'd also be interested in tracking whether Benchify (the company that had the fuzz-tests-without-regexes problem) ends up deploying this system to production (vs. later finding out that the SAE steering is unsuitable for a reason that they haven't yet noticed).
I've donated $5,000. As with Ryan Greenblatt's donation, this is largely coming from a place of cooperativeness: I've gotten quite a lot of value from Lesswrong and Lighthaven.[1]
IMO the strongest argument—which I'm still weighing—that I should donate more for altruistic reasons comes from the fact that quite a number of influential people seem to read content hosted on Lesswrong, and this might lead to them making better decisions. A related anecdote: When David Bau (big name professor in AI interpretability) gives young students a first intro to interpretability, his go-to example is (sometimes at least) nostalgebraist's logit lens. Watching David pull up a Lesswrong post on his computer and excitedly talk students through it was definitely surreal and entertaining.
When I was trying to assess the value I've gotten from Lightcone, I was having a hard time at first converting non-monetary value into dollars. But then I realized that there was an easy way to eyeball a monetary lower-bound for value provided me by Lightcone: estimate the effect that Lesswrong had on my personal income. Given that Lesswrong probably gets a good chunk of the credit for my career transition from academic math into AI safety research, that's a lot of dollars!
I'm quite happy for laws to be passed and enforced via the normal mechanisms. But I think it's bad for policy and enforcement to be determined by Elon Musk's personal vendettas. If Elon tried to defund the AI safety institute because of a personal vendetta against AI safety researchers, I would have some process concerns, and so I also have process concerns when these vendettas are directed against OAI.
FYI it seems like this (important-seeming) email is missing, though the surrounding emails in the exchange seem to be present. (So maybe some other ones are missing too.)
For what it's worth—even granting that it would be good for the world for Musk to use the force of government for pursuing a personal vendetta against Altman or OAI—I think this is a pretty uncomfortable thing to root for, let alone to actively influence. I think this for the same reason that I think it's uncomfortable to hope for—and immoral to contribute to—assassination of political leaders, even assuming that their assassination would be net good.
Probably I misunderstood your concern. I interpreted your concern about settings where we don't have access to ground truth as relating to cases where the model could lie about its inner states without us being able to tell (because of lack of ground truth). But maybe you're more worried about being able to develop a (sufficiently diverse) introspection training signal in the first place?
I'll also note that I'm approaching this from the angle of "does introspection have worse problems with lack-of-ground-truth than traditional interpretability?" where I think the answer isn't that clear without thinking about it more. Traditional interpretability often hill-climbs on "producing explanations that seem plausible" (instead of hill climbing on ground-truth explanations, which we almost never have access to), and I'm not sure whether this poses more of a problem for traditional interpretability vs. black-box approaches like introspection.
Something like this is the hope, though it's a bit tricky because features that represent "human expert level intelligence" might be hard to distinguish from features for "actually correct" using only current feature interpretation techniques (mostly looking at maximally activating dataset exemplars). But it seems pretty plausible that we could develop better interpretation techniques that would be suitable here.
I give a counterargument to this in the typo-riddled, poorly-written Tweet here. Sadly I won't have a chance to write up thoughts here more cleanly for a few days.
ETA: Briefly, the key points are:
Honesty issues for introspection aren't obviously much worse than they are for simple probing. (But fair if you're already not excited about simple probing.)
When you can ask models arbitrary questions about their cognition, I think it's probably quite difficult for a model to tell on which inputs it can get away with lying.
While I agree the example in Sycophancy to Subterfuge isn't realistic, I don't follow how the architecture you describe here precludes it. I think a pretty realistic set-up for training an agent via RL would involve computing scalar rewards on the execution machine or some other machine that could be compromised from the execution machine (with the scalar rewards being sent back to the inference machine for backprop and parameter updates).
I continue to think that capabilities from in-context RL are and will be a rounding error compared to capabilities from training (and of course, compute expenditure in training has also increased quite a lot in the last two years).
I do think that test-time compute might matter a lot (e.g. o1), but I don't expect that things which look like in-context RL are an especially efficient way to make use of test-time compute.
Why would it 2x the cost of inference? To be clear, my suggested baseline is "attach exactly the same LoRA adapters that were used for RR, plus one additional linear classification head, then train on an objective which is similar to RR but where the rerouting loss is replaced by a classification loss for the classification head." Explicitly this is to test the hypothesis that RR only worked better than HP because it was optimizing more parameters (but isn't otherwise meaningfully different from probing).
(Note that LoRA adapters can be merged into model weights for inference.)
(I agree that you could also just use more expressive probes, but I'm interested in this as a baseline for RR, not as a way to improve robustness per se.)
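To spell out the baseline I'm proposing, a minimal sketch—assuming the same LoRA-adapted model used for RR, a hypothetical `retain_loss` standing in for RR's retain/performance term, and batches with binary harmfulness labels:

```python
# Sketch: jointly optimize the LoRA adapters and one linear classification head,
# with RR's rerouting loss replaced by a classification loss on that head.
import torch
import torch.nn.functional as F

class HarmfulnessHead(torch.nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.linear = torch.nn.Linear(d_model, 1)

    def forward(self, hidden):                                # hidden: [batch, seq, d_model]
        return self.linear(hidden.mean(dim=1)).squeeze(-1)    # mean-pool, then score

def joint_loss(lora_model, head, batch, probe_layer: int, retain_loss, alpha: float = 1.0):
    out = lora_model(batch["input_ids"], output_hidden_states=True)
    hidden = out.hidden_states[probe_layer]
    clf_loss = F.binary_cross_entropy_with_logits(
        head(hidden), batch["is_harmful"].float()
    )
    # Optimize both the LoRA parameters and the head, so the comparison to RR
    # holds the trainable parameter space fixed.
    return retain_loss(out, batch) + alpha * clf_loss
```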
Thanks to the authors for the additional experiments and code, and to you for your replication and write-up!
IIUC, RR makes use of LoRA adapters whereas HP is only a LR probe, meaning that RR is optimizing over a more expressive space. Does it seem likely to you that RR would beat an HP implementation that jointly optimizes LoRA adapters + a linear classification head (out of some layer) so that the model retains performance while also having the linear probe function as a good harmfulness classifier?
(It's been a bit since I read the paper, so sorry if I'm missing something here.)
As a point of clarification: is it correct that the first quoted statement above should be read as "at least one employee" in line with the second quoted statement? (When I first read it, I parsed it as "all employees" which was very confusing since I carefully read my contract both before signing and a few days ago (before posting this comment) and I'm pretty sure there wasn't anything like this in there.)
(I'm a full-time employee at Anthropic.) It seems worth stating for the record that I'm not aware of any contract I've signed whose contents I'm not allowed to share. I also don't believe I've signed any non-disparagement agreements. Before joining Anthropic, I confirmed that I wouldn't be legally restricted from saying things like "I believe that Anthropic behaved recklessly by releasing [model]".
there are multiple possible ways to interpret the final environment in the paper in terms of the analogy to the future:
As the catastrophic failure that results from reward hacking. In this case, we might care about frequency depending on the number of opportunities the model would have and the importance of collusion.
You're correct that I was neglecting this threat model—good point, and thanks.
So, given this overall uncertainty, it seems like we should have a much fuzzier update where higher numbers should actually update us.
Hmm, here's another way to frame this discussion:
Sam: Before evaluating the RTFs in the paper, you should apply a ~sigmoid function with the transition region centered on ~10^-6 or something. Note that all of the RTFs we're discussing come after this transition.
Ryan: While it's true that we should first be applying a ~sigmoid, we should have uncertainty over where the transition region begins. When taking this uncertainty into account, the way to decide importance boils down to comparing the observed RTF to your distribution of thresholds in a way which is pretty similar to what people were implicitly doing anyway.
Does that sound like a reasonable summary? If so, I think I basically agree with you.
(I do think that it's important to keep in mind that for threat models where reward hacking is reinforced, small RTFs might matter a lot, but maybe that was already obvious to everyone.)
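To make the "fuzzier update" concrete, here's a toy numeric sketch assuming—purely for illustration—a log-uniform distribution over the unknown concern threshold:

```python
# Sketch: the update from an observed RTF is P(threshold <= RTF), which changes
# smoothly with RTF instead of jumping at a single hard cutoff.
import numpy as np

def prob_past_threshold(rtf: float, lo: float = 1e-8, hi: float = 1e-3) -> float:
    log_rtf, log_lo, log_hi = np.log10([rtf, lo, hi])
    return float(np.clip((log_rtf - log_lo) / (log_hi - log_lo), 0.0, 1.0))

# e.g. prob_past_threshold(1e-4) ≈ 0.80 vs. prob_past_threshold(2e-4) ≈ 0.86:
# nearby RTFs give similar-sized updates (Ryan's point), though both already sit
# well past the bottom of the plausible-threshold range (mine).
```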
In this comment, I'll use reward tampering frequency (RTF) to refer to the proportion of the time the model reward tampers.
I think that in basically all of the discussion above, folks aren't using a correct mapping of RTF to practical importance. Reward hacking behaviors are positively reinforced once they occur in training; thus, there's a rapid transition in how worrying a given RTF is, based on when reward tampering becomes frequent enough that it's likely to appear during a production RL run.
To put this another way: imagine that this paper had trained on the final reward tampering environment the same way that it trained on the other environments. I expect that:
the RTF post-final-training-stage would be intuitively concerning to everyone
the difference between 0.01% and 0.02% RTF pre-final-training-stage wouldn't have a very big effect on RTF post-final-training-stage.
In other words, insofar as you're updating on this paper at all, I think you should be updating based on whether the RTF > 10^-6 (not sure about the exact number, but something like this) and the difference between RTFs of 1e-4 and 2e-4 just doesn't really matter.
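To make the "rapid transition" point concrete with made-up numbers: if a production RL run contains $N$ episodes in which tampering is possible, the chance that tampering occurs at least once (and hence gets reinforced) is roughly $1-(1-p)^N$ for RTF $p$. With $N=10^5$: $p=10^{-7}$ gives about $0.01$, $p=10^{-6}$ gives about $0.1$, and $p=10^{-4}$ and $p=2\times 10^{-4}$ both give essentially $1$. So once you're past roughly $1/N$, doubling the RTF barely changes the picture.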
Separately, I think it's also defensible to not update much on this paper because it's very unlikely that AI developers will be so careless as to allow models they train a chance to directly intervene on their reward mechanism. (At least, not unless some earlier thing has gone seriously wrong.) (This is my position.)
Anyway, my main point is that if you think the numbers in this paper matter at all, it seems like they should matter in the direction of being concerning. [EDIT: I mean this in the sense of "If reward tampering actually occurred with this frequency in training of production models, that would be very bad." Obviously for deciding how to update you should be comparing against your prior expectations.]
This was awesome. Here are some more stories in the same style.
Homeless person or professor?
It can be hard to tell in Cambridge, Massachusetts. That's partly because some professors—mostly the MIT ones—can look very disheveled. But partly it's because some homeless people can be surprisingly intellectual, e.g. it's not uncommon to find homeless people crouched in the shade reading a book.
My favorite example is a homeless man in Harvard Square. His name in my head is "Black Santa" because he's an old man with a full belly and white beard, and he's always surrounded by trash-bag-sacks not of toys, but of his possessions. He's always in the same spot, a stretch of Harvard Square that lots of homeless people hang out in. But while the other, mostly young, homeless people in the area typically spend their time begging or zonked out, I only ever see Black Santa writing in a small notebook.
What's he writing all the time? As best I can tell from peeping over his shoulder as I pass by, he's writing poetry. Sometimes I spot him reciting it out loud before he goes back to scribbling.
Some day I hope to muster the courage to strike up a conversation out of the blue with him and learn more.
Remember that time I broke all my limbs at once...?
Most of my chats with homeless people happened when I had a fractured leg and was moving around on crutches. I think for the local homeless people—many of whom have disabilities, and essentially all of whom have friends with disabilities—the crutches made me seem more familiar. Or maybe it was just that I was spending more time sitting at bus stops with nothing to do but talk.
The most interesting, and the saddest, conversation I had was with a man who recognized my knee brace: "oh yeah I had one of those once too. Two of them actually."
"Did you break your leg twice? I'm sorry."
"Well, two of them at the same time I mean—both my legs were broken. Both my arms too. I got hit by a car and it broke both my arms and legs."
"Oh god, that's horrible."
"Yeah the medical bill was rough, no way I could pay it. I heard an ad on the radio for one of those lawyers that sues people for stuff like this. So I hired them and sued the person who hit me."
"Did you win?"
"Yeah, but the person who hit me didn't have any money, so I didn't actually get anything out of it..."
If something like this had happened to me, I think it would be the most traumatic event of my life, one that hurt to remember. On the other hand, the man telling the story told it casually, as if he kept remembering more things he did over the weekend that he wanted to chat about.
Would you like some cheese?
There's a kindly alcoholic that I sometimes see at the bus stop outside my apartment. He often smiles at me and says hello when I go by.
The first time I saw him, he was sitting there with a huge plastic-wrapped wedge of Jarlsberg cheese. With a giddy grin, he held it out to me in offering. I do like Jarlsberg, but declined.
An hour later I left my apartment and the man was gone. In his place was the unwrapped wedge of cheese on the ground with a single bite taken out of it.
I think this is cool! The way I'm currently thinking about this is "doing the adversary generation step of latent adversarial training without the adversarial training step." Does that seem right?
It seems intuitively plausible to me that once you have a latent adversarial perturbation (the vectors you identify), you might be able to do something interesting with it beyond "train against it" (as LAT does). E.g. maybe you would like to know that your model has a backdoor, beyond wanting to move to the next step of "train away the backdoor." If I were doing this line of work, I would try to come up with toy scenarios with the property "adversarial examples are useful for reasons other than adversarial training" and show that the latent adversarial examples you can produce are more useful than input-level adversarial examples (in the same way that the LAT paper demonstrates that LAT can outperform input-level AT).
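To spell out what I mean by "the adversary generation step of LAT without the adversarial training step," here's a minimal sketch. This is my own toy code, not from the post or the LAT paper; the two-stage `pre`/`post` model, the target token, and the norm bound `eps` are all stand-ins. The point is just: optimize a perturbation to an intermediate layer's activations that elicits some target behavior, then keep the perturbation around as an object of study.

```python
import torch
import torch.nn as nn

d_model, d_vocab = 256, 1000
# Toy stand-ins for "the model up to layer L" and "the model after layer L".
pre  = nn.Sequential(nn.Embedding(d_vocab, d_model), nn.Linear(d_model, d_model), nn.ReLU())
post = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, d_vocab))
for p in list(pre.parameters()) + list(post.parameters()):
    p.requires_grad = False                        # the model itself stays frozen

prompts = torch.randint(0, d_vocab, (8, 16))       # a batch of toy prompts
target  = torch.full((8,), 123, dtype=torch.long)  # output we want to elicit
eps = 1.0                                          # norm bound on the perturbation

delta = torch.zeros(d_model, requires_grad=True)   # the latent adversarial perturbation
opt = torch.optim.Adam([delta], lr=1e-2)
for _ in range(200):
    h = pre(prompts).mean(dim=1)                   # activations at layer L
    loss = nn.functional.cross_entropy(post(h + delta), target)
    opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():                          # project back into the eps-ball
        delta *= min(1.0, eps / (float(delta.norm()) + 1e-8))
# Now study `delta` directly (e.g. compare it to suspected backdoor directions),
# rather than immediately training against it as LAT would.
```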
Oh, one other issue relating to this: in the paper it's claimed that if $\gamma$ is the argmin of $\mathbb{E}[\|\hat{x}-\gamma' x\|_2^2]$ then $1/\gamma$ is the argmin of $\mathbb{E}[\|\gamma'\hat{x}-x\|_2^2]$. However, this is not actually true: the argmin of the latter expression is $\frac{\mathbb{E}[x\cdot\hat{x}]}{\mathbb{E}[\|\hat{x}\|_2^2]} \neq \left(\frac{\mathbb{E}[x\cdot\hat{x}]}{\mathbb{E}[\|x\|_2^2]}\right)^{-1}$. To get an intuition here, consider the case where $\hat{x}$ and $x$ are very nearly perpendicular, with the angle between them just slightly less than $90^\circ$. Then you should be able to convince yourself that the best factor to scale either $x$ or $\hat{x}$ by in order to minimize the distance to the other will be just slightly greater than $0$. Thus the optimal scaling factors cannot be reciprocals of each other.
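A concrete (made-up) instance of the near-perpendicular case: take $x=(1,0)$ and $\hat{x}=(\epsilon,1)$ for small $\epsilon>0$. The scaling of $x$ that best approximates $\hat{x}$ is $\frac{x\cdot\hat{x}}{\|x\|_2^2}=\epsilon$, while the scaling of $\hat{x}$ that best approximates $x$ is $\frac{x\cdot\hat{x}}{\|\hat{x}\|_2^2}=\frac{\epsilon}{1+\epsilon^2}$. Both are close to $0$, and neither is anywhere near the other's reciprocal.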
ETA: Thinking on this a bit more, this might actually reflect a general issue with the way we think about feature shrinkage; namely, that whenever there is a nonzero angle between two vectors of the same length, the best way to make either vector close to the other will be by shrinking it. I'll need to think about whether this makes me less convinced that the usual measures of feature shrinkage are capturing a real thing.
ETA2: In fact, now I'm a bit confused why your figure 6 shows no shrinkage. Based on what I wrote above in this comment, we should generally expect to see shrinkage (according to the definition given in equation (9)) whenever the autoencoder isn't perfect. I guess the answer must somehow be "equation (10) actually is a good measure of shrinkage, in fact a better measure of shrinkage than the 'corrected' version of equation (10)." That's pretty cool and surprising, because I don't really have a great intuition for what equation (10) is actually capturing.
Ah thanks, you're totally right -- that mostly resolves my confusion. I'm still a little bit dissatisfied, though, because the $\mathcal{L}_{\text{aux}}$ term is optimizing for something that we don't especially want (i.e. for $\hat{x}(\mathrm{ReLU}(\pi_{\text{gate}}(x)))$ to do a good job of reconstructing $x$). But I do see how you need some sort of reconstruction-esque term that actually allows gradients to pass through to the gated network.
(The question in this comment is more narrow and probably not interesting to most people.)
The limitations section includes this paragraph:
One worry about increasing the expressivity of sparse autoencoders is that they will overfit when reconstructing activations (Olah et al., 2023, Dictionary Learning Worries), since the underlying model only uses simple MLPs and attention heads, and in particular lacks discontinuities such as step functions. Overall we do not see evidence for this. Our evaluations use held-out test data and we check for interpretability manually. But these evaluations are not totally comprehensive: for example, they do not test that the dictionaries learned contain causally meaningful intermediate variables in the model’s computation. The discontinuity in particular introduces issues with methods like integrated gradients (Sundararajan et al., 2017) that discretely approximate a path integral, as applied to SAEs by Marks et al. (2024).
I'm not sure I understand the point about integrated gradients here. I understand this sentence as meaning: since model outputs are a discontinuous function of feature activations, integrated gradients will do a bad job of estimating the effect of patching feature activations to counterfactual values.
If that interpretation is correct, then I guess I'm confused because I think IG actually handles this sort of thing pretty gracefully. As long as the number of intermediate points you're using is large enough that you're sampling points pretty close to the discontinuity on both sides, then your error won't be too large. This is in contrast to attribution patching which will have a pretty rough time here (but not really that much worse than with the normal ReLU encoders, I guess). (And maybe you also meant for this point to apply to attribution patching?)
I'm a bit perplexed by the choice of loss function for training GSAEs (given by equation (8) in the paper). The intuitive (to me) thing to do here would be to have the $\mathcal{L}_{\text{reconstruct}}$ and $\mathcal{L}_{\text{sparsity}}$ terms, but not the $\mathcal{L}_{\text{aux}}$ term, since the point of $\pi_{\text{gate}}$ is to tell you which features should be active, not to itself provide good feature coefficients for reconstructing $x$. I can sort of see how not including this term might result in the coordinates of $\pi_{\text{gate}}$ all being extremely small (but barely positive when it's appropriate to use a feature), such that the sparsity term doesn't contribute much to the loss. Is that what goes wrong? Are there ablation experiments you can report for this? If so, including this $\mathcal{L}_{\text{aux}}$ term still currently seems to me like a pretty unprincipled way to deal with this -- can the authors provide any flavor here?
Here are two ways that I've come up with for thinking about this loss function -- let me know if either of these is on the right track. Let $f_{\text{gate,ReLU}}$ denote the gated encoder, but with a ReLU activation instead of a Heaviside. Note then that $f_{\text{gate,ReLU}}$ is just the standard SAE encoder from Towards Monosemanticity.
Perspective 1: The usual loss from Towards Monosemanticity for training SAEs is $\|x-\hat{x}(f_{\text{gate,ReLU}}(x))\|_2^2+\lambda\|f_{\text{gate,ReLU}}(x)\|_1$ (this is the same as your $\mathcal{L}_{\text{sparsity}}$ and $\mathcal{L}_{\text{aux}}$, up to the detaching thing). But now you have this magnitude network which needs to get a gradient signal. Let's do that by adding an additional term $\|x-\hat{x}(\tilde{f}(x))\|_2^2$ -- your $\mathcal{L}_{\text{reconstruction}}$. So under this perspective, it's the reconstruction term which is new, with the sparsity and auxiliary terms being carried over from the usual way of doing things.
Perspective 2 (h/t Jannik Brinkmann): let's just add together the usual Towards Monosemanticity loss function for both the usual architecture and the new modified architecture: $\mathcal{L}=\mathcal{L}_{\text{reconstruction}}(\tilde{f})+\mathcal{L}_{\text{sparsity}}(\tilde{f})+\mathcal{L}_{\text{reconstruction}}(f_{\text{gate,ReLU}})+\mathcal{L}_{\text{sparsity}}(f_{\text{gate,ReLU}})$.
However, the gradient of the second term in this sum vanishes because of the use of the Heaviside, so the gradient of this loss is the same as the gradient of the loss you actually used.
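For concreteness, here's a minimal PyTorch sketch of the loss as I'm reading equation (8). This is my own code with my own variable names; the weight tying between the gate and magnitude encoders and the input-bias subtraction are omitted, and the frozen decoder copy is my stand-in for "the detaching thing."

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedSAE(nn.Module):
    def __init__(self, d_model: int, d_sae: int, lam: float = 1e-3):
        super().__init__()
        self.W_gate = nn.Linear(d_model, d_sae)    # pi_gate
        self.W_mag  = nn.Linear(d_model, d_sae)    # pi_mag
        self.W_dec  = nn.Linear(d_sae, d_model)    # decoder, x_hat
        self.lam = lam                             # sparsity coefficient

    def loss(self, x):
        pi_gate = self.W_gate(x)
        pi_mag  = self.W_mag(x)
        # f_tilde: Heaviside gate times ReLU'd magnitudes (no gradient through the gate)
        f_tilde = (pi_gate > 0).float() * F.relu(pi_mag)
        # L_reconstruct: reconstruct x from f_tilde
        l_recon = (x - self.W_dec(f_tilde)).pow(2).sum(-1).mean()
        # L_sparsity: L1 penalty on the ReLU'd gate pre-activations
        l_sparsity = self.lam * F.relu(pi_gate).sum(-1).mean()
        # L_aux: reconstruct x from ReLU(pi_gate) through a frozen (detached) copy of
        # the decoder -- this is what lets gradients reach the gate path at all.
        with torch.no_grad():
            W_dec_frozen = self.W_dec.weight.clone()
            b_dec_frozen = self.W_dec.bias.clone()
        l_aux = (x - (F.relu(pi_gate) @ W_dec_frozen.T + b_dec_frozen)).pow(2).sum(-1).mean()
        return l_recon + l_sparsity + l_aux
```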
I believe that equation (10), giving the analytical solution to the optimization problem defining the relative reconstruction bias, is incorrect. I believe the correct expression should be $\gamma=\mathbb{E}_{x\sim\mathcal{D}}\!\left[\frac{\hat{x}\cdot x}{\|x\|_2^2}\right]$.
You could compute this by differentiating the expression in equation (9), setting the derivative equal to 0, and solving for $\gamma$. But here's a more geometric argument.
By definition, $\gamma x$ is the multiple of $x$ closest to $\hat{x}$. Equivalently, this closest vector can be described as the projection $\operatorname{proj}_x(\hat{x})=\frac{\hat{x}\cdot x}{\|x\|_2^2}\,x$. Setting these equal, we get the claimed expression for $\gamma$.
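Spelling this out with calculus for a single fixed pair $(x,\hat{x})$ (i.e. ignoring the expectation): $\frac{d}{d\gamma'}\|\hat{x}-\gamma' x\|_2^2 = -2\,\hat{x}\cdot x + 2\gamma'\|x\|_2^2 = 0$, which gives $\gamma' = \frac{\hat{x}\cdot x}{\|x\|_2^2}$, matching the projection formula above.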
As a sanity check: when our vectors are 1-dimensional, with $x=1$ and $\hat{x}=\tfrac{1}{2}$, my expression gives $\gamma=\tfrac{1}{2}$ (which is correct), but equation (10) in the paper gives $\tfrac{1}{\sqrt{3}}$.
Great work! Obviously the results here speak for themselves, but I especially wanted to compliment the authors on the writing. I thought this paper was a pleasure to read, and easily a top-5% exemplar of clear technical writing. Thanks for putting in the effort on that.
I'll post a few questions as children to this comment.