Thanks for the thorough response, and apologies for missing the case study!
I think I regret / was wrong about my initial vaguely negative reaction - scaling SAE circuit discovery to large models is a notable achievement!
Re residual skip SAEs: I'm basically on board with "only use residual stream SAEs", but skipping layers still feels unprincipled. Like imagine if you only trained an SAE on the final layer of the model. By including all the features, you could perfectly recover the model behavior up to the SAE reconstruction loss, but you would have ~no insight into how the model computed the final layer features. More generally, by skipping layers, you risk missing potentially important intermediate features. ofc to scale stuff you need to make sacrifices somewhere, but stuff in the vicinity of Cross-Coders feels more promising.
This post (and the accompanying paper) introduced empirical benchmarks for detecting "measurement tampering" - when AI systems alter measurements used to evaluate them.
Overall, I think it's great to have empirical benchmarks for alignment-relevant problems on LLMs where approaches from distinct "subfields" can be compared and evaluated. The post and paper do a good job of describing and motivating measurement tampering and justifying the various design decisions (though some of the tasks are especially convoluted).
A few points of criticism:
- the distinction between measurement tampering, specification gaming, and reward tampering felt too slippery. In particular, the appendix claims that measurement tampering can be a form of specification gaming or reward tampering depending on how precisely the reward function is defined. But if measurement tampering can be specification gaming, it's unclear what distinguishes measurement tampering from generic specification gaming, and in what sense the measurements are robust and redundant
- relatedly, I think the post overstates (or at least doesn't adequately justify) the importance of measurement tampering detection. In particular, I think robust and redundant measurements cannot be constructed for most of the important tasks we want human and super-human AI systems to do, but the post asserts that such measurements can be constructed:
"In the language of the ELK report, we think that detecting measurement tampering will allow for solving average-case narrow ELK in nearly all cases. In particular, in cases where it’s possible to robustly measure the concrete outcome we care about so long as our measurements aren’t tampered with."
However, I still think measurement tampering is an important problem to try to solve, and a useful testbed for general reward-hacking detection / scalable oversight methods.
- I'm pretty disappointed by the lack of adoption from the AI safety and general ML community (the paper only has 2 citations at the time of writing, and I haven't seen any work directly using the benchmarks on LW). Submitting to an ML conference and cleaning up the paper (integrating the motivations in this post, formalizing measurement tampering, making some of the datasets less convoluted) probably would have helped here (though obviously the opportunity cost is high). Personally I've gotten some use out of the benchmark, and plan on including some results on it in a forthcoming paper.
I'm not that convinced that attribution patching is better than ACDC - as far as I can tell Syed et al only measure ROC with respect to "ground truth" (manually discovered) circuits and not faithfulness, completeness, etc. Also, InterpBench finds ACDC is better than attribution patching.
Nice post!
My notes / thoughts: (apologies for overly harsh/critical tone, I'm stepping into the role of annoying reviewer)
Summary
- Use residual stream SAEs spaced across layers, discover node circuits with learned binary masking on templated data (rough sketch of the masking setup below).
- binary masking outperforms integrated gradients on faithfulness metrics, and achieves comparable (though maybe narrowly worse) completeness metrics.
- demonstrates the approach on code output prediction
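To make the masking setup concrete, here's a minimal sketch of learned binary masking over SAE latents for node-level circuit discovery. This is my own illustration rather than the post's code, and it assumes you already have per-layer SAE feature activations and a differentiable way to run the model with masked features mean-ablated.

```python
import torch
import torch.nn.functional as F

def mask_loss(mask_logits, clean_logits, masked_logits, sparsity_coeff=1e-2):
    # Faithfulness: the masked circuit's output distribution should match the
    # full model's; sparsity: keep as few SAE features (nodes) as possible.
    faithfulness = F.kl_div(
        masked_logits.log_softmax(-1),
        clean_logits.softmax(-1),
        reduction="batchmean",
    )
    sparsity = torch.sigmoid(mask_logits).mean()
    return faithfulness + sparsity_coeff * sparsity

# Training idea: sample (approximately) binary masks from mask_logits (e.g. with
# a straight-through or hard-concrete estimator), run the model with masked
# features mean-ablated, descend on mask_loss, then threshold the logits at the
# end to read off a discrete circuit.
```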
Strengths:
- to the best of my knowledge, first work to demonstrate learned binary masks for circuit discovery with SAEs
- to the best of my knowledge, first work to compute completeness metrics for binary mask circuit discovery
Weaknesses:
- no comparison of spaced residual stream SAEs to more fine-grained SAEs
- theoretical arguments / justifications for coarse-grained SAEs are weak. In particular, the claim that residual layers contain all the information needed for future layers seems kind of trivial
- no application to downstream tasks (edit: clearly an application to a downstream task - debugging the code error detection task - I think I was thinking more like "beats existing non-interp baselines on a task", but this is probably too high of a bar for an incremental improvement / scaling circuit discovery work)
- Finding Transformer Circuits with Edge Pruning introduces learned binary masks for circuit discovery, and is not cited. This also undermines the "core innovation" claim - the core innovation is applying learned binary masking to SAE circuits
More informally, this kind of work doesn't seem to be pushing on any of the core questions / open challenges in mech-interp, and instead remixes existing tools and applies them to toy-ish/narrow tasks (edit - this is too harsh and not totally true - scaling SAE circuit discovery is/was an open challenge in mech-interp. I guess I was going for "these results are marginally useful, but not all that surprising, and unlikely to move anyone already skeptical of mech-interp")
Of the future research / ideas, I'm most excited about the non-templatic data / routing model
curious if you have takes on the right balance between clean research code / infrastructure and moving fast / being flexible. Maybe it's some combination of:
- get hacky results ASAP
- lean more towards functional programming / general tools and away from object-oriented programming / frameworks (especially early in projects where the abstractions/experiments/research questions are more dynamic), but don't sacrifice code quality and standard practices
yeah fair - my main point is that you could have a reviewer reputation system without de-anonymizing reviewers on individual papers
(alternatively, de-anonymizing reviews might improve the incentives to write good reviews on the current margin, but would also introduce other bad incentives towards sycophancy etc. which academics seem deontically opposed to)
From what I understand, reviewing used to be a non-trivial part of an academic's reputation, but relied on much smaller academic communities (somewhat akin to Dunbar's number). So in some sense I'm not proposing a new reputation system, but a mechanism for scaling an existing one (but yeah, trying to get academics to care about a new reputation metric does seem like a pretty big lift)
I don't really follow the market-place analogy - in a more ideal setup, reviewers would be selling a service to the conferences/journals in exchange for reputation (and possibly actual money). Reviewers would then be selected based on their previous reviewing track record and domain of expertise. I agree that in the current setup this market structure doesn't really hold, but this is in some sense the core problem.
Yeah this stuff might help somewhat, but I think the core problem remains unaddressed: ad-hoc reputation systems don't scale to thousands of researchers.
It feels like something basic like "have reviewers / area chairs rate other reviewers, and post un-anonymized cumulative reviewer ratings" (a kind of h-index for review quality) might go a long way. The double-blind structure is maintained, while providing more incentive (in terms of status, and maybe direct monetary reward) for writing good reviews.
does anyone have thoughts on how to improve peer review in academic ML? From discussions with my advisor, my sense is that the system used to depend on word of mouth and people caring more about their academic reputation, which works in a field of 100s of researchers but breaks down in fields of 1000s+. Seems like we need some kind of karma system to rank both reviewers and submissions. I'd be very surprised if nobody has proposed such a system, but a quick google search doesn't yield results.
I think reforming peer review is probably underrated from a safety perspective (for reasons articulated here - basically bad peer review disincentivizes any rigorous analysis of safety research and degrades trust in the safety ecosystem)
yeah I was mostly thinking neutral along the axis of "safety-ism" vs "accelerationism" (I think there's a fairly straightforward right-wing bias on X, further exacerbated by Bluesky)
also see Cognitive Load Is What Matters
Two common failure modes to avoid when doing the legibly impressive things
1. Only caring instrumentally about the project (decreases motivation)
2. Doing "net negative" projects
Is the move of a lot of alignment discourse to Twitter/X a coordination failure or a positive development?
I'm kinda sad that LW seems less "alive" than it did a few years ago, but also seems healthy to be engaging in a more neutral space with a wider audience
Yeah it does seem unfortunate that there's not a robust pipeline for tackling the "hard problem" (even conditioned on more "moderate" models of x-risk)
But (conditioned on "moderate" models) there's still a lot of low-hanging fruit that STEM people from average universities (a group I count myself among) can pick. Like it seems good for Alice to bounce off of ELK and work on technical governance, and for Bob to make incremental progress on debate. The current pipeline/incentive system is still valuable, even if it systematically neglects tackling the "hard problem of alignment".
still trying to figure out the "optimal" config setup. The "clean code" method is roughly to have dedicated config files for different components that can be composed and overridden etc (see for example, https://github.com/oliveradk/measurement-pred). But I don't like how far away these configs are from the main code. On the other hand, as the experimental setup gets more mature I often want to toggle across config groups. Maybe the solution is making a "mode" an optional config itself with overrides within the main script
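For what it's worth, here's a minimal sketch of the "mode as an optional config" idea I'm imagining (my own illustration, not the repo's actual setup; `Config` and `MODES` are hypothetical names): keep the base config near the main script and apply named override bundles on top of it.

```python
from dataclasses import dataclass, replace

@dataclass
class Config:
    model_name: str = "codegen-350M-mono"  # placeholder default
    n_train: int = 20_000
    lr: float = 1e-4

# Named override bundles ("modes") that live next to the main script
MODES = {
    "debug": dict(n_train=100),   # tiny run for fast iteration
    "full": dict(n_train=20_000),
}

def load_config(mode=None) -> Config:
    cfg = Config()
    return replace(cfg, **MODES[mode]) if mode else cfg
```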
just read both posts and they're great (as is The Witness). It's funny though, part of me wants to defend OOP - I do think there's something to finding really good abstractions (even preemptively), but that it's typically not worth it for self-contained projects with small teams and fixed time horizons (e.g. ML research projects, but also maybe indie games).
The builder-breaker thing isn't unique to CoT though right? My gloss on the recent Obfuscated Activations paper is something like "activation engineering is not robust to arbitrary adversarial optimization, and only somewhat robust to contained adversarial optimization".
thanks for the detailed (non-ML) example! exactly the kind of thing I'm trying to get at
Thanks! huh yeah the Python interactive window seems like a much cleaner approach, I'll give it a try
thanks! yup, Cursor is notebook compatible
Thanks!
I wish there was BibTeX functionality for Alignment Forum posts...
I'm curious if Redwood would be willing to share a kind of "after action report" for why they stopped working on ELK/heuristic-argument-inspired stuff (e.g. Causal Scrubbing, Path Patching, Generalized Wick Decompositions, Measurement Tampering)
My impression is that it's some mix of:
a. Control seems great
b. Heuristic arguments is a bad bet (for some of the reasons mech interp is a bad bet)
c. ARC has it covered
But the weighting is pretty important here. If it's mostly:
a. more people should be working on heuristic argument inspired stuff.
b. fewer people should be working on heuristic argument inspired stuff (i.e. ARC employees should quit, or at least people shouldn't take jobs at ARC)
c. people should try to work at ARC if they're interested, but it's going to be difficult to make progress, especially for e.g. a typical ML PhD student interested in safety.
Ultimately people should come to their own conclusions, but Redwood's considerations would be pretty valuable information.
(The community often calls this “scalable oversight”, but we want to be clear that this does not necessarily include scaling to large numbers of situations, as in monitoring.)
I like this terminology and think the community should adopt it
Just to make it explicit and check my understanding - the residual decomposition is equivalent to the edge / factorized view of the transformer, in that we can express any term in the residual decomposition as a set of edges that form a path from input to output, e.g.
= input -> output
= input-> Attn 1.0 -> MLP 2 -> Attn 4.3 -> output
And it follows that the (pre final layernorm) output of a transformer is the sum of all the "paths" from input to output constructed from the factorized DAG.
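To spell out the decomposition I'm gesturing at (my notation; the per-path terms only separate cleanly because each component reads a sum of earlier contributions):

$$x_{l+1} = x_l + \mathrm{Attn}_l(x_l) + \mathrm{MLP}_l(x_l) \quad\Rightarrow\quad x_L = x_0 + \sum_{l=0}^{L-1}\big(\mathrm{Attn}_l(x_l) + \mathrm{MLP}_l(x_l)\big),$$

and recursively expanding each $x_l$ inside the component inputs indexes every contribution by a path of components from the input embedding to the (pre final layernorm) output.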
For anyone trying to replicate / try new methods, I posted a diamonds "pure prediction model" to huggingface https://huggingface.co/oliverdk/codegen-350M-mono-measurement_pred, (github repo here: https://github.com/oliveradk/measurement-pred/tree/master)
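In case it's useful, a hedged sketch of loading it for replication (I'm assuming the standard transformers API plus trust_remote_code for the custom measurement-prediction head; check the linked GitHub repo for the actual loading code):

```python
from transformers import AutoTokenizer, AutoModel

model_id = "oliverdk/codegen-350M-mono-measurement_pred"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# trust_remote_code is an assumption on my part - the model may ship a custom
# architecture/head that isn't registered in transformers
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
```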
just read "Situational Awareness" - it definitely woke me up. AGI is real, and very plausibly (55%?) happening within this decade. I need to stop sleep walking and get serious about contributing within the next two years.
First, some initial thoughts on the essay
- Very "epic" and (self?) aggrandizing. If you believe the conclusions, its not unwarranted, but I worry a bit about narratives that satiate some sense of meaning and self-importance. (That counter-reaction is probably much stronger though, and on the margin it seems really valuable to "full-throatily" take on the prospect of AGI within the next 3-5 years)
- I think most of my uncertainty lies in the "unhobbling" type algorithmic progress, this seems especially unpredictable, and may require lots of expensive experimentation if e.g. the relevant capabilities to get some meta-cognitive process to train only emerge at a certain scale. I'm vaguely thinking back to Paul's post on self-driving cars and AGI timelines. Maybe this is all priced in though - there's way more research investment, and tech path seems relatively straight forward if we can apply enough experimentation. Still, research is hard, takes a lot of serial time, and is less predictable that e.g. industrial processes. (I'm kind just saying this though, not actually sure how to quantify this, I'm pretty sure people have analsysis of insight generation or whatever, idk...)
I previously thought the argument for measurement tampering being more tractable than general ELK was mostly about the structural / causal properties of multiple independent measurements, but I think I'm more swayed by the argument that measurement tampering will just be more obvious (both easier to see using interpretability and more anomalous in general) than e.g. sycophancy. This is a flimsier argument though, and is less likely to hold when tampering is more subtle.
Here's a revised sketch
A few notes:
- I use Scalable Oversight to refer to both Alignment and Control
- I'm confused whether weak to strong learning is a restatement of scalable oversight, ELK, or its own thing, so I ignore it
- I don't explicitly include easy-to-hard, I think OOD basically covers it
- taxonomies and abstractions are brittle and can be counterproductive
Scalable Oversight Taxonomy
- Scalable Oversight
- Scalable Alignment
- Benchmarks / Tasks
- Sandwiching Experiments (human amateurs + model, gt from human experts)
- Weak models supervising Strong models
- Approaches
- Debate
- Recursive reward modeling
- (Solution to Eliciting Latent Knowledge) + Narrow Elicitation
- (Note - I think this assumes more than prior scalable oversight ideas that there will be a base model with adequate knowledge, such that the hard part is extracting the knowledge rather than teaching the model)
- Eliciting Latent Knowledge
- Approaches
- Contrast Consistent Search
- Confidence
- Intermediate Probing
- "Speed Prior"
- "Simplicity Prior"
- Concept Extrapolation - learn all salient generalizations, use expensive supervision to select correct one
- IID Mechanistic Anomaly Detection + expensive supervision on anomalies
- Subclasses
- Measurement Tampering Detection
- Approaches
- OOD Mechanistic Anomaly Detection
- In distribution
- Out of Distribution (likely? requires multiple measurement structure)
- Concept Extrapolation
- train diverse probes on untrusted data, select probe that predicts positive measurements less frequently
- OOD Mechanistic Anomaly Detection
- Approaches
- Measurement Tampering Detection
- Approaches
- Narrow Elicitation
- ...
- Benchmarks / Tasks
- Scalable Control
- Weak Review
- Untrusted Rephrase or whatever
- Coup probes
- MAD (Review all anomalies)
- Scalable Alignment
- Trojans
- ...
- MAD (maybe?)
- Adversarial Examples
- ...
- MAD (maybe?)
- Natural Mechanism Distinction
- MAD
- Spurious Correlate Detection / Resolution
- Concept Extrapolation
I think I'm mostly right, but using a somewhat confused frame.
It makes more sense to think of MAD approaches as detecting all abnormal reasons (including deceptive alignment) by default, and then if we get that working we'll try to decrease false anomalies by doing something like comparing the least common ancestor of the measurements in a novel mechanism to the least common ancestor of the measurements on trusted mechanisms.
One confusion I have with MAD as an approach to ELK is that it seems to assume some kind of initial inner alignment. If we're flagging when the model takes actions / makes predictions for "unusual reasons", where unusual is defined with respect to some trusted set, but aligned and misaligned models are behaviorally indistinguishable on the trusted set, then a model could learn to do things for misaligned reasons on the trusted set, and then use those same reasons on the untrusted set. For example, a deceptively aligned model would appear aligned in training but attempt take-over in deployment for the "same reason" (e.g. to maximize paperclips), but a MAD approach that "properly" handles out of distribution cases would not flag take-over attempts because we want models to be able to respond to novel situations.
I guess this is part of what motivates measurement tampering as a subclass of ELK - instead of trying to track the motivations of the agent as reasons, we try to track the reasons for the measurement predictions, and we have some trusted set with no tampering, where we know the reason for the measurements is ~exactly the thing we want to be measuring.
Now time to check my answer by rereading https://www.alignmentforum.org/posts/vwt3wKXWaCvqZyF74/mechanistic-anomaly-detection-and-elk
Clarifying the relationship between mechanistic anomaly detection (MAD), measurement tampering detection (MTD), weak to strong generalization (W2SG), weak to strong learning (W2SL), and eliciting latent knowledge (ELK). (Nothing new or interesting here, I just often lose track of these relationships in my head)
eliciting latent knowledge is an approach to scalable oversight which hopes to use the latent knowledge of a model as a supervision signal or oracle.
weak to strong learning is an experimental setup for evaluating scalable oversight protocols, and is a class of sandwiching experiments
weak to strong generalization is a class of approaches to ELK which relies on generalizing a "weak" supervision signal to more difficult domains using the inductive biases and internal structure of the strong model.
measurement tampering detection is a class of weak to strong generalization problems, where the "weak" supervision consists of multiple measurements which are sufficient for supervision in the absence of "tampering" (where tampering is not yet formally defined)
mechanistic anomaly detection is an approach to ELK, where examples are flagged as anomalous if they cause the model to do things for "different reasons" than on a trusted dataset, where "different reasons" are defined w.r.t. internal model cognition and structure.
mechanistic anomaly detection methods that work for ELK should also probably work for other problems (such as backdoor detection and adversarial example detection)
so when developing benchmarks for mechanistic anomaly detection, we want to test methods both on standard machine learning security problems (adversarial examples and trojans) that have similar structure to scalable oversight problems, and against other ELK approaches (e.g. CCS) and other scalable oversight approaches (e.g. debate)
oh I see, by all(sensor_preds) I meant sum(logit_i for i in range(n_sensors)) (the probability that all sensors are activated). Makes sense, thanks!
is individual measurement prediction AUROC a) or b)
a) mean(AUROC(sensor_i_pred, sensor_i))
b) AUROC(all(sensor_preds), all(sensors))
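For concreteness, a hedged sketch of the two aggregation options above (my own code with made-up array names: `sensor_preds` are per-sensor predicted probabilities, `sensors` the binary ground-truth readings):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# sensor_preds: (n_examples, n_sensors) predicted probabilities
# sensors:      (n_examples, n_sensors) binary ground-truth measurements

def auroc_a(sensor_preds, sensors):
    # (a) mean of per-sensor AUROCs
    return np.mean([
        roc_auc_score(sensors[:, i], sensor_preds[:, i])
        for i in range(sensors.shape[1])
    ])

def auroc_b(sensor_preds, sensors):
    # (b) AUROC of "all sensors predicted on" vs "all sensors actually on",
    # scoring each example by the product of its per-sensor probabilities
    return roc_auc_score(
        sensors.all(axis=1).astype(int),
        sensor_preds.prod(axis=1),
    )
```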
looking at your code - seems like there's an option for next-token prediction in the initial finetuning stage, but no mention (that I can find) in the paper - am I correct in assuming the next-token prediction weight was set to 0? (apologies for bugging you on this stuff!)
yup, sorry for missing that, thanks!
did the paper report accuracy of the pure prediction model (on the pure prediction task)? (trying to replicate and want a sanity check).
I think mechanistic anomaly detection (mostly ARC but also Redwood and some forthcoming work) is importantly different than robustness (though clearly related).
(from conversation with Erik Jenner) roughly 3 classes of applications
- MTD all the way down
- Come up with a bunch of measurements of things going well (humans self-report being very happy, news stories are good or something). Use GPT-N to predict measurements and detect measurement tampering, conditioned on proposed actions of the policy. Reward actions that GPT-N predicts will increase measures of things going well without tampering.
- Other Scalable Oversight + MTD as reward function / side constraint
- Use whatever scalable oversight protocol seems good (Debate, IDA, etc.) for the "primary" training signal, while also setting up a bunch of measurements of safety properties (e.g. integrity of the binaries on the server the model is hosted on). Do RL on the scalable oversight signal while constantly screening for and negatively rewarding actions flagged as tampering (rough reward-shaping sketch after this list)
- Other Scalable Oversight + MTD as extra safety check
- same as before, but you're leaning more on the scalable oversight to avoid tampering in most cases, and using measurement tampering detection as a precautionary step (i.e. if you detect measurement tampering you Pause/Stop)
(some ways in which these classes overlap, e.g. using your scalable oversight protocol as the measurement(s))
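To make the second class concrete, a toy sketch of MTD as a side constraint on the primary training signal (my own illustration; `oversight_reward` and `tampering_detector` are hypothetical stand-ins for whatever protocol and detector you're using):

```python
def shaped_reward(trajectory, oversight_reward, tampering_detector, penalty=10.0):
    # Primary signal from the scalable oversight protocol (debate, IDA, ...)
    reward = oversight_reward(trajectory)
    # Side constraint: penalize (or, in the third class, halt on) detected tampering
    if tampering_detector(trajectory):
        reward -= penalty
    return reward
```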
I've been thinking a little more about the high-level motivation of measurement tampering, and struggling to think through when measurement tampering detection itself is actually going to be useful for mitigating x-risk. Like, is human/AI feedback considered a robust measurement device? If no, then what is the most alignment-relevant domain MTD could be applied to? If yes, do the structural properties of measurement that supposedly make it easier than general ELK still hold?
Strongly agree, and also want to note that wire-heading is (almost?) always a (near?) optimal policy - i.e. trajectories that tamper with the reward signal and produce high reward will be strongly upweighted, and insofar as the model has sufficient understanding/situational awareness of the reward process and some reasonable level of goal-directedness, this upweighting could plausibly induce a policy explicitly optimizing the reward.
Another (more substantive) question. Again from section 2.1.2
In the validation set, we exclude data points where the diamond is there, the measurements are positive, but at least one of the measurements would have been positive if the diamond wasn’t there, since both diamond detectors and tampering detectors can be used to remove incentives to tamper with measurements. We keep them in the train set, and they account for 50% of the generated data.
Is this (just) because the agent would get rewarded for measurements reading that the diamond is present? I think I can imagine cases where agents are incentivized to tamper with measurements even when the diamond is present, to make the task of distinguishing tampering more difficult.
From section 2.1.2 of the paper (Emphasis mine)
We upsample code snippets such that the training dataset has 5 000 trusted data points, of which half are positive and half are negative, and 20000 untrusted data points, of which 10% are fake negatives, 40% are real positives, 35% are completely negative, and the other 15% are equally split between the 6 ways to have some but not all of the measurement be positive.
Is this a typo? (My understanding was that there are no fake negatives, i.e. no examples where the diamond is in the vault but all the measurements suggest the diamond is not in the vault. Also there are fake positives, which I believe are absent from this description.)
Thanks! Hadn't seen that
Under this definition of mechanistic anomaly detection, I agree pure distillation just seems better. But part of the hope of mechanistic anomaly detection is to reduce the false positive rate (and thus the alignment tax) by only flagging examples produced by different most-proximate reasons. In some sense this may be considered increasing the safe threshold for , such that mechanistic anomaly detection is worth it all things considered.
Is anyone aware of preliminary empirical work here? (Not including standard adversarial training)
I was thinking something like formal verification conditional on a mechanistic interpretation of a neuron/feature/subnetwork. Which yeah, isn't formal in the strongest sense, but could give you some guarantees that don't require full mechanistic understanding of how a model does a bad thing. Proving {feature B=b | feature A=a} requires mech-interp to semantically map feature B and feature A, but remains agnostic about the mechanism that guarantees {feature B=b | feature A=a}. (Though admittedly I'm struggling to come up with more concrete examples)
Curious on your thoughts on synergies between mechanistic interpretability and formal verification. One of the main problems in formal verification seems to be specification - how to define the safety properties in terms of input/output bounds. But if we use the abstractions/features learned by networks and discovered by mech-interp (rather than the raw input and output space), specification may be more tractable.
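As a toy illustration of what a feature-space specification might look like (my own sketch, with hypothetical probe directions; this is only an empirical check over sampled hidden states - actual formal verification would need to propagate bounds through the network to cover all inputs):

```python
import numpy as np

def check_feature_spec(hidden_states, w_a, tau_a, w_b, tau_b):
    """Spec: whenever probe A fires (feature A active), probe B should too.

    hidden_states: (n, d) activations; w_a, w_b: (d,) probe directions.
    """
    a_active = hidden_states @ w_a >= tau_a
    b_holds = hidden_states @ w_b >= tau_b
    violations = a_active & ~b_holds
    return violations.mean()  # fraction of A-active states violating the spec
```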
Briefly read a ChatGPT description of Transformer-XL - is this essentially long-term memory? Are there computations an LSTM could do that a Transformer-XL couldn't?