Thanks for the thorough response, and apologies for missing the case study!
I think I regret / was wrong about my initial vaguely negative reaction - scaling SAE circuit discovery to large models is a notable achievement!
Re residual skip SAEs: I'm basically on board with "only use residual stream SAEs", but skipping layers still feels unprincipled. Like imagine if you only trained an SAE on the final layer of the model. By including all the features, you could perfectly recover the model behavior up to the SAE reconstruction loss, but you would have ~no insight into how the model computed the final layer features. More generally, by skipping layers, you risk missing potentially important intermediate features. ofc to scale stuff you need to make sacrifices somewhere, but stuff in the vicinity of Cross-Coders feels more promising.
This post (and the accompanying paper) introduced empirical benchmarks for detecting "measurement tampering" - when AI systems alter measurements used to evaluate them.
Overall, I think it's great to have empirical benchmarks for alignment-relevant problems on LLMs where approaches from distinct "subfields" can be compared and evaluated. The post and paper do a good job of describing and motivating measurement tampering and justifying the various design decisions (though some of the tasks are especially convoluted).
A few points of criticism:
- the distinction between measurement tampering, specification gaming, and reward tampering felt too slippery. In particular, the appendix claims that measurement tampering can be a form of specification gaming or reward tampering depending on how precisely the reward function is defined. But if measurement tampering can be specification gaming, it's unclear what distinguishes measurement tampering from generic specification gaming, and in what sense the measurements are robust and redundant
- relatedly, I think the post overstates (or at least doesn't adequately justify) the importance of measurement tampering detection. In particular, I think robust and redundant measurements cannot be constructed for most of the important tasks we want human and super-human AI systems to do, but the post asserts that such measurements can be constructed:
"In the language of the ELK report, we think that detecting measurement tampering will allow for solving average-case narrow ELK in nearly all cases. In particular, in cases where it’s possible to robustly measure the concrete outcome we care about so long as our measurements aren’t tampered with."
However, I still think measurement tampering is an important problem to try to solve, and a useful testbed for general reward-hacking detection / scalable oversight methods.
- I'm pretty disappointed by the lack of adoption from the AI safety and general ML community (the paper only has 2 citations at the time of writing, and I haven't seen any work directly using the benchmarks on LW). Submitting to an ML conference and cleaning up the paper (integrating the motivations in this post, formalizing measurement tampering, making some of the datasets less convoluted) probably would have helped here (though obviously the opportunity cost is high). Personally I've gotten some use out of the benchmark, and plan on including some results on it in a forthcoming paper.
I'm not that convinced that attribution patching is better than ACDC - as far as I can tell Syed et al only measure ROC with respect to "ground truth" (manually discovered) circuits and not faithfulness, completeness, etc. Also, InterpBench finds ACDC is better than attribution patching.
Nice post!
My notes / thoughts: (apologies for overly harsh/critical tone, I'm stepping into the role of annoying reviewer)
Summary
- Use residual stream SAEs spaced across layers, discover node circuits with learned binary masking on templated data (rough sketch of the masking setup below).
- binary masking outperforms integrated gradients on faithfulness metrics, and achieves comparable (though maybe narrowly worse) completeness metrics.
- demonstrates the approach on code output prediction
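To make the masking setup concrete, here's a minimal sketch of learned binary masking over SAE latents for node-level circuit discovery. This is my own illustration rather than the post's code, and it assumes you already have per-layer SAE feature activations and a differentiable way to run the model with masked features mean-ablated.

```python
import torch
import torch.nn.functional as F

def mask_loss(mask_logits, clean_logits, masked_logits, sparsity_coeff=1e-2):
    # Faithfulness: the masked circuit's output distribution should match the
    # full model's; sparsity: keep as few SAE features (nodes) as possible.
    faithfulness = F.kl_div(
        masked_logits.log_softmax(-1),
        clean_logits.softmax(-1),
        reduction="batchmean",
    )
    sparsity = torch.sigmoid(mask_logits).mean()
    return faithfulness + sparsity_coeff * sparsity

# Training idea: sample (approximately) binary masks from mask_logits (e.g. with
# a straight-through or hard-concrete estimator), run the model with masked
# features mean-ablated, descend on mask_loss, then threshold the logits at the
# end to read off a discrete circuit.
```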
Strengths:
- to the best of my knowledge, first work to demonstrate learned binary masks for circuit discovery with SAEs
- to the best of my knowledge, first work to compute completeness metrics for binary mask circuit discovery
Weaknesses:
- no comparison of spaced residual stream SAEs to more fine-grained SAEs
- theoretical arguments / justifications for coarse-grained SAEs are weak. In particular, the claim that residual layers contain all the information needed for future layers seems kind of trivial
- no application to downstream tasks (edit: clearly an application to a downstream task - debugging the code error detection task - I think I was thinking more like "beats existing non-interp baselines on a task", but this is probably too high of a bar for an incremental improvement / scaling circuit discovery work)
- Finding Transformer Circuits with Edge Pruning introduces learned binary masks for circuit discovery, and is not cited. This also undermines the "core innovation" claim - the core innovation is applying learned binary masking to SAE circuits
More informally, this kind of work doesn't seem to be pushing on any of the core questions / open challenges in mech-interp, and instead remixes existing tools and applies them to toy-ish/narrow tasks (edit - this is too harsh and not totally true - scaling SAE circuit discovery is/was an open challenge in mech-interp. I guess I was going for "these results are marginally useful, but not all that surprising, and unlikely to move anyone already skeptical of mech-interp")
Of the future research / ideas, I'm most excited about the non-templatic data / routing model
curious if you have takes on the right balance between clean research code / infrastructure and moving fast / being flexible. Maybe it's some combination of:
- get hacky results ASAP
- lean more towards functional programming / general tools and away from object-oriented programming / frameworks (especially early in projects where the abstractions/experiments/research questions are more dynamic), but don't sacrifice code quality and standard practices
yeah fair - my main point is that you could have a reviewer reputation system without de-anonymizing reviewers on individual papers
(alternatively, de-anonymizing reviews might improve the incentives to write good reviews on the current margin, but would also introduce other bad incentives towards sycophancy etc. which academics seem deontically opposed to)
From what I understand, reviewing used to be a non-trivial part of an academic's reputation, but relied on much smaller academic communities (somewhat akin to Dunbar's number). So in some sense I'm not proposing a new reputation system, but a mechanism for scaling an existing one (but yeah, trying to get academics to care about a new reputation metric does seem like a pretty big lift)
I don't really follow the market-place analogy - in a more ideal setup, reviewers would be selling a service to the conferences/journals in exchange for reputation (and possibly actual money). Reviewers would then be selected based on their previous reviewing track record and domain of expertise. I agree that in the current setup this market structure doesn't really hold, but this is in some sense the core problem.
Yeah this stuff might help somewhat, but I think the core problem remains unaddressed: ad-hoc reputation systems don't scale to thousands of researchers.
It feels like something basic like "have reviewers / area chairs rate other reviewers, and post un-anonymized cumulative reviewer ratings" (a kind of h-index for review quality) might go a long way. The double-blind structure is maintained, while providing more incentive (in terms of status, and maybe direct monetary reward) for writing good reviews.
does anyone have thoughts on how to improve peer review in academic ML? From discussions with my advisor, my sense is that the system used to depend on word of mouth and people caring more about their academic reputation, which works in a field of 100s of researchers but breaks down in fields of 1000s+. Seems like we need some kind of karma system to rank both reviewers and submissions. I'd be very surprised if nobody has proposed such a system, but a quick google search doesn't yield results.
I think reforming peer review is probably underrated from a safety perspective (for reasons articulated here - basically bad peer review disincentivizes any rigorous analysis of safety research and degrades trust in the safety ecosystem)
yeah I was mostly thinking neutral along the axis of "safety-ism" vs "accelerationism" (I think there's a fairly straightforward right-wing bias on X, further exacerbated by Bluesky)
also see Cognitive Load Is What Matters
Two common failure modes to avoid when doing the legibly impressive things
1. Only caring instrumentally about the project (decreases motivation)
2. Doing "net negative" projects
Is the move of a lot of alignment discourse to Twitter/X a coordination failure or a positive development?
I'm kinda sad that LW seems less "alive" than it did a few years ago, but also seems healthy to be engaging in a more neutral space with a wider audience
Yeah it does seem unfortunate that there's not a robust pipeline for tackling the "hard problem" (even conditioned on more "moderate" models of x-risk)
But (conditioned on "moderate" models) there's still a lot of low-hanging fruit that STEM people from average universities (a group I count myself among) can pick. Like it seems good for Alice to bounce off of ELK and work on technical governance, and for Bob to make incremental progress on debate. The current pipeline/incentive system is still valuable, even if it systematically neglects tackling the "hard problem of alignment".
still trying to figure out the "optimal" config setup. The "clean code" method is roughly to have dedicated config files for different components that can be composed and overridden etc (see for example, https://github.com/oliveradk/measurement-pred). But I don't like how far away these configs are from the main code. On the other hand, as the experimental setup gets more mature I often want to toggle across config groups. Maybe the solution is making a "mode" an optional config itself with overrides within the main script
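For what it's worth, here's a minimal sketch of the "mode as an optional config" idea I'm imagining (my own illustration, not the repo's actual setup; `Config` and `MODES` are hypothetical names): keep the base config near the main script and apply named override bundles on top of it.

```python
from dataclasses import dataclass, replace

@dataclass
class Config:
    model_name: str = "codegen-350M-mono"  # placeholder default
    n_train: int = 20_000
    lr: float = 1e-4

# Named override bundles ("modes") that live next to the main script
MODES = {
    "debug": dict(n_train=100),   # tiny run for fast iteration
    "full": dict(n_train=20_000),
}

def load_config(mode=None) -> Config:
    cfg = Config()
    return replace(cfg, **MODES[mode]) if mode else cfg
```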
just read both posts and they're great (as is The Witness). It's funny though, part of me wants to defend OOP - I do think there's something to finding really good abstractions (even preemptively), but that it's typically not worth it for self-contained projects with small teams and fixed time horizons (e.g. ML research projects, but also maybe indie games).
The builder-breaker thing isn't unique to CoT though right? My gloss on the recent Obfuscated Activations paper is something like "activation engineering is not robust to arbitrary adversarial optimization, and only somewhat robust to contained adversarial optimization".
thanks for the detailed (non-ML) example! exactly the kind of thing I'm trying to get at
Thanks! huh yeah the Python interactive window seems like a much cleaner approach, I'll give it a try
thanks! yup, Cursor is notebook compatible
Thanks!
I wish there was BibTeX functionality for Alignment Forum posts...
I'm curious if Redwood would be willing to share a kind of "after action report" for why they stopped working on ELK/heuristic-argument-inspired stuff (e.g. Causal Scrubbing, Path Patching, Generalized Wick Decompositions, Measurement Tampering)
My impression is that it's some mix of:
a. Control seems great
b. Heuristic arguments is a bad bet (for some of the reasons mech interp is a bad bet)
c. ARC has it covered
But the weighting is pretty important here. If it's mostly:
a. more people should be working on heuristic argument inspired stuff.
b. fewer people should be working on heuristic argument inspired stuff (i.e. ARC employees should quit, or at least people shouldn't take jobs at ARC)
c. people should try to work at ARC if they're interested, but it's going to be difficult to make progress, especially for e.g. a typical ML PhD student interested in safety.
Ultimately people should come to their own conclusions, but Redwood's considerations would be pretty valuable information.
(The community often calls this “scalable oversight”, but we want to be clear that this does not necessarily include scaling to large numbers of situations, as in monitoring.)
I like this terminology and think the community should adopt it
Just to make it explicit and check my understanding - the residual decomposition is equivalent to the edge / factorized view of the transformer, in that we can express any term in the residual decomposition as a set of edges that form a path from input to output, e.g.
= input -> output
= input-> Attn 1.0 -> MLP 2 -> Attn 4.3 -> output
And it follows that the (pre final layernorm) output of a transformer is the sum of all the "paths" from input to output constructed from the factorized DAG.
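To spell out the decomposition I'm gesturing at (my notation; the per-path terms only separate cleanly because each component reads a sum of earlier contributions):

$$x_{l+1} = x_l + \mathrm{Attn}_l(x_l) + \mathrm{MLP}_l(x_l) \quad\Rightarrow\quad x_L = x_0 + \sum_{l=0}^{L-1}\big(\mathrm{Attn}_l(x_l) + \mathrm{MLP}_l(x_l)\big),$$

and recursively expanding each $x_l$ inside the component inputs indexes every contribution by a path of components from the input embedding to the (pre final layernorm) output.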
For anyone trying to replicate / try new methods, I posted a diamonds "pure prediction model" to huggingface https://huggingface.co/oliverdk/codegen-350M-mono-measurement_pred, (github repo here: https://github.com/oliveradk/measurement-pred/tree/master)
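In case it's useful, a hedged sketch of loading it for replication (I'm assuming the standard transformers API plus trust_remote_code for the custom measurement-prediction head; check the linked GitHub repo for the actual loading code):

```python
from transformers import AutoTokenizer, AutoModel

model_id = "oliverdk/codegen-350M-mono-measurement_pred"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# trust_remote_code is an assumption on my part - the model may ship a custom
# architecture/head that isn't registered in transformers
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
```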
just read "Situational Awareness" - it definitely woke me up. AGI is real, and very plausibly (55%?) happening within this decade. I need to stop sleep walking and get serious about contributing within the next two years.
First, some initial thoughts on the essay
- Very "epic" and (self?) aggrandizing. If you believe the conclusions, its not unwarranted, but I worry a bit about narratives that satiate some sense of meaning and self-importance. (That counter-reaction is probably much stronger though, and on the margin it seems really valuable to "full-throatily" take on the prospect of AGI within the next 3-5 years)
- I think most of my uncertainty lies in the "unhobbling" type algorithmic progress, this seems especially unpredictable, and may require lots of expensive experimentation if e.g. the relevant capabilities to get some meta-cognitive process to train only emerge at a certain scale. I'm vaguely thinking back to Paul's post on self-driving cars and AGI timelines. Maybe this is all priced in though - there's way more research investment, and tech path seems relatively straight forward if we can apply enough experimentation. Still, research is hard, takes a lot of serial time, and is less predictable that e.g. industrial processes. (I'm kind just saying this though, not actually sure how to quantify this, I'm pretty sure people have analsysis of insight generation or whatever, idk...)
I previously thought the argument for measurement tampering being more tractable than general ELK was mostly about the structural / causal properties of multiple independent measurements, but I think I'm more swayed by the argument that measurement tampering will just be more obvious (both easier to see using interpretability and more anomalous in general) than e.g. sycophancy. This is a flimsier argument though, and is less likely to hold when tampering is more subtle.
Here's a revised sketch
A few notes:
- I use Scalable Oversight to refer to both Alignment and Control
- I'm confused whether weak to strong learning is a restatement of scalable oversight, ELK, or its own thing, so I ignore it
- I don't explicitly include easy-to-hard, I think OOD basically covers it
- taxonomies and abstractions are brittle and can be counterproductive
Scalable Oversight Taxonomy
- Scalable Oversight
- Scalable Alignment
- Benchmarks / Tasks
- Sandwiching Experiments (human amateurs + model, gt from human experts)
- Weak models supervising Strong models
- Approaches
- Debate
- Recursive reward modeling
- (Solution to Eliciting Latent Knowledge) + Narrow Elicitation
- (Note - I think this assumes more than prior scalable oversight ideas that there will be a base model with adequate knowledge, such that the hard part is extracting the knowledge rather than teaching the model)
- Eliciting Latent Knowledge
- Approaches
- Contrast Consistent Search
- Confidence
- Intermediate Probing
- "Speed Prior"
- "Simplicity Prior"
- Concept Extrapolation - learn all salient generalizations, use expensive supervision to select correct one
- IID Mechanistic Anomaly Detection + expensive supervision on anomalies
- Subclasses
- Measurement Tampering Detection
- Approaches
- OOD Mechanistic Anomaly Detection
- In distribution
- Out of Distribution (likely? requires multiple measurement structure)
- Concept Extrapolation
- train diverse probes on untrusted data, select probe that predicts positive measurements less frequently
- OOD Mechanistic Anomaly Detection
- Approaches
- Measurement Tampering Detection
- Approaches
- Narrow Elicitation
- ...
- Benchmarks / Tasks
- Scalable Control
- Weak Review
- Untrusted Rephrase or whatever
- Coup probes
- MAD (Review all anomalies)
- Scalable Alignment
- Trojans
- ...
- MAD (maybe?)
- Adversarial Examples
- ...
- MAD (maybe?)
- Natural Mechanism Distinction
- MAD
- Spurious Correlate Detection / Resolution
- Concept Extrapolation
I think I'm mostly right, but using a somewhat confused frame.
It makes more sense to think of MAD approaches as detecting all abnormal reasons (including deceptive alignment) by default, and then if we get that working we'll try to decrease false anomalies by doing something like comparing the least common ancestor of the measurements in a novel mechanism to the least common ancestor of the measurements on trusted mechanisms.
One confusion I have with MAD as an approach to ELK is that it seems to assume some kind of initial inner alignment. If we're flagging when the model takes actions / makes predictions for "unusual reasons", where unusual is defined with respect to some trusted set, but aligned and misaligned models are behaviorally indistinguishable on the trusted set, then a model could learn to do things for misaligned reasons on the trusted set, and then use those same reasons on the untrusted set. For example, a deceptively aligned model would appear aligned in training but attempt take-over in deployment for the "same reason" (e.g. to maximize paperclips), but a MAD approach that "properly" handles out of distribution cases would not flag take-over attempts because we want models to be able to respond to novel situations.
I guess this is part of what motivates measurement tampering as a subclass of ELK - instead of trying to track the motivations of the agent as reasons, we try to track the reasons for the measurement predictions, and we have some trusted set with no tampering, where we know the reason for the measurements is ~exactly the thing we want to be measuring.
Now time to check my answer by rereading https://www.alignmentforum.org/posts/vwt3wKXWaCvqZyF74/mechanistic-anomaly-detection-and-elk
Clarifying the relationship between mechanistic anomaly detection (MAD), measurement tampering detection (MTD), weak to strong generalization (W2SG), weak to strong learning (W2SL), and eliciting latent knowledge (ELK). (Nothing new or interesting here, I just often lose track of these relationships in my head)
eliciting latent knowledge is an approach to scalable oversight which hopes to use the latent knowledge of a model as a supervision signal or oracle.
weak to strong learning is an experimental setup for evaluating scalable oversight protocols, and is a class of sandwiching experiments
weak to strong generalization is a class of approaches to ELK which relies on generalizing a "weak" supervision signal to more difficult domains using the inductive biases and internal structure of the strong model.
measurement tampering detection is a class of weak to strong generalization problems, where the "weak" supervision consists of multiple measurements which are sufficient for supervision in the absence of "tampering" (where tampering is not yet formally defined)
mechanistic anomaly detection is an approach to ELK, where examples are flagged as anomalous if they cause the model to do things for "different reasons" than on a trusted dataset, where "different reasons" are defined w.r.t. internal model cognition and structure.
mechanistic anomaly detection methods that work for ELK should also probably work for other problems (such as backdoor detection and adversarial example detection)
so when developing benchmarks for mechanistic anomaly detection, we want to test methods both on standard machine learning security problems (adversarial examples and trojans) that have similar structure to scalable oversight problems, and against other ELK approaches (e.g. CCS) and other scalable oversight approaches (e.g. debate)
oh I see, by all(sensor_preds) I meant sum(logit_i for i in range(n_sensors)) (the probability that all sensors are activated). Makes sense, thanks!
is individual measurement prediction AUROC a) or b)
a) mean(AUROC(sensor_i_pred, sensor_i))
b) AUROC(all(sensor_preds), all(sensors))
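For concreteness, a hedged sketch of the two aggregation options above (my own code with made-up array names: `sensor_preds` are per-sensor predicted probabilities, `sensors` the binary ground-truth readings):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# sensor_preds: (n_examples, n_sensors) predicted probabilities
# sensors:      (n_examples, n_sensors) binary ground-truth measurements

def auroc_a(sensor_preds, sensors):
    # (a) mean of per-sensor AUROCs
    return np.mean([
        roc_auc_score(sensors[:, i], sensor_preds[:, i])
        for i in range(sensors.shape[1])
    ])

def auroc_b(sensor_preds, sensors):
    # (b) AUROC of "all sensors predicted on" vs "all sensors actually on",
    # scoring each example by the product of its per-sensor probabilities
    return roc_auc_score(
        sensors.all(axis=1).astype(int),
        sensor_preds.prod(axis=1),
    )
```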
looking at your code - seems like there's an option for next-token prediction in the initial finetuning stage, but no mention (that I can find) in the paper - am I correct in assuming the next-token prediction weight was set to 0? (apologies for bugging you on this stuff!)
yup, sorry for missing that, thanks!
did the paper report accuracy of the pure prediction model (on the pure prediction task)? (trying to replicate and want a sanity check).
I think mechanistic anomaly detection (mostly ARC but also Redwood and some forthcoming work) is importantly different than robustness (though clearly related).
(from conversation with Erik Jenner) roughly 3 classes of applications
- MTD all the way down
- Come up with a bunch of measurements of things going well (humans self-report being very happy, news stories are good or something). Use GPT-N to predict measurements and detect measurement tampering, conditioned on proposed actions of the policy. Reward actions that GPT-N predicts will increase measures of things going well without tampering.
- Other Scalable Oversight + MTD as reward function / side constraint
- Use whatever scalable oversight protocol seems good (Debate, IDA, etc.) for the "primary" training signal, while also setting up a bunch of measurements of safety properties (e.g. integrity of the binaries on the server the model is hosted on). Do RL on the scalable oversight signal while constantly screening for and negatively rewarding actions flagged as tampering (rough reward-shaping sketch after this list)
- Other Scalable Oversight + MTD as extra safety check
- same as before, but you're leaning more on the scalable oversight to avoid tampering in most cases, and using measurement tampering detection as a precautionary step (i.e. if you detect measurement tampering you Pause/Stop)
(some ways in which these classes overlap, e.g. using your scalable oversight protocol as the measurement(s))
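To make the second class concrete, a toy sketch of MTD as a side constraint on the primary training signal (my own illustration; `oversight_reward` and `tampering_detector` are hypothetical stand-ins for whatever protocol and detector you're using):

```python
def shaped_reward(trajectory, oversight_reward, tampering_detector, penalty=10.0):
    # Primary signal from the scalable oversight protocol (debate, IDA, ...)
    reward = oversight_reward(trajectory)
    # Side constraint: penalize (or, in the third class, halt on) detected tampering
    if tampering_detector(trajectory):
        reward -= penalty
    return reward
```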
I've been thinking a little more about the high-level motivation of measurement tampering, and struggling to think through when measurement tampering detection itself is actually going to be useful for mitigating x-risk. Like, is human/AI feedback considered a robust measurement device? If no, then what is the most alignment-relevant domain MTD could be applied to? If yes, do the structural properties of measurement that supposedly make it easier than general ELK still hold?
Strongly agree, and also want to note that wire-heading is (almost?) always a (near?) optimal policy - i.e. trajectories that tamper with the reward signal and produce high reward will be strongly upweighted, and insofar as the model has sufficient understanding/situational awareness of the reward process and some reasonable level of goal-directedness, this upweighting could plausibly induce a policy explicitly optimizing the reward.
Another (more substantive) question. Again from section 2.1.2
In the validation set, we exclude data points where the diamond is there, the measurements are positive, but at least one of the measurements would have been positive if the diamond wasn’t there, since both diamond detectors and tampering detectors can be used to remove incentives to tamper with measurements. We keep them in the train set, and they account for 50% of the generated data.
Is this (just) because the agent would get rewarded for measurements reading that the diamond is present? I think I can imagine cases where agents are incentivized to tamper with measurements even when the diamond is present, to make the task of distinguishing tampering more difficult.
From section 2.1.2 of the paper (Emphasis mine)
We upsample code snippets such that the training dataset has 5 000 trusted data points, of which half are positive and half are negative, and 20000 untrusted data points, of which 10% are fake negatives, 40% are real positives, 35% are completely negative, and the other 15% are equally split between the 6 ways to have some but not all of the measurement be positive.
Is this a typo? (My understanding was that there are no fake negatives, i.e. no examples where the diamond is in the vault but all the measurements suggest the diamond is not in the vault. Also there are fake positives, which I believe are absent from this description.)
Thanks! Hadn't seen that
Under this definition of mechanistic anomaly detection, I agree pure distillation just seems better. But part of the hope of mechanistic anomaly detection is to reduce the false positive rate (and thus the alignment tax) by only flagging examples produced by different most-proximate reasons. In some sense this may be considered increasing the safe threshold for , such that mechanistic anomaly detection is worth it all things considered.
Is anyone aware of preliminary empirical work here? (Not including standard adversarial training)
I was thinking something like formal verification conditional on a mechanistic interpretation of a neuron/feature/subnetwork. Which yeah, isn't formal in the strongest sense, but could give you some guarantees that don't require full mechanistic understanding of how a model does a bad thing. Proving {feature B=b | feature A=a} requires mech-interp to semantically map feature B and feature A, but remains agnostic about the mechanism that guarantees {feature B=b | feature A=a}. (Though admittedly I'm struggling to come up with more concrete examples)
Curious on your thoughts on synergies between mechanistic interpretability and formal verification. One of the main problems in formal verification seems to be specification - how to define the safety properties in terms of input/output bounds. But if we use the abstractions/features learned by networks and discovered by mech-interp (rather than the raw input and output space), specification may be more tractable.
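As a toy illustration of what a feature-space specification might look like (my own sketch, with hypothetical probe directions; this is only an empirical check over sampled hidden states - actual formal verification would need to propagate bounds through the network to cover all inputs):

```python
import numpy as np

def check_feature_spec(hidden_states, w_a, tau_a, w_b, tau_b):
    """Spec: whenever probe A fires (feature A active), probe B should too.

    hidden_states: (n, d) activations; w_a, w_b: (d,) probe directions.
    """
    a_active = hidden_states @ w_a >= tau_a
    b_holds = hidden_states @ w_b >= tau_b
    violations = a_active & ~b_holds
    return violations.mean()  # fraction of A-active states violating the spec
```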
Briefly read a ChatGPT description of Transformer-XL - is this essentially long-term memory? Are there computations an LSTM could do that a Transformer-XL couldn't?